[jira] [Resolved] (FLINK-2034) Add vision and roadmap for ML library to docs
[ https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2034. -- Resolution: Fixed Added via b602b2ee1c9d130e97e844572f9827b29fbd9cf8 Add vision and roadmap for ML library to docs - Key: FLINK-2034 URL: https://issues.apache.org/jira/browse/FLINK-2034 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Labels: ML Fix For: 0.9 We should have a document describing the vision of the Machine Learning library in Flink and an up to date roadmap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-1992) Add convergence criterion to SGD optimizer
[ https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-1992. Resolution: Fixed Added via b3b6a9da0532d884f7d633530c11cda15aa6bc1b Add convergence criterion to SGD optimizer -- Key: FLINK-1992 URL: https://issues.apache.org/jira/browse/FLINK-1992 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Theodore Vasiloudis Priority: Minor Labels: ML Fix For: 0.9 Currently, Flink's SGD optimizer runs for a fixed number of iterations. It would be good to support a dynamic convergence criterion, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
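The dynamic convergence criterion described above can be sketched as a relative-improvement check on the loss between iterations. This is an illustrative sketch only, not FlinkML's actual SGD API; the class and method names are hypothetical.

```java
// Sketch of a dynamic convergence check for an iterative optimizer such as SGD.
// Names are illustrative, not FlinkML's actual API.
public class ConvergenceCheck {
    // Returns true when the relative improvement of the loss drops below `tol`.
    public static boolean hasConverged(double previousLoss, double currentLoss, double tol) {
        if (previousLoss == 0.0) {
            return currentLoss == 0.0;
        }
        return Math.abs(previousLoss - currentLoss) / Math.abs(previousLoss) < tol;
    }

    public static void main(String[] args) {
        double loss = 100.0;
        int iteration = 0;
        int maxIterations = 1000; // the fixed iteration count remains as an upper bound
        double previous;
        do {
            previous = loss;
            loss *= 0.5; // stand-in for one SGD epoch reducing the loss
            iteration++;
        } while (iteration < maxIterations && !hasConverged(previous, loss, 1e-6));
        System.out.println("stopped after " + iteration + " iterations");
    }
}
```

Note the fixed iteration limit is kept as a fallback, so the criterion only terminates the loop early.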
[jira] [Closed] (FLINK-2050) Add pipelining mechanism for chainable transformers and estimators
[ https://issues.apache.org/jira/browse/FLINK-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2050. Resolution: Fixed Added via fde0341fe16c7258e42f77e289a557157995830c Add pipelining mechanism for chainable transformers and estimators -- Key: FLINK-2050 URL: https://issues.apache.org/jira/browse/FLINK-2050 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Till Rohrmann Labels: ML Fix For: 0.9 The key concept of an easy-to-use ML library is the quick and simple construction of data analysis pipelines. Scikit-learn's approach of defining transformers and estimators seems to be a really good solution to this problem. I propose to follow a similar path, because it makes FlinkML flexible in terms of code reuse as well as easy for people coming from Scikit-learn to use FlinkML. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
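The chaining idea behind such pipelines can be sketched as a transformer whose composition is itself a transformer. This is a minimal illustration of the concept, not FlinkML's actual Transformer/Estimator interfaces.

```java
// Minimal sketch of chainable transformers in the spirit of scikit-learn
// pipelines. Names are illustrative only, not FlinkML's actual API.
interface Transformer<IN, OUT> {
    OUT transform(IN input);

    // Chaining two transformers yields another transformer, so arbitrarily
    // long pipelines compose from pairwise chaining.
    default <NEXT> Transformer<IN, NEXT> chain(Transformer<OUT, NEXT> next) {
        return input -> next.transform(this.transform(input));
    }
}
```

Because `chain` returns another `Transformer`, the pipeline type stays closed under composition, which is exactly what makes code reuse across pipeline stages possible.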
[jira] [Commented] (FLINK-1999) TF-IDF transformer
[ https://issues.apache.org/jira/browse/FLINK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558224#comment-14558224 ] Till Rohrmann commented on FLINK-1999: -- You can do it as you like, whichever is easier for you. TF-IDF transformer -- Key: FLINK-1999 URL: https://issues.apache.org/jira/browse/FLINK-1999 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Ronny Bräunlich Assignee: Alexander Alexandrov Priority: Minor Labels: ML Hello everybody, we are a group of three students from TU Berlin (I guess we're not the first group creating an issue) and we want to/have to implement a tf-idf transformer for Flink. Our lecturer Alexander told us that we could get some guidance here and that you could point us to an old version of a similar transformer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
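For readers unfamiliar with the transformer being requested, the underlying TF-IDF math can be sketched as follows. This is only the weighting computation over an in-memory corpus, independent of Flink's transformer interfaces; names are illustrative.

```java
import java.util.*;

// Minimal TF-IDF computation: tf(t, d) is the raw count of term t in document d,
// idf(t) = ln(numDocs / numDocsContaining(t)), and the weight is their product.
public class TfIdf {
    public static double idf(int numDocs, int docsContaining) {
        return Math.log((double) numDocs / docsContaining);
    }

    public static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) tf.merge(term, 1, Integer::sum);
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docsContaining = 0;
            for (List<String> d : corpus) {
                if (d.contains(e.getKey())) docsContaining++;
            }
            weights.put(e.getKey(), e.getValue() * idf(corpus.size(), docsContaining));
        }
        return weights;
    }
}
```

A term that appears in every document gets weight 0, since ln(1) = 0; rare terms are weighted up.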
[jira] [Resolved] (FLINK-2056) Add guide to create a chainable predictor in docs
[ https://issues.apache.org/jira/browse/FLINK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2056. -- Resolution: Fixed Added with 48e2cb5e8be7c4f305b947fb25ea7d312844e032 Add guide to create a chainable predictor in docs - Key: FLINK-2056 URL: https://issues.apache.org/jira/browse/FLINK-2056 Project: Flink Issue Type: Sub-task Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Minor Fix For: 0.9 The upcoming API for pipelines should have good documentation to guide and encourage the implementation of more algorithms. For this task we will create a guide that shows how the pipeline mechanism works through Scala implicits, and a full guide to implementing a chainable predictor, using Generalized Linear Models as an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2090) toString of CollectionInputFormat takes long time when the collection is huge
Till Rohrmann created FLINK-2090: Summary: toString of CollectionInputFormat takes long time when the collection is huge Key: FLINK-2090 URL: https://issues.apache.org/jira/browse/FLINK-2090 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Priority: Minor The {{toString}} method of {{CollectionInputFormat}} calls {{toString}} on its underlying {{Collection}}. Thus, {{toString}} is called for each element of the collection. If the {{Collection}} contains many elements or the individual {{toString}} calls for each element take a long time, then the string generation can take a considerable amount of time. [~mikiobraun] noticed this when he inserted several jBLAS matrices into Flink. The {{toString}} method is mainly used for logging statements in {{DataSourceNode}}'s {{computeOperatorSpecificDefaultEstimates}} method and in {{JobGraphGenerator.getDescriptionForUserCode}}. I'm wondering whether it is necessary to print the complete content of the underlying {{Collection}} or whether printing only the first 3 elements in the {{toString}} method would be enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
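The proposed fix, capping the number of rendered elements, can be sketched like this. The class name and cap constant are illustrative, not the actual patch.

```java
import java.util.*;

// Sketch of the proposed fix: render only the first few elements of a
// collection in toString instead of stringifying everything.
public class TruncatedToString {
    private static final int MAX_ELEMENTS = 3; // print only the first few elements

    public static String describe(Collection<?> data) {
        StringBuilder sb = new StringBuilder("[");
        int i = 0;
        for (Object element : data) {
            if (i >= MAX_ELEMENTS) {
                sb.append(", ...");
                break;
            }
            if (i > 0) sb.append(", ");
            sb.append(element);
            i++;
        }
        return sb.append(']').toString();
    }
}
```

This keeps string generation O(1) in the collection size, so the logging call sites mentioned above no longer pay per-element cost.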
[jira] [Closed] (FLINK-2053) Preregister ML types for Kryo serialization
[ https://issues.apache.org/jira/browse/FLINK-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2053. Resolution: Fixed Added with ae446388b91ecc0f08887da19400395b96b32f6c Preregister ML types for Kryo serialization --- Key: FLINK-2053 URL: https://issues.apache.org/jira/browse/FLINK-2053 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Till Rohrmann Labels: ML Fix For: 0.9 Currently, FlinkML uses interfaces and abstract types to implement generic algorithms. As a consequence we have to use Kryo to serialize the effective subtypes. In order to speed up the data transfer, it's necessary to preregister these types so that they are assigned fixed IDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2073) Add contribution guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2073: Assignee: Till Rohrmann Add contribution guide for FlinkML -- Key: FLINK-2073 URL: https://issues.apache.org/jira/browse/FLINK-2073 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 We need a guide for contributions to FlinkML in order to encourage the extension of the library, and provide guidelines for developers. One thing that should be included is a step-by-step guide to creating a transformer or other Estimator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2075) Shade akka and protobuf dependencies away
Till Rohrmann created FLINK-2075: Summary: Shade akka and protobuf dependencies away Key: FLINK-2075 URL: https://issues.apache.org/jira/browse/FLINK-2075 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Fix For: 0.9 Lately, the Zeppelin project encountered the following problem: It includes flink-runtime, which depends on akka_remote:2.3.7, which in turn depends on protobuf-java:2.5.0. However, Zeppelin set the protobuf-java version to 2.4.1 to make it build with YARN 2.2. Due to this, akka_remote picks up the wrong protobuf-java version and fails because of an incompatible change between these versions. I propose to shade Flink's akka dependency and protobuf dependency away, so that user projects depending on Flink are not forced to use a special akka/protobuf version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2035) Update 0.9 roadmap with ML issues
[ https://issues.apache.org/jira/browse/FLINK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2035. -- Resolution: Fixed Marked corresponding jira issues and added them to the google doc. Update 0.9 roadmap with ML issues - Key: FLINK-2035 URL: https://issues.apache.org/jira/browse/FLINK-2035 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Till Rohrmann Fix For: 0.9 The [current list|https://issues.apache.org/jira/browse/FLINK-2001?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%200.9%20AND%20component%20%3D%20%22Machine%20Learning%20Library%22] of issues linked with the 0.9 release is quite limited. We should go through the current ML issues and assign fix versions, so that we have a clear view of what we expect to have in 0.9. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2083) Ensure high quality docs for FlinkML in 0.9
[ https://issues.apache.org/jira/browse/FLINK-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2083. -- Resolution: Fixed Improved with b015a32f6126d759fc6dee90b78f90f7ff8dfbac Ensure high quality docs for FlinkML in 0.9 --- Key: FLINK-2083 URL: https://issues.apache.org/jira/browse/FLINK-2083 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Labels: ML Fix For: 0.9 As defined in our vision for FlinkML, providing high-quality documentation is a primary goal for us. This issue concerns the docs that will be included in 0.9, and will track improvements and additions for the release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1745: - Assignee: (was: Till Rohrmann) Add exact k-nearest-neighbours algorithm to machine learning library Key: FLINK-1745 URL: https://issues.apache.org/jira/browse/FLINK-1745 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Labels: ML, Starter Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, it is still used as a means to classify data and to do regression. This issue focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as proposed in [2]. Could be a starter task. Resources: [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
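The exact kNN computation at the core of the issue can be sketched as a brute-force distance scan; H-BNLJ/H-BRJ in [2] distribute exactly this computation with block-nested-loop and R-tree joins. The sketch below is a single-machine illustration, not the proposed distributed algorithm.

```java
import java.util.*;

// Brute-force exact kNN: compute all distances to the query point and keep
// the k closest. Single-machine sketch of the computation that H-BNLJ/H-BRJ
// parallelize.
public class BruteForceKnn {
    public static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Indices of the k training points closest to `query`.
    public static int[] kNearest(double[][] training, double[] query, int k) {
        Integer[] indices = new Integer[training.length];
        for (int i = 0; i < indices.length; i++) indices[i] = i;
        Arrays.sort(indices, Comparator.comparingDouble((Integer i) -> distance(training[i], query)));
        int[] result = new int[k];
        for (int i = 0; i < k; i++) result[i] = indices[i];
        return result;
    }
}
```

Classification then takes a majority vote over the k returned neighbours; regression averages their labels.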
[jira] [Closed] (FLINK-2521) Add automatic test name logging for tests
[ https://issues.apache.org/jira/browse/FLINK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2521. Resolution: Fixed Added via 2f0412f163f4d37605188c8cc763111e0b51f0dc Add automatic test name logging for tests - Key: FLINK-2521 URL: https://issues.apache.org/jira/browse/FLINK-2521 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Till Rohrmann Priority: Minor When running tests on travis the Flink components log to a file. This is helpful in case of a failed test to retrieve the error. However, the log does not contain the test name and the reason for the failure. Therefore it is difficult to find the log output which corresponds to the failed test. It would be nice to automatically add the test case information to the log. This would ease the debugging process big time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1994) Add different gain calculation schemes to SGD
[ https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701225#comment-14701225 ] Till Rohrmann commented on FLINK-1994: -- Sounds great [~rawkintrevo]. When it's ready, you should open a PR against Flink's master branch on github. Add different gain calculation schemes to SGD - Key: FLINK-1994 URL: https://issues.apache.org/jira/browse/FLINK-1994 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Trevor Grant Priority: Minor Labels: ML, Starter The current SGD implementation uses the formula {{stepsize/sqrt(iterationNumber)}} as the gain for the weight updates. It would be good to make the gain calculation configurable and to provide different strategies for that. For example: * stepsize/(1 + iterationNumber) * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4) See also how to properly select the gains [1]. Resources: [1] http://arxiv.org/pdf/1107.2490.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
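The gain strategies listed in the issue can be expressed directly as functions of the iteration number. The class and method names below are illustrative only; only the formulas come from the issue.

```java
// The gain calculation schemes from FLINK-1994, as functions of the
// iteration number t. Method names are illustrative, not FlinkML's API.
public class GainSchemes {
    // Current scheme: stepsize / sqrt(t)
    public static double sqrtDecay(double stepsize, int t) {
        return stepsize / Math.sqrt(t);
    }

    // Proposed: stepsize / (1 + t)
    public static double linearDecay(double stepsize, int t) {
        return stepsize / (1 + t);
    }

    // Proposed: stepsize * (1 + regularization * stepsize * t)^(-3/4)
    public static double regularizedDecay(double stepsize, double regularization, int t) {
        return stepsize * Math.pow(1 + regularization * stepsize * t, -0.75);
    }
}
```

Making the scheme configurable then amounts to accepting one such function (e.g. as a strategy interface) in the SGD solver instead of hard-coding the square-root decay.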
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702892#comment-14702892 ] Till Rohrmann commented on FLINK-2366: -- Do you mind taking the lead for the alternative HA service implementations [~sirinath19...@gmail.com]? HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703052#comment-14703052 ] Till Rohrmann commented on FLINK-2366: -- Great to hear [~sirinath19...@gmail.com] :-) If you take a look at the PR #1016 on github (https://github.com/apache/flink/pull/1016) you'll find the definition of the SPI. Most notably the {{LeaderElectionService}} and the {{LeaderRetrievalService}} interfaces will be of interest to you. Those are the services which you have to implement to add a new backend for HA. HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
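To give an idea of the shape of such an SPI, here is a rough illustration of leader election and retrieval services. These are NOT the real interfaces from PR #1016 (consult the PR for the actual signatures); all names and signatures below are hypothetical.

```java
import java.util.UUID;

// Hypothetical sketch of a leader election/retrieval SPI. Not the real
// Flink interfaces from PR #1016 -- names and signatures are illustrative.
interface LeaderContender {
    void grantLeadership(UUID leaderSessionId);
    void revokeLeadership();
}

interface LeaderElectionService {
    void start(LeaderContender contender);
    void stop();
}

// Trivial standalone backend: there is only one candidate, so leadership
// is granted immediately. A ZooKeeper backend would instead run an election.
class StandaloneLeaderElectionService implements LeaderElectionService {
    private final UUID sessionId = UUID.randomUUID();

    public void start(LeaderContender contender) {
        contender.grantLeadership(sessionId);
    }

    public void stop() {}
}
```

A new HA backend then boils down to providing implementations of the election service (who is leader) and a matching retrieval service (how clients and TaskManagers find the leader).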
[jira] [Comment Edited] (FLINK-2549) Add topK operator for DataSet
[ https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704565#comment-14704565 ] Till Rohrmann edited comment on FLINK-2549 at 8/20/15 9:17 AM: --- I agree with [~StephanEwen]. Sorting the complete input with n elements has a complexity of O(n * log(n)) whereas keeping the k top most elements in a priority queue gives you in worst case O(n * log(k)). Assuming k << n, then this is worth the effort. was (Author: till.rohrmann): I agree with [~StephanEwen]. Sorting the complete input with n elements has a complexity of O(n * log(n)) whereas keeping the k top most elements in a priority queue gives you in worst case O(n * log(k)). Assuming k << n, then this is worth the effort. Add topK operator for DataSet - Key: FLINK-2549 URL: https://issues.apache.org/jira/browse/FLINK-2549 Project: Flink Issue Type: New Feature Components: Core, Java API, Scala API Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor topK is a common operation for user, it would be great to have it in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2549) Add topK operator for DataSet
[ https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704565#comment-14704565 ] Till Rohrmann commented on FLINK-2549: -- I agree with [~StephanEwen]. Sorting the complete input with n elements has a complexity of O(n * log(n)) whereas keeping the k top most elements in a priority queue gives you in worst case O(n * log(k)). Assuming k << n, then this is worth the effort. Add topK operator for DataSet - Key: FLINK-2549 URL: https://issues.apache.org/jira/browse/FLINK-2549 Project: Flink Issue Type: New Feature Components: Core, Java API, Scala API Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor topK is a common operation for user, it would be great to have it in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
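The priority-queue approach discussed in the comment can be sketched as follows: keep a size-bounded min-heap of the k largest elements seen so far, evicting the smallest when a bigger element arrives. Class and method names are illustrative.

```java
import java.util.*;

// Top-k via a size-bounded min-heap: O(n * log(k)) instead of the
// O(n * log(n)) of a full sort. Illustrative sketch, not Flink's operator.
public class TopK {
    public static List<Integer> topK(Iterable<Integer> input, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int value : input) {
            if (heap.size() < k) {
                heap.offer(value);
            } else if (value > heap.peek()) {
                heap.poll();       // evict the smallest of the current top k
                heap.offer(value);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

The final sort only touches k elements, so it does not change the O(n * log(k)) bound for k << n.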
[jira] [Commented] (FLINK-2549) Add topK operator for DataSet
[ https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706744#comment-14706744 ] Till Rohrmann commented on FLINK-2549: -- It is required to implement a sample operator which works on Flink's managed memory. Add topK operator for DataSet - Key: FLINK-2549 URL: https://issues.apache.org/jira/browse/FLINK-2549 Project: Flink Issue Type: New Feature Components: Core, Java API, Scala API Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor topK is a common operation for user, it would be great to have it in Flink. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-1901. -- Resolution: Fixed Added via c9cfb17cb095def8b8ea0ed1b598fc78b890b874 A fixed size sample operator working on Flink's managed memory will be implemented with FLINK-2549. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative or exact size of the sample, set a seed for reproducibility, and support sampling within iterations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
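One classic way to meet the requirements listed in the issue (fixed-size sample, without replacement, seeded for reproducibility, single pass) is reservoir sampling. This is a sketch of the sampling math only, not Flink's actual sample operator; names are illustrative.

```java
import java.util.*;

// Reservoir sampling: a fixed-size sample without replacement in one pass
// over the data, with a seed for reproducibility.
public class Reservoir {
    public static <T> List<T> sample(Iterable<T> input, int k, long seed) {
        Random rnd = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(element); // fill the reservoir first
            } else {
                // Element i (0-based count `seen`) replaces a reservoir slot
                // with probability k/seen, keeping the sample uniform.
                int j = rnd.nextInt(seen);
                if (j < k) reservoir.set(j, element);
            }
        }
        return reservoir;
    }
}
```

Fixing the seed makes the sample reproducible across runs, which is what enables sampling within iterations to be replayed.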
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706360#comment-14706360 ] Till Rohrmann commented on FLINK-1745: -- Sorry for the late reply, but I want to review the PR first. Add exact k-nearest-neighbours algorithm to machine learning library Key: FLINK-1745 URL: https://issues.apache.org/jira/browse/FLINK-1745 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Labels: ML, Starter Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, it is still used as a means to classify data and to do regression. This issue focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as proposed in [2]. Could be a starter task. Resources: [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2544) Some test cases using PowerMock fail with Java 8u20
Till Rohrmann created FLINK-2544: Summary: Some test cases using PowerMock fail with Java 8u20 Key: FLINK-2544 URL: https://issues.apache.org/jira/browse/FLINK-2544 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Priority: Minor I observed that some of the test cases using {{PowerMockRunner}} fail with Java 8u20 with the following error: {code} java.lang.VerifyError: Bad <init> method call from inside of a branch Exception Details: Location: org/apache/flink/client/program/ClientTest$SuccessReturningActor.<init>()V @32: invokespecial Reason: Error exists in the bytecode Bytecode: 0x000: 2a4c 1214 b800 1a03 bd00 0d12 1bb8 001f 0x010: b800 254e 2db2 0029 a500 0e2a 01c0 002b 0x020: b700 2ea7 0009 2bb7 0030 0157 2a01 4c01 0x030: 4d01 4e2b 01a5 0008 2b4e a700 0912 32b8 0x040: 001a 4e2d 1234 03bd 000d 1236 b800 1f12 0x050: 32b8 003a 3a04 1904 b200 29a6 000a b800 0x060: 3c4d a700 0919 04c0 0011 4d2c b800 42b5 0x070: 0046 b1 Stackmap Table: full_frame(@38,{UninitializedThis,UninitializedThis,Top,Object[#13]},{}) full_frame(@44,{Object[#2],Object[#2],Top,Object[#13]},{}) full_frame(@61,{Object[#2],Null,Null,Null},{Object[#2]}) full_frame(@67,{Object[#2],Null,Null,Object[#15]},{Object[#2]}) full_frame(@101,{Object[#2],Null,Null,Object[#15],Object[#13]},{Object[#2]}) full_frame(@107,{Object[#2],Null,Object[#17],Object[#15],Object[#13]},{Object[#2]}) at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2658) at java.lang.Class.getConstructor0(Class.java:3062) at java.lang.Class.getDeclaredConstructor(Class.java:2165) at akka.util.Reflect$$anonfun$4.apply(Reflect.scala:86) at akka.util.Reflect$$anonfun$4.apply(Reflect.scala:86) at scala.util.Try$.apply(Try.scala:161) at akka.util.Reflect$.findConstructor(Reflect.scala:86) at akka.actor.NoArgsReflectConstructor.<init>(Props.scala:359) at akka.actor.IndirectActorProducer$.apply(Props.scala:308) at akka.actor.Props.producer(Props.scala:176) at 
akka.actor.Props.<init>(Props.scala:189) at akka.actor.Props$.create(Props.scala:99) at akka.actor.Props$.create(Props.scala:99) at akka.actor.Props.create(Props.scala) at org.apache.flink.client.program.ClientTest.shouldSubmitToJobClient(ClientTest.java:143) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:68) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:310) at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:88) at org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:96) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.executeTest(PowerMockJUnit44RunnerDelegateImpl.java:294) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTestInSuper(PowerMockJUnit47RunnerDelegateImpl.java:127) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTest(PowerMockJUnit47RunnerDelegateImpl.java:82) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runBeforesThenTestThenAfters(PowerMockJUnit44RunnerDelegateImpl.java:282) at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:86) at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:49) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.invokeTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:207) at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.runMethods(PowerMockJUnit44RunnerDelegateImpl.java:146) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$1.run(PowerMockJUnit44RunnerDelegateImpl.java:120) at org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:33) at org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:45) at org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.run(PowerMockJUnit44RunnerDelegateImpl.java:118) at
[jira] [Updated] (FLINK-2564) Failing Test: RandomSamplerTest
[ https://issues.apache.org/jira/browse/FLINK-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2564: - Labels: test-stability (was: ) Failing Test: RandomSamplerTest --- Key: FLINK-2564 URL: https://issues.apache.org/jira/browse/FLINK-2564 Project: Flink Issue Type: Bug Reporter: Matthias J. Sax Labels: test-stability {noformat} Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.943 sec FAILURE! - in org.apache.flink.api.java.sampling.RandomSamplerTest testPoissonSamplerFraction(org.apache.flink.api.java.sampling.RandomSamplerTest) Time elapsed: 0.017 sec FAILURE! java.lang.AssertionError: expected fraction: 0.01, result fraction: 0.011300 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.api.java.sampling.RandomSamplerTest.verifySamplerFraction(RandomSamplerTest.java:249) at org.apache.flink.api.java.sampling.RandomSamplerTest.testPoissonSamplerFraction(RandomSamplerTest.java:116) Results : Failed tests: Successfully installed excon-0.33.0 RandomSamplerTest.testPoissonSamplerFraction:116-verifySamplerFraction:249 expected fraction: 0.01, result fraction: 0.011300 {noformat} Full log: https://travis-ci.org/apache/flink/jobs/76720572 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
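The flakiness reported here is inherent to per-element random sampling: the realized fraction fluctuates around the requested fraction, so a tight assertion on it will occasionally fail. The sketch below uses a Bernoulli per-element decision as a stand-in for the Poisson sampler (which draws a Poisson-distributed count per element) to make the fluctuation concrete; names are illustrative.

```java
import java.util.Random;

// Per-element random sampling: each element is kept with probability
// `fraction`, so the realized fraction only approximates it. A Bernoulli
// stand-in for the Poisson sampler under test.
public class SamplerFractionDemo {
    public static double realizedFraction(int n, double fraction, long seed) {
        Random rnd = new Random(seed);
        int sampled = 0;
        for (int i = 0; i < n; i++) {
            if (rnd.nextDouble() < fraction) sampled++;
        }
        return (double) sampled / n;
    }
}
```

With n elements, the standard deviation of the realized fraction is roughly sqrt(fraction * (1 - fraction) / n), so deviations like the reported 0.0113 vs 0.01 are expected for moderate n; the fix for such tests is either a fixed seed or a tolerance sized to that deviation.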
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704520#comment-14704520 ] Till Rohrmann commented on FLINK-2366: -- I'll let you know when this happened :-) HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2066) Make delay between execution retries configurable
[ https://issues.apache.org/jira/browse/FLINK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2066: - Assignee: Nuno Miguel Marques dos Santos Make delay between execution retries configurable - Key: FLINK-2066 URL: https://issues.apache.org/jira/browse/FLINK-2066 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Nuno Miguel Marques dos Santos Labels: starter Flink allows to specify a delay between execution retries. This helps to let some external failure causes fully manifest themselves before the restart is attempted. The delay is currently defined only system wide. We should add it to the {{ExecutionConfig}} of a job to allow per-job specification. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2289) Make JobManager highly available
Till Rohrmann created FLINK-2289: Summary: Make JobManager highly available Key: FLINK-2289 URL: https://issues.apache.org/jira/browse/FLINK-2289 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Till Rohrmann Currently, the {{JobManager}} is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs. Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the Flink cluster can recover from a failed {{JobManager}}. As a first step towards this goal, I propose to make the {{JobManager}} highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job. In case the {{JobManager}} dies, one of the other running {{JobManager}} instances will be elected as the new leader and take over that role. The {{Client}} and the {{TaskManager}} will automatically detect the new {{JobManager}} by querying the ZooKeeper cluster. Note that this does not achieve full fault tolerance for the {{JobManager}}, but it allows the cluster to recover from a failed {{JobManager}}. The design of high availability for the {{JobManager}} is tracked in the wiki here [1]. Resources: [1] [https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2278) SparseVector created from Breeze Sparsevector has wrong size
[ https://issues.apache.org/jira/browse/FLINK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2278: Assignee: Till Rohrmann SparseVector created from Breeze Sparsevector has wrong size Key: FLINK-2278 URL: https://issues.apache.org/jira/browse/FLINK-2278 Project: Flink Issue Type: Bug Reporter: Christoph Alt Assignee: Till Rohrmann Labels: ML The following code doesn't return true when testing equality of two SparseVectors, one converted from a Breeze SparseVector. {code} val flinkVector = SparseVector.fromCOO(3, (1, 1.0), (2, 2.0)) val breezeVector = linalg.SparseVector(3)(1 -> 1.0, 2 -> 2.0) flinkVector.equalsVector(breezeVector.fromBreeze) {code} The reason is that *fromBreeze* takes the number of non-zero elements *SparseVector.used* as size when creating a SparseVector instead of the dimensionality *SparseVector.length*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2278) SparseVector created from Breeze Sparsevector has wrong size
[ https://issues.apache.org/jira/browse/FLINK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2278. -- Resolution: Fixed Fixed via bc3684e69aff1a73f7fb3a62b097e9fbb311cd71 SparseVector created from Breeze Sparsevector has wrong size Key: FLINK-2278 URL: https://issues.apache.org/jira/browse/FLINK-2278 Project: Flink Issue Type: Bug Reporter: Christoph Alt Labels: ML The following code doesn't return true when testing equality of two SparseVectors, one converted from a Breeze SparseVector. {code} val flinkVector = SparseVector.fromCOO(3, (1, 1.0), (2, 2.0)) val breezeVector = linalg.SparseVector(3)(1 -> 1.0, 2 -> 2.0) flinkVector.equalsVector(breezeVector.fromBreeze) {code} The reason is that *fromBreeze* takes the number of non-zero elements *SparseVector.used* as size when creating a SparseVector instead of the dimensionality *SparseVector.length*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
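The root cause of the bug boils down to confusing two different numbers a sparse vector carries: the count of stored non-zeros ("used") and the logical dimensionality ("length"). The toy COO vector below illustrates the distinction; it is not the Breeze or Flink implementation.

```java
// Toy COO sparse vector illustrating the root cause of FLINK-2278:
// `used` (number of stored non-zeros) is not the vector's dimensionality.
public class CooVector {
    final int length;      // logical dimensionality, e.g. 3
    final int[] indices;   // indices of the non-zero entries
    final double[] values; // the non-zero values

    CooVector(int length, int[] indices, double[] values) {
        this.length = length;
        this.indices = indices;
        this.values = values;
    }

    int used() { return values.length; } // what the broken conversion read
    int size() { return length; }        // what it should have read
}
```

For the vector in the bug report (dimension 3, non-zeros at indices 1 and 2), `used` is 2 while `size` is 3, which is exactly the mismatch the equality check tripped over.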
[jira] [Updated] (FLINK-1994) Add different gain calculation schemes to SGD
[ https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1994: - Assignee: Trevor Grant Add different gain calculation schemes to SGD - Key: FLINK-1994 URL: https://issues.apache.org/jira/browse/FLINK-1994 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Trevor Grant Priority: Minor Labels: ML, Starter The current SGD implementation uses the formula {{stepsize/sqrt(iterationNumber)}} as the gain for the weight updates. It would be good to make the gain calculation configurable and to provide different strategies for that. For example: * stepsize/(1 + iterationNumber) * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4) See also how to properly select the gains [1]. Resources: [1] http://arxiv.org/pdf/1107.2490.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1994) Add different gain calculation schemes to SGD
[ https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649172#comment-14649172 ] Till Rohrmann commented on FLINK-1994: -- Great to hear [~rawkintrevo]. I assigned the issue to you. Add different gain calculation schemes to SGD - Key: FLINK-1994 URL: https://issues.apache.org/jira/browse/FLINK-1994 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Trevor Grant Priority: Minor Labels: ML, Starter The current SGD implementation uses the formula {{stepsize/sqrt(iterationNumber)}} as the gain for the weight updates. It would be good to make the gain calculation configurable and to provide different strategies for that. For example: * stepsize/(1 + iterationNumber) * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4) See also how to properly select the gains [1]. Resources: [1] http://arxiv.org/pdf/1107.2490.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link
[ https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653209#comment-14653209 ] Till Rohrmann commented on FLINK-2478: -- Thanks for reporting the broken link Slim. Will fix it immediately. The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link - Key: FLINK-2478 URL: https://issues.apache.org/jira/browse/FLINK-2478 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 0.10 Reporter: Slim Baltagi Assignee: Till Rohrmann Priority: Minor Note that FlinkML is currently not part of the binary distribution. See linking with it for cluster execution here. 'here' links to a dead link: https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution The correct link is: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link
[ https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2478: Assignee: Till Rohrmann The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link - Key: FLINK-2478 URL: https://issues.apache.org/jira/browse/FLINK-2478 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 0.10 Reporter: Slim Baltagi Assignee: Till Rohrmann Priority: Minor Note that FlinkML is currently not part of the binary distribution. See linking with it for cluster execution here. 'here' links to a dead link: https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution The correct link is: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link
[ https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2478. -- Resolution: Fixed Fixed via 77b7471580ce9cada86e32c2b6919086ed2eb730 The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link - Key: FLINK-2478 URL: https://issues.apache.org/jira/browse/FLINK-2478 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 0.10 Reporter: Slim Baltagi Assignee: Till Rohrmann Priority: Minor Note that FlinkML is currently not part of the binary distribution. See linking with it for cluster execution here. 'here' links to a dead link: https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution The correct link is: https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2469) JobManager crashes on Cancel
[ https://issues.apache.org/jira/browse/FLINK-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2469. Resolution: Duplicate FLINK-2409 already covers the issue. JobManager crashes on Cancel Key: FLINK-2469 URL: https://issues.apache.org/jira/browse/FLINK-2469 Project: Flink Issue Type: Bug Reporter: Matthias J. Sax In local mode, the JobManager crashes if a job is canceled via the JobManager web frontend. The log shows the following error:
13:19:34,722 ERROR akka.actor.OneForOneStrategy - Received a message CancelJob(948b32f3fb3f5cbd542123e9aff14013) without a leader session ID, even though it requires to have one.
java.lang.Exception: Received a message CancelJob(948b32f3fb3f5cbd542123e9aff14013) without a leader session ID, even though it requires to have one.
	at org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Not sure if cluster mode is affected, too (could not try it out, but I would assume yes). The CliFrontend is *not* affected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2401) Replace ActorRefs with ActorGateway in web server
[ https://issues.apache.org/jira/browse/FLINK-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2401. Resolution: Duplicate See FLINK-2409 Replace ActorRefs with ActorGateway in web server - Key: FLINK-2401 URL: https://issues.apache.org/jira/browse/FLINK-2401 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Priority: Minor The web server is the only remaining component which uses {{ActorRefs}} directly to communicate with Flink actors. They should be replaced by {{ActorGateways}} which allow the automatic decoration of messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2409) Old JM web interface is sending cancel messages w/o leader ID
[ https://issues.apache.org/jira/browse/FLINK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2409. Resolution: Fixed Fixed via fab61a1954ff1554448e826e1d273689ed520fc3 Old JM web interface is sending cancel messages w/o leader ID - Key: FLINK-2409 URL: https://issues.apache.org/jira/browse/FLINK-2409 Project: Flink Issue Type: Bug Components: Webfrontend Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Till Rohrmann
{code}
12:29:41,877 ERROR akka.actor.OneForOneStrategy - Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even though it requires to have one.
java.lang.Exception: Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even though it requires to have one.
	at org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:29:41,879 INFO org.apache.flink.runtime.jobmanager.JobManager - Stopping JobManager akka://flink/user/jobmanager#-638215033.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2472) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive
Till Rohrmann created FLINK-2472: Summary: Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive Key: FLINK-2472 URL: https://issues.apache.org/jira/browse/FLINK-2472 Project: Flink Issue Type: Bug Reporter: Till Rohrmann In case the {{JobManager}} dies without notifying possibly connected {{JobClientActors}}, or if the job execution finishes without sending the {{SerializedJobExecutionResult}} back to the {{JobClientActor}}, it can happen that a {{JobClient.submitJobAndWait}} call never returns. I propose to let the {{JobClientActor}} periodically check whether the {{JobManager}} is still alive and whether the submitted job is still running. If not, the {{JobClientActor}} should return an exception to complete the waiting future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
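The proposed check can be sketched as a small watchdog that fails the pending result future when either condition no longer holds. This is a plain-Java illustration using a ScheduledExecutorService instead of the actor scheduler; the class name and the liveness probes are hypothetical stand-ins for the actual JobManager messages:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: periodically verify that the JobManager is alive and
 *  that the submitted job is still running; otherwise complete the waiting
 *  future exceptionally so that submitJobAndWait-style calls can return. */
class JobLivenessWatchdog {
    private final BooleanSupplier jobManagerAlive;   // probe, e.g. a heartbeat/ping
    private final BooleanSupplier jobStillRunning;   // probe, e.g. a job status query
    private final CompletableFuture<String> result;  // the result the client waits on

    JobLivenessWatchdog(BooleanSupplier jobManagerAlive,
                        BooleanSupplier jobStillRunning,
                        CompletableFuture<String> result) {
        this.jobManagerAlive = jobManagerAlive;
        this.jobStillRunning = jobStillRunning;
        this.result = result;
    }

    /** One check; in real code this would be scheduled at a fixed rate. */
    void checkOnce() {
        if (result.isDone()) {
            return; // job finished normally, nothing to do
        }
        if (!jobManagerAlive.getAsBoolean()) {
            result.completeExceptionally(new IllegalStateException("JobManager is gone"));
        } else if (!jobStillRunning.getAsBoolean()) {
            result.completeExceptionally(new IllegalStateException("Job is no longer running"));
        }
    }

    void start(ScheduledExecutorService timer, long periodMillis) {
        timer.scheduleAtFixedRate(this::checkOnce, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```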
[jira] [Updated] (FLINK-2472) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive
[ https://issues.apache.org/jira/browse/FLINK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2472: - Issue Type: Improvement (was: Bug) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive - Key: FLINK-2472 URL: https://issues.apache.org/jira/browse/FLINK-2472 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann In case the {{JobManager}} dies without notifying possibly connected {{JobClientActors}}, or if the job execution finishes without sending the {{SerializedJobExecutionResult}} back to the {{JobClientActor}}, it can happen that a {{JobClient.submitJobAndWait}} call never returns. I propose to let the {{JobClientActor}} periodically check whether the {{JobManager}} is still alive and whether the submitted job is still running. If not, the {{JobClientActor}} should return an exception to complete the waiting future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2504) ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously
[ https://issues.apache.org/jira/browse/FLINK-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681475#comment-14681475 ] Till Rohrmann commented on FLINK-2504: -- [~StephanEwen] the output of travis is all I got. There is apparently a problem with the watchdog script for my repository which prevented the uploading. [~sachingoel0101] your stack trace looks similar to FLINK-1455. ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously - Key: FLINK-2504 URL: https://issues.apache.org/jira/browse/FLINK-2504 Project: Flink Issue Type: Bug Reporter: Till Rohrmann The test {{ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed}} failed in one of my Travis builds: https://travis-ci.org/tillrohrmann/flink/jobs/74881883 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2504) ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously
Till Rohrmann created FLINK-2504: Summary: ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously Key: FLINK-2504 URL: https://issues.apache.org/jira/browse/FLINK-2504 Project: Flink Issue Type: Bug Reporter: Till Rohrmann The test {{ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed}} failed in one of my Travis builds: https://travis-ci.org/tillrohrmann/flink/jobs/74881883 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2521) Add automatic test name logging for tests
Till Rohrmann created FLINK-2521: Summary: Add automatic test name logging for tests Key: FLINK-2521 URL: https://issues.apache.org/jira/browse/FLINK-2521 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Till Rohrmann Priority: Minor When running tests on Travis, the Flink components log to a file. This is helpful for retrieving the error in case of a failed test. However, the log contains neither the test name nor the reason for the failure. Therefore, it is difficult to find the log output that corresponds to the failed test. It would be nice to automatically add the test case information to the log. This would ease the debugging process considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2518) Avoid predetermination of ports for network services
Till Rohrmann created FLINK-2518: Summary: Avoid predetermination of ports for network services Key: FLINK-2518 URL: https://issues.apache.org/jira/browse/FLINK-2518 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Fix For: 0.10 Some of Flink's network services use the {{NetUtils.getAvailablePort()}} to predetermine an available port for a service which is later started. This can lead to a race condition where two services have predetermined the same available port and later fail to instantiate because for one of them the port is already in use. This is, for example, the case for the {{NettyConnectionManager}} which is started after the {{TaskManager}} has registered at the {{JobManager}}. It would be better if we first start the network services with a random port, e.g. the {{NettyConnectionManager}}, and then send the bound port to the client. This will avoid problems like that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
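The fix described in the issue — start the service first, then report the actually bound port — can be illustrated with a plain ServerSocket. Binding to port 0 asks the OS to assign a free port atomically, so two services can never end up claiming the same port. This is a hedged, framework-free sketch, not the actual NettyConnectionManager code:

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Sketch: instead of predetermining a "free" port with a racy availability
 *  check, bind to port 0 so the OS assigns a free port atomically, and only
 *  then report the bound port to interested parties (e.g. the JobManager). */
class EphemeralPortService {
    private final ServerSocket socket;

    EphemeralPortService() throws IOException {
        this.socket = new ServerSocket(0); // 0 = let the OS choose a free port
    }

    /** Known only after a successful bind, so it cannot clash with another service. */
    int boundPort() {
        return socket.getLocalPort();
    }

    void close() throws IOException {
        socket.close();
    }
}
```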
[jira] [Assigned] (FLINK-2409) Old JM web interface is sending cancel messages w/o leader ID
[ https://issues.apache.org/jira/browse/FLINK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2409: Assignee: Till Rohrmann Old JM web interface is sending cancel messages w/o leader ID - Key: FLINK-2409 URL: https://issues.apache.org/jira/browse/FLINK-2409 Project: Flink Issue Type: Bug Components: Webfrontend Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Till Rohrmann
{code}
12:29:41,877 ERROR akka.actor.OneForOneStrategy - Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even though it requires to have one.
java.lang.Exception: Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even though it requires to have one.
	at org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
12:29:41,879 INFO org.apache.flink.runtime.jobmanager.JobManager - Stopping JobManager akka://flink/user/jobmanager#-638215033.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636451#comment-14636451 ] Till Rohrmann commented on FLINK-1901: -- If you use the sampling operator this way, it works. However, usually your iteration data set is something like the weight vector of your model, and you have another training dataset from which you want to take a small sample to update your weight vector in each iteration (e.g. SGD). When you write a program like that, you'll see that the output of the sampling operator is always the same (for every iteration). The reason is that the sampling is no longer on the dynamic path of the iteration and is thus calculated only once and then cached. This is not the intended behaviour, though. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635206#comment-14635206 ] Till Rohrmann commented on FLINK-1901: -- I think this solution is indeed a little bit too hacky. It would be very unintuitive for the user to have to broadcast the iteration {{DataSet}} to the sampling operator. Furthermore, this would incur unnecessary network I/O. I think we should try to solve this problem properly. This means that we have a single sampling operator which works inside and outside of iterations. This will also avoid code duplication. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages
[ https://issues.apache.org/jira/browse/FLINK-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2332. Resolution: Fixed Added via 45428518d0e1b843947a6184b4a803a78ad5 Assign session IDs to JobManager and TaskManager messages - Key: FLINK-2332 URL: https://issues.apache.org/jira/browse/FLINK-2332 Project: Flink Issue Type: Sub-task Components: JobManager, TaskManager Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 0.10 In order to support true high availability, the {{TaskManager}} and {{JobManager}} have to be able to distinguish whether a message was sent by the current leader or by a former leader. Messages which come from a former leader have to be discarded in order to guarantee a consistent state. A way to achieve this is to assign a leader session ID to a {{JobManager}} once it is elected leader. This leader session ID is sent to the {{TaskManager}} upon registration at the {{JobManager}}. All subsequent messages are then decorated with this leader session ID to mark them as valid. On the {{TaskManager}} side, the leader session ID received in response to the registration message can then be used to validate incoming messages. The same holds true for registration messages, which should carry a registration session ID, too. That way, it is possible to distinguish invalid registration messages from valid ones. The registration session ID can be assigned once the {{TaskManager}} is notified about the new leader. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
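The decoration/validation scheme described in the issue amounts to attaching a session ID to every message and discarding mismatches on receipt. A minimal, framework-free Java sketch of the validation side (the class is illustrative; in Flink this logic lives around the actor receive path):

```java
import java.util.UUID;

/** Illustrative sketch of leader-session-ID validation: every message is
 *  decorated with the sender's leader session ID, and the receiver discards
 *  messages whose ID does not match its current leader session. */
class LeaderSessionFilter {
    private final UUID currentLeaderSessionId;

    LeaderSessionFilter(UUID currentLeaderSessionId) {
        this.currentLeaderSessionId = currentLeaderSessionId;
    }

    /** True if the message comes from the current leader and may be processed. */
    boolean accept(UUID messageLeaderSessionId) {
        return currentLeaderSessionId.equals(messageLeaderSessionId);
    }
}
```

A message from a former leader carries that leader's stale session ID and is rejected, which is exactly the consistency guarantee the issue asks for.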
[jira] [Commented] (FLINK-2392) Instable test in flink-yarn-tests
[ https://issues.apache.org/jira/browse/FLINK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638985#comment-14638985 ] Till Rohrmann commented on FLINK-2392: -- The same happened to me: https://s3.amazonaws.com/archive.travis-ci.org/jobs/72303347/log.txt Instable test in flink-yarn-tests - Key: FLINK-2392 URL: https://issues.apache.org/jira/browse/FLINK-2392 Project: Flink Issue Type: Bug Components: Tests Reporter: Matthias J. Sax Priority: Minor The test YARNSessionFIFOITCase fails from time to time on an irregular basis. For example see: https://travis-ci.org/apache/flink/jobs/72019690 Tests run: 12, Failures: 1, Errors: 0, Skipped: 2, Time elapsed: 205.163 sec FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase perJobYarnClusterWithParallelism(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: 60.651 sec FAILURE! java.lang.AssertionError: During the timeout period of 60 seconds the expected string did not show up at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:478) at org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnClusterWithParallelism(YARNSessionFIFOITCase.java:435) Results : Failed tests: YARNSessionFIFOITCase.perJobYarnClusterWithParallelism:435-YarnTestBase.runWithArgs:478 During the timeout period of 60 seconds the expected string did not show up -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638429#comment-14638429 ] Till Rohrmann commented on FLINK-1901: -- That's a good idea to break down the task. Do you want to take the lead [~chengxiang li]? Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638448#comment-14638448 ] Till Rohrmann commented on FLINK-1901: -- Currently, the decision whether an operator is on a dynamic path is made by looking at the operator's inputs. If they are dynamic, so is the current operator. The iteration {{DataSets}} ({{WorksetPlaceHolder}}, {{SolutionSetPlaceHolder}} and {{PartialSolutionPlaceHolder}}) are always dynamic. One idea would be to allow other operators to be declared dynamic as well. That way, they can also start a dynamic path. Afterwards, we have to make sure that not only the iteration {{DataSets}} get an {{IterationHead}} prepended, which kicks off the iterations, but also all the other operators which start a dynamic path. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2395) Check Scala catch blocks for Throwable
Till Rohrmann created FLINK-2395: Summary: Check Scala catch blocks for Throwable Key: FLINK-2395 URL: https://issues.apache.org/jira/browse/FLINK-2395 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Priority: Minor As described in [1], it is not good practice to catch {{Throwable}} in Scala catch blocks because Scala uses some special exceptions for control flow. Therefore, we should check whether we can restrict ourselves to catching more specific subtypes of {{Throwable}}, such as {{Exception}}, instead. [1] https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
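In Java terms, the recommended pattern looks like the helper below: catch {{Exception}} for recoverable failures, but let anything else propagate ({{Error}} in Java; in Scala additionally the control-flow exceptions, for which Scala provides the {{scala.util.control.NonFatal}} extractor). The helper name is purely illustrative:

```java
/** Sketch: catch Exception (recoverable failures) but deliberately let other
 *  Throwables, such as Errors, propagate instead of swallowing them. */
class SafeRun {
    /** Runs the task; returns false if it threw an Exception. */
    static boolean tryRun(Runnable task) {
        try {
            task.run();
            return true;
        } catch (Exception e) { // deliberately NOT catch (Throwable t)
            return false;
        }
    }
}
```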
[jira] [Commented] (FLINK-1737) Add statistical whitening transformation to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624402#comment-14624402 ] Till Rohrmann commented on FLINK-1737: -- Great to hear [~Daniel Pape]. I assigned the ticket to you :-) Add statistical whitening transformation to machine learning library Key: FLINK-1737 URL: https://issues.apache.org/jira/browse/FLINK-1737 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Daniel Pape Labels: ML, Starter The statistical whitening transformation [1] is a preprocessing step for different ML algorithms. It decorrelates the individual dimensions and sets their variances to 1. Statistical whitening should be implemented as a {{Transformer}}. Resources: [1] [http://en.wikipedia.org/wiki/Whitening_transformation] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1737) Add statistical whitening transformation to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1737: - Assignee: Daniel Pape Add statistical whitening transformation to machine learning library Key: FLINK-1737 URL: https://issues.apache.org/jira/browse/FLINK-1737 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Daniel Pape Labels: ML, Starter The statistical whitening transformation [1] is a preprocessing step for different ML algorithms. It decorrelates the individual dimensions and sets their variances to 1. Statistical whitening should be implemented as a {{Transformer}}. Resources: [1] [http://en.wikipedia.org/wiki/Whitening_transformation] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1748) Integrate PageRank implementation into machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1748: - Priority: Minor (was: Major) Integrate PageRank implementation into machine learning library --- Key: FLINK-1748 URL: https://issues.apache.org/jira/browse/FLINK-1748 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Priority: Minor Labels: ML, Starter We already have an excellent approximative PageRank [1] implementation which has been contributed by [~StephanEwen]. Making this implementation part of the machine learning library would be a great addition. Resources: [1] [http://en.wikipedia.org/wiki/PageRank] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629505#comment-14629505 ] Till Rohrmann commented on FLINK-2366: -- But why can you not do the same with ZK? If you start a DB process next to your Flink cluster, then you can also start a ZK process, right? HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629507#comment-14629507 ] Till Rohrmann commented on FLINK-2366: -- But in the end, you will execute your jobs on a Flink cluster, right? Or what do you do with Flink as a library? HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629565#comment-14629565 ] Till Rohrmann commented on FLINK-2366: -- What exactly do you mean by embedded and distributed? If you use Flink's embedded mode, then every node works independently of the others. There is no way to make the embedded Flink instances work together. If you want to run Flink in a distributed manner, then you have to start a Flink cluster. And then you can also start a ZK on the same nodes, if there is none available. But usually, you want to run these kinds of things on highly reliable nodes and not in YARN containers, for example. At first glance, copycat seems to be usable for HA as well. It offers functionality similar to what we're currently using from ZK. It should not be a problem to implement a new {{LeaderElectionService}}/{{LeaderRetrievalService}} which uses copycat. However, copycat is still under heavy development and not recommended for production use. And you have the problem that your fault-tolerant value store runs on the same nodes as Flink. These nodes are not necessarily reliable. HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Improvement Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-1748) Integrate PageRank implementation into machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-1748. Resolution: Done As part of Gelly. Integrate PageRank implementation into machine learning library --- Key: FLINK-1748 URL: https://issues.apache.org/jira/browse/FLINK-1748 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Priority: Minor Labels: ML, Starter We already have an excellent approximate PageRank [1] implementation contributed by [~StephanEwen]. Making this implementation part of the machine learning library would be a great addition. Resources: [1] [http://en.wikipedia.org/wiki/PageRank] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2366) HA Without ZooKeeper
[ https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2366: - Priority: Minor (was: Blocker) HA Without ZooKeeper Key: FLINK-2366 URL: https://issues.apache.org/jira/browse/FLINK-2366 Project: Flink Issue Type: Bug Reporter: Suminda Dharmasena Priority: Minor Please provide a way to do HA without having to use ZK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634664#comment-14634664 ] Till Rohrmann commented on FLINK-1901: -- Hi Chengxiang, good to hear that you want to work on this. I can assign you the ticket. However, it is not only about the sampling strategy but also about the integration within Flink. The reason is that we have to make sure that the sampling operator also works within iterations. This means that it has to be part of the dynamic path so that it is re-executed in every iteration. This will require a special operator type. But you can start with the sampling strategies and then continue with the iteration integration. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
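The requirements in this ticket — sampling with or without replacement, with a seed for reproducibility — can be sketched independently of Flink's DataSet API. The following is a minimal, hypothetical Java illustration (class and method names are invented for the example; this is not Flink's actual sampling operator):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the two sampling modes the ticket asks for,
// using plain Java collections instead of Flink's DataSet API.
class SamplingSketch {

    // Sampling without replacement: classic reservoir sampling of k elements.
    // A fixed seed makes the result reproducible.
    static <T> List<T> sampleWithoutReplacement(List<T> data, int k, long seed) {
        Random rnd = new Random(seed);
        List<T> reservoir = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            if (reservoir.size() < k) {
                reservoir.add(data.get(i));
            } else {
                // Element i replaces a reservoir slot with probability k/(i+1).
                int j = rnd.nextInt(i + 1);
                if (j < k) {
                    reservoir.set(j, data.get(i));
                }
            }
        }
        return reservoir;
    }

    // Sampling with replacement: k independent uniform draws,
    // so the same element may appear multiple times.
    static <T> List<T> sampleWithReplacement(List<T> data, int k, long seed) {
        Random rnd = new Random(seed);
        List<T> sample = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            sample.add(data.get(rnd.nextInt(data.size())));
        }
        return sample;
    }
}
```

As the comment above notes, the harder part in Flink is not the strategy itself but making such an operator part of the dynamic path so it is re-evaluated in each iteration.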
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634800#comment-14634800 ] Till Rohrmann commented on FLINK-1901: -- Oh I forgot. Sorry. What about the iteration support? Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634826#comment-14634826 ] Till Rohrmann commented on FLINK-1901: -- To be honest, I doubt that the sampling is executed repeatedly if it's not the iteration data set from which you're sampling. If you use map and reduce operations which lie on the static path, then the results will be computed once and cached. But it's best to check the samples. If it is possible to create a separate PR out of it, that would be great. It makes reviewing much easier. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-1901: - Assignee: Chengxiang Li Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1901) Create sample operator for Dataset
[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635096#comment-14635096 ] Till Rohrmann commented on FLINK-1901: -- The problem is that a sampling operator should also work within iterations. There is definitely a big need for this, e.g. for stochastic gradient descent. I don't really understand what you mean by your question [~sachingoel0101]. Create sample operator for Dataset -- Key: FLINK-1901 URL: https://issues.apache.org/jira/browse/FLINK-1901 Project: Flink Issue Type: Improvement Components: Core Reporter: Theodore Vasiloudis Assignee: Chengxiang Li In order to be able to implement Stochastic Gradient Descent and a number of other machine learning algorithms we need to have a way to take a random sample from a Dataset. We need to be able to sample with or without replacement from the Dataset, choose the relative size of the sample, and set a seed for reproducibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2189) NullPointerException in MutableHashTable
[ https://issues.apache.org/jira/browse/FLINK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711620#comment-14711620 ] Till Rohrmann commented on FLINK-2189: -- [~JonathanH5] encountered this problem recently. NullPointerException in MutableHashTable Key: FLINK-2189 URL: https://issues.apache.org/jira/browse/FLINK-2189 Project: Flink Issue Type: Bug Components: Core Reporter: Till Rohrmann [~Felix Neutatz] reported a {{NullPointerException}} in the {{MutableHashTable}} when running the {{ALS}} algorithm. The stack trace is the following:
{code}
Caused by: java.lang.NullPointerException
	at org.apache.flink.runtime.operators.hash.HashPartition.spillPartition(HashPartition.java:310)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.spillPartition(MutableHashTable.java:1094)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.insertBucketEntry(MutableHashTable.java:927)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:783)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
	at org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
	at org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at java.lang.Thread.run(Thread.java:745)
{code}
He produced this error on his local machine with the following code:
{code}
implicit val env = ExecutionEnvironment.getExecutionEnvironment

val links = MovieLensUtils.readLinks(movieLensDir + "links.csv")
val movies = MovieLensUtils.readMovies(movieLensDir + "movies.csv")
val ratings = MovieLensUtils.readRatings(movieLensDir + "ratings.csv")
val tags = MovieLensUtils.readTags(movieLensDir + "tags.csv")

val ratingMatrix = ratings.map { r => (r.userId.toInt, r.movieId.toInt, r.rating) }
val testMatrix = ratings.map { r => (r.userId.toInt, r.movieId.toInt) }

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(150)

als.fit(ratingMatrix)
val result = als.predict(testMatrix)
result.print

val risk = als.empiricalRisk(ratingMatrix).collect().apply(0)
println("Empirical risk: " + risk)

env.execute()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1195) Improvement of benchmarking infrastructure
[ https://issues.apache.org/jira/browse/FLINK-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711604#comment-14711604 ] Till Rohrmann commented on FLINK-1195: -- I cannot really tell to what extent this PR is subsumed by [~mxm]'s testing infrastructure. But if that's the case, then this issue can be closed. Improvement of benchmarking infrastructure -- Key: FLINK-1195 URL: https://issues.apache.org/jira/browse/FLINK-1195 Project: Flink Issue Type: Wish Reporter: Till Rohrmann Assignee: Alexander Alexandrov I noticed while running my ALS benchmarks that we still have some potential to improve our benchmarking infrastructure. Currently, we execute the benchmark jobs by writing a script with a single set of parameters. The runtime is then manually retrieved from the web interface of Flink and Spark, respectively. I think we need the following extensions: * Automatic runtime retrieval and storage in a file * Repeated execution of jobs to gather advanced statistics such as the mean and standard deviation of the runtimes * Support for value sets for the individual parameters. Automatic runtime retrieval would allow us to execute several benchmarks consecutively without having to look up the runtimes in the logs or in the web interface, which, by the way, only stores the runtimes of the last 5 jobs. What I mean by value sets is that it would be nice to specify a set of parameter values for which the benchmark is run, without having to write a separate benchmark script for every parameter combination. I believe this feature would come in very handy when we want to look at the runtime behaviour of Flink for different input sizes or degrees of parallelism, for example. 
To illustrate what I mean:
{code}
INPUTSIZE = 1000, 2000, 4000, 8000
DOP = 1, 2, 4, 8
OUTPUT = benchmarkResults
repetitions = 10
command = benchmark.jar -p $DOP $INPUTSIZE
{code}
Something like that would execute the benchmark job with (DOP=1, INPUTSIZE=1000), (DOP=2, INPUTSIZE=2000), etc., 10 times each, calculate runtime statistics for each parameter combination, and store the results in the file benchmarkResults. I believe that spending some effort now will pay off in the long run because we will benchmark Flink continuously. What do you guys think? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
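The proposed driver could be sketched as a small shell script. This is a hypothetical illustration only: the file names, parameter sets, and repetition count are made up, the real benchmark invocation is replaced by an `echo`, and it expands the full cross product of the two parameter sets, which is one possible reading of the proposal (the ticket's example pairs the values instead).

```shell
#!/bin/sh
# Hypothetical sketch of the proposed benchmark driver: expand the cross
# product of two parameter sets, repeat each combination, and record one
# line per run in a results file. A real driver would wrap the actual
# benchmark command (e.g. benchmark.jar) and capture its reported runtime.
INPUTSIZES="1000 2000 4000 8000"
DOPS="1 2 4 8"
REPETITIONS=3
OUTPUT=benchmarkResults.txt

: > "$OUTPUT"   # truncate the results file
for dop in $DOPS; do
  for inputsize in $INPUTSIZES; do
    rep=1
    while [ "$rep" -le "$REPETITIONS" ]; do
      # Here one would run, e.g.: flink run benchmark.jar -p "$dop" "$inputsize"
      # and parse the runtime; we only record the parameter combination.
      echo "DOP=$dop INPUTSIZE=$inputsize run=$rep" >> "$OUTPUT"
      rep=$((rep + 1))
    done
  done
done
echo "wrote $(wc -l < "$OUTPUT") benchmark runs to $OUTPUT"
```

With per-run runtimes appended to each line, the mean and standard deviation per combination could then be computed in a post-processing step.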
[jira] [Resolved] (FLINK-2878) JobManager warns: Unexpected leader address pattern
[ https://issues.apache.org/jira/browse/FLINK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2878. -- Resolution: Fixed Fixed via 3cad56d28d55025281873f53a28ec27ce1027992 > JobManager warns: Unexpected leader address pattern > --- > > Key: FLINK-2878 > URL: https://issues.apache.org/jira/browse/FLINK-2878 > Project: Flink > Issue Type: Bug > Components: JobManager >Affects Versions: 0.10 >Reporter: Maximilian Michels >Assignee: Till Rohrmann >Priority: Minor > Fix For: 0.10 > > > The JobManager log shows this multiple times when viewing the log through the > web frontend: > {noformat} > 16:58:37,201 WARN > org.apache.flink.runtime.webmonitor.handlers.HandlerRedirectUtils - > Unexpected leader address pattern akka://flink/user/jobmanager. Cannot > extract host. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2921) Add online documentation of sample methods
Till Rohrmann created FLINK-2921: Summary: Add online documentation of sample methods Key: FLINK-2921 URL: https://issues.apache.org/jira/browse/FLINK-2921 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 0.10 Reporter: Till Rohrmann Priority: Minor I couldn't find online documentation about Flink's sampling API (as part of the {{DataSetUtils}}/{{utils}} package object). We should add information for these methods to our online documentation so that people can use them more easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2894) Flink does not allow to specify default serializer for Kryo
Till Rohrmann created FLINK-2894: Summary: Flink does not allow to specify default serializer for Kryo Key: FLINK-2894 URL: https://issues.apache.org/jira/browse/FLINK-2894 Project: Flink Issue Type: Bug Affects Versions: 0.10 Reporter: Till Rohrmann Currently, Flink only supports specifying a Kryo {{Serializer}} for specific types, but not a default serializer for classes. A default serializer is used for the registered class and all of its subclasses. That way, one does not have to specify the serializer for each type individually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
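What "default serializer" means here — a serializer registered for a base class that also applies to all of its subclasses — can be illustrated with a minimal, hypothetical registry in plain Java. This is neither Kryo's nor Flink's actual API; it only demonstrates the hierarchy-walking semantics the ticket describes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal illustration of default-serializer semantics: a serializer name
// registered for a base class applies to the class and all its subclasses.
class DefaultSerializerRegistry {
    private final Map<Class<?>, String> defaults = new LinkedHashMap<>();

    void addDefaultSerializer(Class<?> baseType, String serializerName) {
        defaults.put(baseType, serializerName);
    }

    // Walk the class hierarchy upwards until a registration is found;
    // fall back to a generic serializer when nothing matches.
    String serializerFor(Class<?> type) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            if (defaults.containsKey(c)) {
                return defaults.get(c);
            }
        }
        return "generic";
    }
}
```

With one registration for a base class, every subclass resolves to the same serializer, so no per-type configuration is needed.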
[jira] [Closed] (FLINK-2894) Flink does not allow to specify default serializer for Kryo
[ https://issues.apache.org/jira/browse/FLINK-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2894. Resolution: Not A Problem > Flink does not allow to specify default serializer for Kryo > --- > > Key: FLINK-2894 > URL: https://issues.apache.org/jira/browse/FLINK-2894 > Project: Flink > Issue Type: Bug >Affects Versions: 0.10 >Reporter: Till Rohrmann > > Currently, Flink only supports to specify Kryo {{Serializer}} for specific > types but not default serializer for classes. A default serializer is used > for the registered class and all its subclasses. That way one does not have > to specify the serializer for each type individually. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2800) kryo serialization problem
[ https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2800: Assignee: Till Rohrmann > kryo serialization problem > -- > > Key: FLINK-2800 > URL: https://issues.apache.org/jira/browse/FLINK-2800 > Project: Flink > Issue Type: Bug > Components: Type Serialization System >Affects Versions: 0.10 > Environment: linux ubuntu 12.04 LTS, Java 7 >Reporter: Stefano Bortoli >Assignee: Till Rohrmann > > Performing a cross of two dataset of POJOs I have got the exception below. > The first time I run the process, there was no problem. When I run it the > second time, I have got the exception. My guess is that it could be a race > condition related to the reuse of the Kryo serializer object. However, it > could also be "a bug where type registrations are not properly forwarded to > all Serializers", as suggested by Stephan. > > 2015-10-01 18:18:21 INFO JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at > main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 114 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752) > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30) > at > org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180) > at > org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111) > at > 
org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309) > at > org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162) > at > org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3036) Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph
Till Rohrmann created FLINK-3036: Summary: Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph Key: FLINK-3036 URL: https://issues.apache.org/jira/browse/FLINK-3036 Project: Flink Issue Type: Bug Components: Gelly Affects Versions: 0.10.0 Reporter: Till Rohrmann The Scala method {{Graph.fromCsvReader}} of Gelly returns a wrongly typed {{Graph}} instance. The problem is that no return type has been explicitly defined for the method. Additionally, the method returns fundamentally incompatible types depending on the given parameters. For example, the method can return a {{Graph[Long, Long, Long]}} if a vertex and edge file is specified (in this case with value type {{Long}}). If neither a vertex file nor a vertex value initializer is specified, then the return type is {{Graph[Long, NullValue, Long]}}. Since {{NullValue}} and {{Long}} have nothing in common, Scala's type inference infers that the {{fromCsvReader}} method must have a return type {{Graph[Long, t >: Long with NullValue, Long]}} with {{t}} being a supertype of {{Long with NullValue}}. This type is not useful at all, since there is no such type. As a consequence, the user has to cast the resulting {{Graph}} to either the type {{Graph[Long, NullValue, Long]}} or {{Graph[Long, Long, Long]}} if they want to do something more elaborate than just collecting the edges, for example. This can be especially confusing because one usually writes something like
{code}
val graph = Graph.fromCsvReader[Long, Double, Double](...)
graph.run(new PageRank(...))
{code}
and does not see that the type of {{graph}} is {{Graph[Long, t >: Double with NullValue, u >: Double with NullValue]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-3036) Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph
[ https://issues.apache.org/jira/browse/FLINK-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-3036: Assignee: Till Rohrmann > Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph > -- > > Key: FLINK-3036 > URL: https://issues.apache.org/jira/browse/FLINK-3036 > Project: Flink > Issue Type: Bug > Components: Gelly >Affects Versions: 0.10.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann > > The Scala method {{Graph.fromCsvReader}} of Gelly returns a wrongly typed > {{Graph}} instance. The problem is that no return type has been explicitly > defined for the method. Additionally, the method returns fundamentally > incompatible types depending on the given parameters. So for example, the > method can return a {{Graph[Long, Long, Long]}} if a vertex and edge file is > specified (in this case with value type {{Long}}). If the vertex file is not > specified and neither a vertex value initializer, then the return type is > {{Graph[Long, NullValue, Long]}}. Since {{NullValue}} and {{Long}} have > nothing in common, Scala's type inference infers that the {{fromCsvReader}} > method must have a return type {{Graph[Long, t >: Long with NullValue, > Long]}} with {{t}} being a supertype of {{Long with NullValue}}. This type is > not useful at all, since there is no such type. As a consequence, the user > has to cast the resulting {{Graph}} to have either the type {{Graph[Long, > NullValue, Long]}} or {{Graph[Long, Long, Long]}} if he wants to do something > more elaborate than just collecting the edges for example. > This can be especially confusing because one usually writes something like > {code} > val graph = Graph.fromCsvReader[Long, Double, Double](...) > graph.run(new PageRank(...)) > {code} > and does not see that the type of {{graph}} is {{Graph[Long, t >: Double with > NullValue, u >: Double with NullValue}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-2979) RollingSink does not work with Hadoop 2.7.1
[ https://issues.apache.org/jira/browse/FLINK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991725#comment-14991725 ] Till Rohrmann edited comment on FLINK-2979 at 11/5/15 2:29 PM: --- The failure might be caused by {code} java.lang.Exception: Could not restore checkpointed state to operators and functions at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:414) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.Exception: Failed to restore state to function: Could not invoke truncate. at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165) at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:406) ... 3 more Caused by: java.lang.RuntimeException: Could not invoke truncate. at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:695) at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:120) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162) ... 4 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:678) ... 
6 more Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to TRUNCATE_FILE /string-non-rolling-out/part-2-2 for DFSClient_NONMAPREDUCE_-401178409_229 on 127.0.0.1 because DFSClient_NONMAPREDUCE_-401178409_229 is already the current lease holder. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInternal(FSNamesystem.java:2082) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInt(FSNamesystem.java:2028) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncate(FSNamesystem.java:1998) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.truncate(NameNodeRpcServer.java:926) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.truncate(ClientNamenodeProtocolServerSideTranslatorPB.java:599) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) at org.apache.hadoop.ipc.Client.call(Client.java:1476) at org.apache.hadoop.ipc.Client.call(Client.java:1407) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy23.truncate(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.truncate(ClientNamenodeProtocolTranslatorPB.java:313) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy24.truncate(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.truncate(DFSClient.java:2024) at org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:689) at
[jira] [Created] (FLINK-2964) MutableHashTable fails when spilling partitions without overflow segments
Till Rohrmann created FLINK-2964: Summary: MutableHashTable fails when spilling partitions without overflow segments Key: FLINK-2964 URL: https://issues.apache.org/jira/browse/FLINK-2964 Project: Flink Issue Type: Bug Affects Versions: 0.10 Reporter: Till Rohrmann Assignee: Till Rohrmann Priority: Critical When one performs a join operation with many large records, the join operation fails with the following exception when it tries to spill a {{HashPartition}}. {code} java.lang.RuntimeException: Bug in Hybrid Hash Join: Request to spill a partition with less than two buffers. at org.apache.flink.runtime.operators.hash.HashPartition.spillPartition(HashPartition.java:302) at org.apache.flink.runtime.operators.hash.MutableHashTable.spillPartition(MutableHashTable.java:1108) at org.apache.flink.runtime.operators.hash.MutableHashTable.nextSegment(MutableHashTable.java:1277) at org.apache.flink.runtime.operators.hash.HashPartition$BuildSideBuffer.nextSegment(HashPartition.java:524) at org.apache.flink.runtime.memory.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140) at org.apache.flink.runtime.memory.AbstractPagedOutputView.write(AbstractPagedOutputView.java:201) at org.apache.flink.runtime.memory.AbstractPagedOutputView.write(AbstractPagedOutputView.java:178) at org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.serialize(BytePrimitiveArraySerializer.java:74) at org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.serialize(BytePrimitiveArraySerializer.java:30) at org.apache.flink.runtime.operators.hash.HashPartition.insertIntoBuildBuffer(HashPartition.java:257) at org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:856) at org.apache.flink.runtime.operators.hash.MutableHashTable.buildInitialTable(MutableHashTable.java:685) at org.apache.flink.runtime.operators.hash.MutableHashTable.open(MutableHashTable.java:443) at 
org.apache.flink.runtime.operators.hash.HashTableTest.testSpillingWhenBuildingTableWithoutOverflow(HashTableTest.java:234) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.junit.runner.JUnitCore.run(JUnitCore.java:160) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) {code} The reason is that the {{HashPartition}} does not include the number of used memory segments by the {{BuildSideBuffer}} when it 
counts the currently occupied memory segments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
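The failure mode described above can be illustrated with a small, hypothetical sketch (these are not Flink's actual classes): if a partition's occupied-segment count omits the segments held by its build-side write buffer, a partition that in fact holds several buffers can report fewer than two, tripping the spill-time sanity check that produced the exception above.

```java
// Hypothetical sketch of the counting bug. A partition is modeled as a pair:
// partition[0] = overflow segments tracked directly by the partition,
// partition[1] = segments held by its build-side write buffer.
public class SegmentCountSketch {

    static int overflowSegments(int[] partition) { return partition[0]; }

    static int buildSideSegments(int[] partition) { return partition[1]; }

    // Buggy count: ignores the segments held by the build-side buffer.
    static int buggyCount(int[] partition) {
        return overflowSegments(partition);
    }

    // Fixed count: includes both sources of occupied memory.
    static int fixedCount(int[] partition) {
        return overflowSegments(partition) + buildSideSegments(partition);
    }

    // Spilling a partition only makes sense if it occupies at least two buffers;
    // the hash table throws the "less than two buffers" error otherwise.
    static boolean spillable(int occupiedSegments) {
        return occupiedSegments >= 2;
    }

    public static void main(String[] args) {
        // No overflow segments, but three segments in the build-side buffer:
        int[] partition = {0, 3};
        System.out.println("buggy count spillable: " + spillable(buggyCount(partition)));
        System.out.println("fixed count spillable: " + spillable(fixedCount(partition)));
    }
}
```

With the buggy count, the partition above appears to hold zero buffers even though it actually occupies three, which is exactly the situation the test in the stack trace exercises.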
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985022#comment-14985022 ] Till Rohrmann commented on FLINK-2929: -- Sure, but to me it would make more sense to add the option for the less likely case, which is probably the upgrade case. > Recovery of jobs on cluster restarts > > > Key: FLINK-2929 > URL: https://issues.apache.org/jira/browse/FLINK-2929 > Project: Flink > Issue Type: Improvement >Affects Versions: 0.10 >Reporter: Ufuk Celebi > > Recovery information is stored in ZooKeeper under a static root like > {{/flink}}. In case of a cluster restart without canceling running jobs old > jobs will be recovered from ZooKeeper. > This can be confusing or helpful depending on the use case. > I suspect that the confusing case will be more common. > We can change the default cluster start up (e.g. new YARN session or new > ./start-cluster call) to purge all existing data in ZooKeeper and add a flag > to not do this if needed. > [~trohrm...@apache.org], [~aljoscha], [~StephanEwen] what's your opinion? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2959) Remove number of execution retries configuration from Environment
[ https://issues.apache.org/jira/browse/FLINK-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989289#comment-14989289 ] Till Rohrmann commented on FLINK-2959: -- I agree that we have to come up with a consistent way of putting methods in the {{ExecutionConfig}} or in the {{ExecutionEnvironment}}. But for {{0.10}} it should be fine. > Remove number of execution retries configuration from Environment > - > > Key: FLINK-2959 > URL: https://issues.apache.org/jira/browse/FLINK-2959 > Project: Flink > Issue Type: Improvement > Components: Java API, Scala API >Affects Versions: 0.10 >Reporter: Ufuk Celebi >Priority: Minor > Labels: api-breaking > Fix For: 0.10 > > > The number of execution retries is configured via the Environment, but all > other execution configuration happens exclusively via ExecutionConfig. > I think it will be more consistent to have it in ExecutionConfig only. > This will be an API breaking change. > What's your opinion? Should we move it to the ExecutionConfig? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
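The consolidation under discussion can be sketched as follows. This is an illustrative shape only, not Flink's actual {{ExecutionConfig}}: the retry count moves onto the single config object that already carries the other execution settings, using the same fluent-setter style.

```java
// Hypothetical sketch: the number of execution retries lives on the config
// object alongside the other execution settings, not on the environment.
public class ExecutionConfigSketch {

    // Sentinel meaning "fall back to the cluster-wide default".
    public static final int USE_CLUSTER_DEFAULT = -1;

    private int numberOfExecutionRetries = USE_CLUSTER_DEFAULT;

    // Fluent setter, matching the style of the other config methods.
    public ExecutionConfigSketch setNumberOfExecutionRetries(int retries) {
        if (retries < USE_CLUSTER_DEFAULT) {
            throw new IllegalArgumentException(
                "Number of execution retries must be non-negative, or -1 for the cluster default");
        }
        this.numberOfExecutionRetries = retries;
        return this;
    }

    public int getNumberOfExecutionRetries() {
        return numberOfExecutionRetries;
    }
}
```

Having a single home for such settings is what makes the change API-breaking: callers of the old environment-level method would have to switch to the config object.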
[jira] [Resolved] (FLINK-2763) Bug in Hybrid Hash Join: Request to spill a partition with less than two buffers.
[ https://issues.apache.org/jira/browse/FLINK-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2763. -- Resolution: Fixed I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check whether this fix solves your problem? > Bug in Hybrid Hash Join: Request to spill a partition with less than two > buffers. > - > > Key: FLINK-2763 > URL: https://issues.apache.org/jira/browse/FLINK-2763 > Project: Flink > Issue Type: Bug > Components: Distributed Runtime >Affects Versions: 0.10 >Reporter: Greg Hogan >Assignee: Stephan Ewen > Fix For: 0.10 > > > The following exception is thrown when running the example triangle listing > with an unmodified master build (4cadc3d6). > {noformat} > ./bin/flink run > ~/flink-examples/flink-java-examples/target/flink-java-examples-0.10-SNAPSHOT-EnumTrianglesOpt.jar > ~/rmat/undirected/s19_e8.ssv output > {noformat} > The only changes to {{flink-conf.yaml}} are {{taskmanager.numberOfTaskSlots: > 8}} and {{parallelism.default: 8}}. > I have confirmed with input files > [s19_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxR2lnMHR4amdyTnM/view?usp=sharing] > (40 MB) and > [s20_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxNi1HbmptU29MTm8/view?usp=sharing] > (83 MB). On a second machine only the larger file caused the exception. > {noformat} > org.apache.flink.client.program.ProgramInvocationException: The program > execution failed: Job execution failed. 
> at org.apache.flink.client.program.Client.runBlocking(Client.java:407) > at org.apache.flink.client.program.Client.runBlocking(Client.java:386) > at org.apache.flink.client.program.Client.runBlocking(Client.java:353) > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:64) > at > org.apache.flink.examples.java.graph.EnumTrianglesOpt.main(EnumTrianglesOpt.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:434) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:350) > at org.apache.flink.client.program.Client.runBlocking(Client.java:290) > at > org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:675) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:324) > at > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977) > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1027) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. 
> at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:425) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at >
[jira] [Comment Edited] (FLINK-2763) Bug in Hybrid Hash Join: Request to spill a partition with less than two buffers.
[ https://issues.apache.org/jira/browse/FLINK-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989297#comment-14989297 ] Till Rohrmann edited comment on FLINK-2763 at 11/4/15 10:43 AM: I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check whether the fix 76bebd4236cd9cff19e6442e9ab3d6113665924a solves your problem? was (Author: till.rohrmann): I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check whether this fix solves your problem? > Bug in Hybrid Hash Join: Request to spill a partition with less than two > buffers. > - > > Key: FLINK-2763 > URL: https://issues.apache.org/jira/browse/FLINK-2763 > Project: Flink > Issue Type: Bug > Components: Distributed Runtime >Affects Versions: 0.10 >Reporter: Greg Hogan >Assignee: Stephan Ewen > Fix For: 0.10 > > > The following exception is thrown when running the example triangle listing > with an unmodified master build (4cadc3d6). > {noformat} > ./bin/flink run > ~/flink-examples/flink-java-examples/target/flink-java-examples-0.10-SNAPSHOT-EnumTrianglesOpt.jar > ~/rmat/undirected/s19_e8.ssv output > {noformat} > The only changes to {{flink-conf.yaml}} are {{taskmanager.numberOfTaskSlots: > 8}} and {{parallelism.default: 8}}. > I have confirmed with input files > [s19_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxR2lnMHR4amdyTnM/view?usp=sharing] > (40 MB) and > [s20_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxNi1HbmptU29MTm8/view?usp=sharing] > (83 MB). On a second machine only the larger file caused the exception. > {noformat} > org.apache.flink.client.program.ProgramInvocationException: The program > execution failed: Job execution failed. 
> at org.apache.flink.client.program.Client.runBlocking(Client.java:407) > at org.apache.flink.client.program.Client.runBlocking(Client.java:386) > at org.apache.flink.client.program.Client.runBlocking(Client.java:353) > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:64) > at > org.apache.flink.examples.java.graph.EnumTrianglesOpt.main(EnumTrianglesOpt.java:125) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:434) > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:350) > at org.apache.flink.client.program.Client.runBlocking(Client.java:290) > at > org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:675) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:324) > at > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977) > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1027) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. 
> at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:425) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987374#comment-14987374 ] Till Rohrmann commented on FLINK-2929: -- We could auto-generate a random ZNode path for each cluster start. In case of a clean shutdown, this path could be removed unless it is explicitly set to be kept. When starting a new cluster, we could then add an option to start with a specific ZNode path in order to recover from it, e.g. in case of an upgrade. However, this has the disadvantage that the user would be responsible for cleaning up the state data when it's no longer needed. > Recovery of jobs on cluster restarts > > > Key: FLINK-2929 > URL: https://issues.apache.org/jira/browse/FLINK-2929 > Project: Flink > Issue Type: Improvement >Affects Versions: 0.10 >Reporter: Ufuk Celebi > > Recovery information is stored in ZooKeeper under a static root like > {{/flink}}. In case of a cluster restart without canceling running jobs old > jobs will be recovered from ZooKeeper. > This can be confusing or helpful depending on the use case. > I suspect that the confusing case will be more common. > We can change the default cluster start up (e.g. new YARN session or new > ./start-cluster call) to purge all existing data in ZooKeeper and add a flag > to not do this if needed. > [~trohrm...@apache.org], [~aljoscha], [~StephanEwen] what's your opinion? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
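The idea in the comment above can be sketched like this (a hypothetical helper, not Flink code): every cluster start gets a fresh random root path unless the user explicitly passes one to recover from.

```java
import java.util.UUID;

// Hypothetical sketch: choose the ZooKeeper root path for a cluster start.
// A configured path is reused (the recovery/upgrade case); otherwise a fresh
// random path is generated so old jobs are not accidentally recovered.
public class ZkRootPath {

    static String forClusterStart(String configuredPath) {
        if (configuredPath != null && !configuredPath.isEmpty()) {
            return configuredPath; // explicit path: recover existing state
        }
        return "/flink/" + UUID.randomUUID(); // fresh cluster: isolated state
    }
}
```

As noted in the comment, the trade-off is that abandoned random paths are never cleaned up automatically; the user would have to remove stale state data from ZooKeeper themselves.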
[jira] [Created] (FLINK-2852) Fix flaky ScalaShellITSuite
Till Rohrmann created FLINK-2852: Summary: Fix flaky ScalaShellITSuite Key: FLINK-2852 URL: https://issues.apache.org/jira/browse/FLINK-2852 Project: Flink Issue Type: Bug Components: Scala Shell Affects Versions: 0.10 Reporter: Till Rohrmann Assignee: Till Rohrmann Priority: Critical Fix For: 0.10 The {{ScalaShellITSuite}} checks the log output to determine whether a job has completed successfully. To do so, it looks for the string {{Job execution switched to status FINISHED}} in the log output. However, if the {{FlinkClientActor}} receives a {{JobResultSuccess}} message before the {{JobStatusChanged}} message, it sends the execution result back to the {{Client}} and terminates itself. As a consequence, the output will never contain the above-mentioned string. I propose to use a different means of checking whether a job has finished successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
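One race-free alternative to scanning log output can be sketched as follows (hypothetical classes, not the actual fix): complete a future exactly once with the job's outcome, regardless of whether {{JobResultSuccess}} or {{JobStatusChanged}} arrives first, and have the test wait on that future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: rather than grepping the log for
// "Job execution switched to status FINISHED" (which is racy when
// JobResultSuccess arrives before JobStatusChanged), the test waits on a
// future that is completed exactly once with the job's final outcome.
public class JobCompletion {

    private final CompletableFuture<Boolean> result = new CompletableFuture<>();

    // Called by whichever message arrives first; later calls are no-ops
    // because CompletableFuture.complete() only succeeds once.
    void jobFinished(boolean success) {
        result.complete(success);
    }

    boolean awaitSuccess(long timeout, TimeUnit unit) throws Exception {
        return result.get(timeout, unit);
    }
}
```

The test then asserts on the awaited result instead of on log contents, so message ordering inside the client actor no longer matters.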
[jira] [Commented] (FLINK-2735) KafkaProducerITCase.testCustomPartitioning sporadically fails
[ https://issues.apache.org/jira/browse/FLINK-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958775#comment-14958775 ] Till Rohrmann commented on FLINK-2735: -- Here is another instance of the problem: https://s3.amazonaws.com/archive.travis-ci.org/jobs/85472091/log.txt > KafkaProducerITCase.testCustomPartitioning sporadically fails > - > > Key: FLINK-2735 > URL: https://issues.apache.org/jira/browse/FLINK-2735 > Project: Flink > Issue Type: Bug > Components: Kafka Connector >Affects Versions: 0.10 >Reporter: Robert Metzger > Labels: test-stability > > In the following test run: > https://s3.amazonaws.com/archive.travis-ci.org/jobs/8158/log.txt > there was the following failure > {code} > Caused by: java.lang.Exception: Unable to get last offset for topic > customPartitioningTestTopic and partitions [FetchPartition {partition=2, > offset=-915623761776}]. > Exception for partition 2: kafka.common.UnknownException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at java.lang.Class.newInstance(Class.java:438) > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) > at kafka.common.ErrorMapping.exceptionFor(ErrorMapping.scala) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:521) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:242) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.run(FlinkKafkaConsumer.java:382) > at > 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58) > at > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:168) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Unable to get last offset for topic > customPartitioningTestTopic and partitions [FetchPartition {partition=2, > offset=-915623761776}]. > Exception for partition 2: kafka.common.UnknownException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:408) > at java.lang.Class.newInstance(Class.java:438) > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) > at kafka.common.ErrorMapping.exceptionFor(ErrorMapping.scala) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:521) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:524) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370) > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.455 sec > <<< FAILURE! - in > org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase > testCustomPartitioning(org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase) > Time elapsed: 7.809 sec <<< FAILURE! 
> java.lang.AssertionError: Test failed: The program execution failed: Job > execution failed. > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.flink.streaming.connectors.kafka.KafkaTestBase.tryExecute(KafkaTestBase.java:313) > at > org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase.testCustomPartitioning(KafkaProducerITCase.java:155) > {code} > From the broker logs it seems to be an issue in the Kafka broker > {code} > 14:43:03,328 INFO kafka.network.Processor >- Closing socket connection to /127.0.0.1. > 14:43:03,334 WARN kafka.server.KafkaApis >- [KafkaApi-0]
[jira] [Closed] (FLINK-2770) KafkaITCase.testConcurrentProducerConsumerTopology fails
[ https://issues.apache.org/jira/browse/FLINK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2770. Resolution: Duplicate > KafkaITCase.testConcurrentProducerConsumerTopology fails > > > Key: FLINK-2770 > URL: https://issues.apache.org/jira/browse/FLINK-2770 > Project: Flink > Issue Type: Bug >Affects Versions: 0.10 >Reporter: Matthias J. Sax >Priority: Critical > Fix For: 0.10 > > > https://travis-ci.org/mjsax/flink/jobs/82308003 > {noformat} > Running org.apache.flink.streaming.connectors.kafka.KafkaITCase > 09/26/2015 17:52:50 Job execution switched to status RUNNING. > 09/26/2015 17:52:50 Source: Custom Source -> Sink: Unnamed(1/1) switched to > SCHEDULED > 09/26/2015 17:52:50 Source: Custom Source -> Sink: Unnamed(1/1) switched to > DEPLOYING > 09/26/2015 17:52:50 Source: Custom Source -> Sink: Unnamed(1/1) switched to > RUNNING > 09/26/2015 17:52:50 Source: Custom Source -> Sink: Unnamed(1/1) switched to > FINISHED > 09/26/2015 17:52:50 Job execution switched to status FINISHED. > 09/26/2015 17:52:50 Job execution switched to status RUNNING. 
> 09/26/2015 17:52:50 Source: Custom Source -> Map -> Flat Map(1/1) switched > to SCHEDULED > 09/26/2015 17:52:50 Source: Custom Source -> Map -> Flat Map(1/1) switched > to DEPLOYING > 09/26/2015 17:52:50 Source: Custom Source -> Map -> Flat Map(1/1) switched > to RUNNING > 09/26/2015 17:52:51 Source: Custom Source -> Map -> Flat Map(1/1) switched > to FAILED > java.lang.Exception: Could not forward element to next operator > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:242) > at > org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.run(FlinkKafkaConsumer.java:382) > at > org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57) > at > org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:57) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:198) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:332) > at > org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:316) > at > org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:50) > at > org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:30) > at > org.apache.flink.streaming.runtime.tasks.SourceStreamTask$SourceOutput.collect(SourceStreamTask.java:106) > at > org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:92) > at > org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:449) > Caused by: java.lang.RuntimeException: Could not forward element to next > operator > at > 
org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:332) > at > org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:316) > at > org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:50) > at > org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:30) > at > org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:37) > at > org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:329) > ... 6 more > Caused by: > org.apache.flink.streaming.connectors.kafka.testutils.SuccessException > at > org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase$7.flatMap(KafkaConsumerTestBase.java:931) > at > org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase$7.flatMap(KafkaConsumerTestBase.java:911) > at > org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:47) > at > org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:329) > ... 11 more > 09/26/2015 17:52:51 Job execution switched to status FAILING. > 09/26/2015 17:52:51 Job execution switched to status FAILED. > Tests run: 12, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 80.981 sec > <<< FAILURE! - in org.apache.flink.streaming.connectors.kafka.KafkaITCase >
[jira] [Created] (FLINK-2854) KafkaITCase.testOneSourceMultiplePartitions failed on Travis
Till Rohrmann created FLINK-2854: Summary: KafkaITCase.testOneSourceMultiplePartitions failed on Travis Key: FLINK-2854 URL: https://issues.apache.org/jira/browse/FLINK-2854 Project: Flink Issue Type: Bug Components: Kafka Connector Affects Versions: 0.10 Reporter: Till Rohrmann Priority: Critical The {{KafkaITCase.testOneSourceMultiplePartitions}} failed on Travis with no output for 300s. https://s3.amazonaws.com/archive.travis-ci.org/jobs/85472083/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2858) Cannot build Flink Scala 2.11 with IntelliJ
Till Rohrmann created FLINK-2858: Summary: Cannot build Flink Scala 2.11 with IntelliJ Key: FLINK-2858 URL: https://issues.apache.org/jira/browse/FLINK-2858 Project: Flink Issue Type: Bug Affects Versions: 0.10 Reporter: Till Rohrmann If I activate the scala-2.11 profile from within IntelliJ (and thus deactivate the scala-2.10 profile) in order to build Flink with Scala 2.11, then Flink cannot be built. The problem is that some Scala macros cannot be expanded because they were compiled with the wrong version (I assume 2.10). This makes debugging tests with Scala 2.11 in IntelliJ impossible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2852) Fix flaky ScalaShellITSuite
[ https://issues.apache.org/jira/browse/FLINK-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964757#comment-14964757 ] Till Rohrmann commented on FLINK-2852: -- Hi [~sachingoel0101], you're right that the test is not robust. I've reworked and pushed a commit which should hopefully solve the problem. > Fix flaky ScalaShellITSuite > --- > > Key: FLINK-2852 > URL: https://issues.apache.org/jira/browse/FLINK-2852 > Project: Flink > Issue Type: Bug > Components: Scala Shell >Affects Versions: 0.10 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Labels: test-stability > Fix For: 0.10 > > > The {{ScalaShellITSuite}} checks the log output whether a job has successful > completed or not. For that to happen it checks for a {{Job execution switched > to status FINISHED}} string in the log output. However, if the > {{FlinkClientActor}} receives first a {{JobResultSuccess}} message before > receiving the {{JobStatusChanged}} message, then it will send the execution > result back to the {{Client}} and terminate itself. This has the consequence > that the output will never contain the above-mentioned string. > I propose to use a different mean to check whether a job has finished > successfully or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2852) Fix flaky ScalaShellITSuite
[ https://issues.apache.org/jira/browse/FLINK-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2852. Resolution: Fixed Fixed in 630798d > Fix flaky ScalaShellITSuite > --- > > Key: FLINK-2852 > URL: https://issues.apache.org/jira/browse/FLINK-2852 > Project: Flink > Issue Type: Bug > Components: Scala Shell >Affects Versions: 0.10 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Labels: test-stability > Fix For: 0.10 > > > The {{ScalaShellITSuite}} checks the log output whether a job has successful > completed or not. For that to happen it checks for a {{Job execution switched > to status FINISHED}} string in the log output. However, if the > {{FlinkClientActor}} receives first a {{JobResultSuccess}} message before > receiving the {{JobStatusChanged}} message, then it will send the execution > result back to the {{Client}} and terminate itself. This has the consequence > that the output will never contain the above-mentioned string. > I propose to use a different mean to check whether a job has finished > successfully or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2652) Failing PartitionRequestClientFactoryTest
[ https://issues.apache.org/jira/browse/FLINK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2652. -- Resolution: Fixed Fixed in b2339464 > Failing PartitionRequestClientFactoryTest > - > > Key: FLINK-2652 > URL: https://issues.apache.org/jira/browse/FLINK-2652 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Priority: Minor > Labels: test-stability > > PartitionRequestClientFactoryTest fails when running {{mvn > -Dhadoop.version=2.6.0 clean verify}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-2804) Support blocking job submission with Job Manager recovery
[ https://issues.apache.org/jira/browse/FLINK-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann closed FLINK-2804. Resolution: Fixed Added in d18f580 > Support blocking job submission with Job Manager recovery > - > > Key: FLINK-2804 > URL: https://issues.apache.org/jira/browse/FLINK-2804 > Project: Flink > Issue Type: Improvement >Affects Versions: 0.10 >Reporter: Ufuk Celebi >Assignee: Till Rohrmann >Priority: Minor > > Submitting a job in a blocking fashion with JobManager recovery and a failing > JobManager fails on the client side (the one submitting the job). The job > still continues to be recovered. > I propose to add simple support to re-retrieve the leading job manager and > update the client actor with it and then wait for the result as before. > As of the current standing in PR #1153 > (https://github.com/apache/flink/pull/1153) the job manager assumes that the > same actor is running and just keeps on sending execution state updates etc. > (if the listening behaviour is not detached). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
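The client-side behavior described above — re-retrieving the leading job manager after a failure and waiting again instead of failing the blocking submission — can be sketched like this. All names here are illustrative, not Flink's actual client API.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the client-side loop: if the job manager the client
// is waiting on fails, re-retrieve the current leader and wait again, up to a
// bounded number of leader changes, instead of failing the blocking call.
public class BlockingSubmissionSketch {

    interface Leader {
        // Returns the job result, or throws if this leader dies while we wait.
        String awaitJobResult(String jobId) throws Exception;
    }

    static String awaitWithReRetrieval(Supplier<Leader> leaderRetriever,
                                       String jobId,
                                       int maxLeaderChanges) throws Exception {
        Exception lastFailure = null;
        for (int i = 0; i <= maxLeaderChanges; i++) {
            Leader leader = leaderRetriever.get(); // (re-)retrieve current leader
            try {
                return leader.awaitJobResult(jobId);
            } catch (Exception e) {
                lastFailure = e; // leader died; the job keeps being recovered
            }
        }
        throw lastFailure;
    }
}
```

This keeps the blocking semantics intact across a JobManager failover: the recovered job is picked up by the new leader, and the client simply re-attaches and resumes waiting for the result.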
[jira] [Resolved] (FLINK-2792) Set log level of actor messages to TRACE
[ https://issues.apache.org/jira/browse/FLINK-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2792. -- Resolution: Fixed Fixed in 3aaee1e > Set log level of actor messages to TRACE > > > Key: FLINK-2792 > URL: https://issues.apache.org/jira/browse/FLINK-2792 > Project: Flink > Issue Type: Wish > Components: JobManager >Reporter: Ufuk Celebi >Priority: Trivial > > Logging of received job manager actor messages happens at log level DEBUG > right now. The used logger is that of the JobManager/TaskManager > respectively. This means that as soon as you debug something related to the > JobManager/TaskManager you are always flooded with a lot of debug messages. > Therefore, I would like to set the log level to TRACE for these messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (FLINK-2354) Recover running jobs on JobManager failure
[ https://issues.apache.org/jira/browse/FLINK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann resolved FLINK-2354. -- Resolution: Fixed Added in a6890b2 > Recover running jobs on JobManager failure > -- > > Key: FLINK-2354 > URL: https://issues.apache.org/jira/browse/FLINK-2354 > Project: Flink > Issue Type: Sub-task > Components: JobManager >Affects Versions: 0.10 >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 0.10 > > > tl;dr Persist JobGraphs in the state backend and coordinate the reference to the state > handle via ZooKeeper. > Problem: When running multiple JobManagers in high availability mode, the > leading job manager loses all running jobs when it fails. After a new > leading job manager is elected, it is not possible to recover any previously > running jobs. > Solution: The leading job manager, which receives the job graph, writes 1) the > job graph to a state backend, and 2) a reference to the respective state > handle to ZooKeeper. In general, job graphs can become large (multiple MBs, > because they include closures etc.). ZooKeeper is not designed for data of > this size. The level of indirection via the reference to the state backend > keeps the data in ZooKeeper small. > Proposed ZooKeeper layout: > /flink (default) > +- currentJobs > +- job id i > +- state handle reference of job graph i > The 'currentJobs' node needs to be persistent to allow recovery of jobs > between job managers. The currentJobs node needs to satisfy the following > invariant: There is a reference to a job graph with id i IFF the respective > job graph needs to be recovered by a newly elected job manager leader. > With this in place, jobs will be recovered from their initial state (as if > resubmitted). The next step is to back up the runtime state handles of > checkpoints in a similar manner. > --- > This work will be based on [~trohrm...@apache.org]'s implementation of > FLINK-2291. 
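The indirection the issue proposes can be sketched in plain Java. This is illustrative only: class and method names are invented, and two in-memory maps stand in for the real state backend and the ZooKeeper client.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch: the (potentially multi-MB) serialized job graph goes to a state
// backend, while "ZooKeeper" only stores a small state-handle reference
// under /flink/currentJobs/<jobId>. Hypothetical names, not Flink's classes.
public class JobGraphStoreSketch {
    // stands in for a file-system/state-backend store of serialized job graphs
    static final Map<String, byte[]> stateBackend = new HashMap<>();
    // stands in for ZooKeeper: path -> small state-handle reference
    static final Map<String, String> zooKeeper = new HashMap<>();

    static String submit(String jobId, byte[] serializedJobGraph) {
        String handle = UUID.randomUUID().toString();          // state handle reference
        stateBackend.put(handle, serializedJobGraph);          // 1) job graph to backend
        zooKeeper.put("/flink/currentJobs/" + jobId, handle);  // 2) small reference to ZK
        return handle;
    }

    static byte[] recover(String jobId) {
        String handle = zooKeeper.get("/flink/currentJobs/" + jobId);
        return handle == null ? null : stateBackend.get(handle);
    }

    static void markFinished(String jobId) {
        // invariant: a reference exists IFF the job still needs recovery
        String handle = zooKeeper.remove("/flink/currentJobs/" + jobId);
        if (handle != null) stateBackend.remove(handle);
    }

    public static void main(String[] args) {
        submit("job-1", new byte[]{1, 2, 3});
        System.out.println(recover("job-1").length); // graph never touched "ZooKeeper"
        markFinished("job-1");
        System.out.println(recover("job-1"));        // nothing left to recover
    }
}
```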
The leader election service notifies the job manager about > granted/revoked leadership. This notification happens via Akka and thus > serially *per* job manager, but results in eventually consistent state > between job managers. For brief windows of time it is possible for a new > leader to be granted leadership before the old one's leadership has been revoked. > [~trohrm...@apache.org], can you confirm that leadership does not guarantee > mutually exclusive access to the shared 'currentJobs' state? > For example, the following can happen: > - JM 1 is leader, JM 2 is standby > - JOB i is running (and hence /flink/currentJobs/i exists) > - ZK notifies leader election service (LES) of JM 1 and JM 2 > - LES 2 immediately notifies JM 2 about granted leadership, but the LES 1 > notification revoking leadership takes longer > - JOB i finishes (TMs don't notice the leadership change yet) and JM 1 receives the > final JobStatusChange > - JM 2 resubmits the job /flink/currentJobs/i > - JM 1 removes /flink/currentJobs/i, because it is now finished > => inconsistent state (wrt the specified invariant above) > If it is indeed a problem, we can circumvent this with a Curator recipe for > [shared locks|http://curator.apache.org/curator-recipes/shared-lock.html] to > coordinate access to currentJobs. The lock needs to be acquired on > leadership. > --- > Minimum required tests: > - Unit tests for job graph serialization and writing to the state backend and > ZooKeeper with the expected nodes > - Unit tests for job submission to a job manager in leader/non-leader state > - Unit tests for leadership granting/revoking and job submission/restarting > interleavings > - Process failure integration tests with single and multiple running jobs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
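The lock-on-leadership idea from the comment can be sketched as follows. A `ReentrantLock` stands in for Curator's distributed `InterProcessMutex` on a ZooKeeper path; in a real deployment the lock must be distributed, so this single-JVM sketch only shows the access pattern, not the recipe itself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: every mutation of the shared 'currentJobs' state happens under a
// lock acquired on leadership, so a newly granted leader and a not-yet-revoked
// old leader cannot interleave their resubmit/remove operations.
// Class and method names are illustrative only.
public class CurrentJobsGuard {
    private static final ReentrantLock currentJobsLock = new ReentrantLock();
    private static final ConcurrentHashMap<String, String> currentJobs = new ConcurrentHashMap<>();

    static void withLeadershipLock(Runnable action) {
        currentJobsLock.lock(); // InterProcessMutex.acquire() in the Curator version
        try {
            action.run();
        } finally {
            currentJobsLock.unlock(); // InterProcessMutex.release()
        }
    }

    static void resubmit(String jobId, String stateHandle) {
        withLeadershipLock(() -> currentJobs.put(jobId, stateHandle));
    }

    static void markFinished(String jobId) {
        withLeadershipLock(() -> currentJobs.remove(jobId));
    }

    static boolean needsRecovery(String jobId) {
        return currentJobs.containsKey(jobId);
    }
}
```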
[jira] [Commented] (FLINK-2883) Combinable reduce produces wrong result
[ https://issues.apache.org/jira/browse/FLINK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966537#comment-14966537 ] Till Rohrmann commented on FLINK-2883: -- Yes, documenting it sounds like a reasonable solution. This is probably also more of a corner case. > Combinable reduce produces wrong result > --- > > Key: FLINK-2883 > URL: https://issues.apache.org/jira/browse/FLINK-2883 > Project: Flink > Issue Type: Bug >Affects Versions: 0.10 >Reporter: Till Rohrmann > > If one uses a combinable reduce operation which also changes the key value of > the underlying data element, then the results of the reduce operation can > become wrong. The reason is that after the combine phase, another reduce > operator is executed which will then reduce the elements based on the new key > values. This might not be so surprising if one explicitly defined one's > {{GroupReduceOperation}} as combinable. However, the {{ReduceFunction}} > conceals the fact that a combiner is used implicitly. Furthermore, the API > does not prevent the user from changing the key fields; preventing this could solve the > problem. > The following example program illustrates the problem:
> {code}
> val env = ExecutionEnvironment.getExecutionEnvironment
> env.setParallelism(1)
> val input = env.fromElements((1,2), (1,3), (2,3), (3,3), (3,4))
> val result = input.groupBy(0).reduce{
>   (left, right) =>
>     (left._1 + right._1, left._2 + right._2)
> }
> result.output(new PrintingOutputFormat[Int]())
> env.execute()
> {code}
> The expected output is
> {code}
> (2, 5)
> (2, 3)
> (6, 7)
> {code}
> However, the actual output is
> {code}
> (4, 8)
> (6, 7)
> {code}
> I think that the underlying problem is that associativity and commutativity > are not sufficient for a combinable reduce operation. Additionally, we > need to make sure that the key stays the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
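The mechanics described in the issue can be reproduced without Flink: the combine phase already applies the reduce function once per original key, and the final reduce phase then groups the combined records by their (now changed) key field. A plain-Java simulation of the two phases (illustrative only, no Flink runtime involved):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Simulates Flink's combine + reduce phases for a reduce on field 0.
public class CombinableReduceSketch {
    // Groups records by field 0 and folds each group with f.
    static Map<Integer, int[]> reduceByFirstField(List<int[]> records, BinaryOperator<int[]> f) {
        Map<Integer, int[]> groups = new LinkedHashMap<>();
        for (int[] r : records) {
            groups.merge(r[0], r, f);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> input = List.of(new int[]{1, 2}, new int[]{1, 3},
                new int[]{2, 3}, new int[]{3, 3}, new int[]{3, 4});
        // The user's reduce function also changes the key field (left._1 + right._1).
        BinaryOperator<int[]> sum = (l, r) -> new int[]{l[0] + r[0], l[1] + r[1]};

        // combine phase: reduce per original key -> (2,5), (2,3), (6,7)
        Map<Integer, int[]> combined = reduceByFirstField(input, sum);
        // final reduce phase: groups the combined records by their NEW keys,
        // merging (2,5) and (2,3) into (4,8) -- the wrong result from the issue
        Map<Integer, int[]> finalResult =
                reduceByFirstField(List.copyOf(combined.values()), sum);

        finalResult.values().forEach(v -> System.out.println("(" + v[0] + ", " + v[1] + ")"));
        // prints (4, 8) then (6, 7) instead of the expected (2, 5), (2, 3), (6, 7)
    }
}
```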
[jira] [Commented] (FLINK-2858) Cannot build Flink Scala 2.11 with IntelliJ
[ https://issues.apache.org/jira/browse/FLINK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965087#comment-14965087 ] Till Rohrmann commented on FLINK-2858: -- Thanks for your help [~aalexandrov] :-) I did not know how to properly change the Scala version. I'm ok with changing the version via the shell script and documenting it on the web site. > Cannot build Flink Scala 2.11 with IntelliJ > --- > > Key: FLINK-2858 > URL: https://issues.apache.org/jira/browse/FLINK-2858 > Project: Flink > Issue Type: Bug > Components: Build System >Affects Versions: 0.10 >Reporter: Till Rohrmann > > If I activate the scala-2.11 profile from within IntelliJ (and thus > deactivate the scala-2.10 profile) in order to build Flink with Scala 2.11, > then Flink cannot be built. The problem is that some Scala macros cannot be > expanded because they were compiled with the wrong version (I assume 2.10). > This makes debugging tests with Scala 2.11 in IntelliJ impossible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2804) Support blocking job submission with Job Manager recovery
[ https://issues.apache.org/jira/browse/FLINK-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-2804: Assignee: Till Rohrmann (was: Ufuk Celebi) > Support blocking job submission with Job Manager recovery > - > > Key: FLINK-2804 > URL: https://issues.apache.org/jira/browse/FLINK-2804 > Project: Flink > Issue Type: Improvement >Affects Versions: 0.10 >Reporter: Ufuk Celebi >Assignee: Till Rohrmann >Priority: Minor > > Submitting a job in a blocking fashion with JobManager recovery and a failing > JobManager fails on the client side (the one submitting the job). The job > still continues to be recovered. > I propose to add simple support to re-retrieve the leading job manager and > update the client actor with it and then wait for the result as before. > As of the current standing in PR #1153 > (https://github.com/apache/flink/pull/1153) the job manager assumes that the > same actor is running and just keeps on sending execution state updates etc. > (if the listening behaviour is not detached). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2800) kryo serialization problem
[ https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944631#comment-14944631 ] Till Rohrmann commented on FLINK-2800: -- Hi [~stefano.bortoli], do you have a small code example to reproduce the problem? Or does it happen with any cross operation? > kryo serialization problem > -- > > Key: FLINK-2800 > URL: https://issues.apache.org/jira/browse/FLINK-2800 > Project: Flink > Issue Type: Bug > Components: Type Serialization System >Affects Versions: master > Environment: linux ubuntu 12.04 LTS, Java 7 >Reporter: Stefano Bortoli > > Performing a cross of two datasets of POJOs, I got the exception below. > The first time I ran the process, there was no problem. When I ran it the > second time, I got the exception. My guess is that it could be a race > condition related to the reuse of the Kryo serializer object. However, it > could also be "a bug where type registrations are not properly forwarded to > all Serializers", as suggested by Stephan. 
> > 2015-10-01 18:18:21 INFO JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at > main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 114 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752) > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30) > at > org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180) > at > org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111) > at > org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309) > at > org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162) > at > org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2800) kryo serialization problem
[ https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944659#comment-14944659 ] Till Rohrmann commented on FLINK-2800: -- If I'm not mistaken, this code example is not complete. It would be great if you could fill in the gaps, like {{POI2CDACombineGroupFunction}}, {{AddProxy2POIReduceGroupFunction}} and {{GetEntitonForClass}}, or distill the whole example down to something like
{code}
DataSet> input1 = env.createFromElements()
DataSet > input2 = env.createFromElements()
input1.cross(input2).print()
{code}
if this reproduces the problem for you. Thanks for your help. > kryo serialization problem > -- > > Key: FLINK-2800 > URL: https://issues.apache.org/jira/browse/FLINK-2800 > Project: Flink > Issue Type: Bug > Components: Type Serialization System >Affects Versions: master > Environment: linux ubuntu 12.04 LTS, Java 7 >Reporter: Stefano Bortoli > > Performing a cross of two datasets of POJOs, I got the exception below. > The first time I ran the process, there was no problem. When I ran it the > second time, I got the exception. My guess is that it could be a race > condition related to the reuse of the Kryo serializer object. However, it > could also be "a bug where type registrations are not properly forwarded to > all Serializers", as suggested by Stephan. 
> > 2015-10-01 18:18:21 INFO JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at > main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 114 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752) > at > org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127) > at > org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30) > at > org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180) > at > org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111) > at > org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309) > at > org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162) > at > org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2790) Add high availability support for Yarn
[ https://issues.apache.org/jira/browse/FLINK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944675#comment-14944675 ] Till Rohrmann commented on FLINK-2790: -- What did the logs say? I will try to reproduce it. > Add high availability support for Yarn > -- > > Key: FLINK-2790 > URL: https://issues.apache.org/jira/browse/FLINK-2790 > Project: Flink > Issue Type: Sub-task > Components: JobManager, TaskManager >Reporter: Till Rohrmann > Fix For: 0.10 > > > Add master high availability support for Yarn. The idea is to let Yarn > restart a failed application master in a new container. For that, we set the > number of application retries to something greater than 1. > From version 2.4.0 onwards, it is possible to reuse already started > containers for the TaskManagers, thus avoiding unnecessary restart delays. > From version 2.6.0 onwards, it is possible to specify an interval within which > the maximum number of application attempts has to be exceeded in order to fail the > job. This will prevent long-running jobs from eventually depleting all > available application attempts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
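Configuration-wise, the retry count described above could look like the following fragment. The key name `yarn.application-attempts` is an assumption based on the Flink YARN documentation of that era; verify it against the docs for your version, and note that YARN caps the value with the cluster-wide `yarn.resourcemanager.am.max-attempts`.

```yaml
# flink-conf.yaml (assumed key name): ask YARN to restart a failed
# ApplicationMaster instead of failing the application on the first crash.
# Effective only up to the cluster limit yarn.resourcemanager.am.max-attempts.
yarn.application-attempts: 10
```

The Hadoop 2.6 attempt-failure validity interval mentioned in the issue is not a plain config key; it is set programmatically on the application submission context by the client.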
[jira] [Created] (FLINK-2329) Refactor RPCs from within the ExecutionGraph
Till Rohrmann created FLINK-2329: Summary: Refactor RPCs from within the ExecutionGraph Key: FLINK-2329 URL: https://issues.apache.org/jira/browse/FLINK-2329 Project: Flink Issue Type: Sub-task Reporter: Till Rohrmann Assignee: Till Rohrmann Currently, we store an {{ActorRef}} of the TaskManager into an {{Instance}} object. This {{ActorRef}} is used from within {{Executions}} to interact with the {{TaskManager}}. This is not a nice abstraction since it does not hide implementation details. Since we need to add a leader session ID to messages sent by the {{Executions}} in order to support high availability, we would need to make the leader session ID available to the {{Execution}}. A better solution seems to be to replace the direct {{ActorRef}} interaction with an instance gateway abstraction which encapsulates the communication logic. Having such an abstraction, it will be easy to decorate messages transparently with a leader session ID. Therefore, I propose to refactor the current {{Instance}} communication and to introduce an {{InstanceGateway}} abstraction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
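The proposed gateway abstraction can be sketched as follows. The interface name `InstanceGateway` comes from the issue; everything else (the decorating implementation, the message strings, the list standing in for an `ActorRef`) is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch: Executions talk to a gateway instead of holding a raw ActorRef,
// so the communication logic can decorate every outgoing message
// transparently, e.g. with a leader session ID.
public class InstanceGatewaySketch {
    interface InstanceGateway {
        void tell(Object message);
    }

    // Wraps the transport (an ActorRef in the real system) and attaches the
    // leader session ID without the caller ever seeing it.
    static class LeaderSessionGateway implements InstanceGateway {
        final UUID leaderSessionId;
        final List<String> sentMessages = new ArrayList<>(); // stands in for the actor

        LeaderSessionGateway(UUID leaderSessionId) {
            this.leaderSessionId = leaderSessionId;
        }

        @Override
        public void tell(Object message) {
            // the decoration happens here, hidden behind the abstraction
            sentMessages.add("[session " + leaderSessionId + "] " + message);
        }
    }

    public static void main(String[] args) {
        InstanceGateway gateway = new LeaderSessionGateway(UUID.randomUUID());
        gateway.tell("SubmitTask(task-1)"); // an Execution just sends; no session ID in sight
        System.out.println(((LeaderSessionGateway) gateway).sentMessages.get(0));
    }
}
```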
[jira] [Created] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages
Till Rohrmann created FLINK-2332: Summary: Assign session IDs to JobManager and TaskManager messages Key: FLINK-2332 URL: https://issues.apache.org/jira/browse/FLINK-2332 Project: Flink Issue Type: Sub-task Reporter: Till Rohrmann Assignee: Till Rohrmann In order to support true high availability, {{TaskManager}} and {{JobManager}} have to be able to distinguish whether a message was sent by the current leader or by a former leader. Messages which come from a former leader have to be discarded in order to guarantee a consistent state. A way to achieve this is to assign a leader session ID to a {{JobManager}} once it is elected leader. This leader session ID is sent to the {{TaskManager}} upon registration at the {{JobManager}}. All subsequent messages should then be decorated with this leader session ID to mark them as valid. On the {{TaskManager}} side, the leader session ID received in response to the registration message can then be used to validate incoming messages. The same holds true for registration messages, which should carry a registration session ID, too. That way, it is possible to distinguish invalid registration messages from valid ones. The registration session ID can be assigned once the TaskManager is notified about the new leader. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
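The validation step on the receiving side can be sketched in a few lines. All names here are illustrative; the real implementation lives in Flink's actor layer.

```java
import java.util.Objects;
import java.util.UUID;

// Sketch: a TaskManager remembers the leader session ID it received at
// registration and drops any message carrying a different (i.e. a former
// leader's) ID. Hypothetical names, not Flink's classes.
public class LeaderSessionFilter {
    static class Envelope {
        final UUID leaderSessionId;
        final Object payload;

        Envelope(UUID leaderSessionId, Object payload) {
            this.leaderSessionId = leaderSessionId;
            this.payload = payload;
        }
    }

    private final UUID expectedSessionId; // learned from the registration response

    LeaderSessionFilter(UUID expectedSessionId) {
        this.expectedSessionId = expectedSessionId;
    }

    // true -> deliver the payload; false -> discard the stale message
    boolean accept(Envelope envelope) {
        return Objects.equals(expectedSessionId, envelope.leaderSessionId);
    }
}
```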
[jira] [Updated] (FLINK-2162) Implement adaptive learning rate strategies for SGD
[ https://issues.apache.org/jira/browse/FLINK-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann updated FLINK-2162: - Assignee: Ventura Del Monte Implement adaptive learning rate strategies for SGD --- Key: FLINK-2162 URL: https://issues.apache.org/jira/browse/FLINK-2162 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Ventura Del Monte Priority: Minor Labels: ML At the moment, the SGD implementation uses a simple adaptive learning rate strategy, {{adaptedLearningRate = initialLearningRate/sqrt(iterationNumber)}}, which makes the optimization algorithm sensitive to the setting of the {{initialLearningRate}}. If this value is chosen wrongly, the SGD might become unstable. There are better ways to calculate the learning rate [1], such as Adagrad [3], Adadelta [4], SGD with momentum [5], and others [2]. They promise to result in more stable optimization algorithms which don't require so much hyperparameter tweaking. It might be worthwhile to investigate these approaches. It might also be interesting to look at the implementation of vowpal wabbit [6]. Resources: [1] [http://imgur.com/a/Hqolp] [2] [http://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html] [3] [http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf] [4] [http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf] [5] [http://www.willamette.edu/~gorr/classes/cs449/momrate.html] [6] [https://github.com/JohnLangford/vowpal_wabbit] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
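As a concrete reference point, the Adagrad rule mentioned in the issue can be sketched on a toy one-dimensional quadratic. This is a minimal illustration of the update rule only, not FlinkML code: per coordinate, Adagrad divides a base rate by the root of the accumulated squared gradients, so the effective step size adapts without hand-tuning a decay schedule.

```java
// Minimal Adagrad sketch on f(w) = (w - target)^2.
public class AdagradSketch {
    static double minimizeQuadratic(double w0, double target, double eta, int steps) {
        double w = w0;
        double gradSquaredSum = 0.0;
        final double eps = 1e-8; // avoids division by zero on the first step
        for (int i = 0; i < steps; i++) {
            double grad = 2.0 * (w - target);                     // gradient of (w - target)^2
            gradSquaredSum += grad * grad;                        // accumulate squared gradients
            double lr = eta / (Math.sqrt(gradSquaredSum) + eps);  // adaptive learning rate
            w -= lr * grad;
        }
        return w;
    }

    public static void main(String[] args) {
        // converges to the minimum at 3.0 without tuning a decay schedule
        System.out.println(minimizeQuadratic(0.0, 3.0, 1.0, 500));
    }
}
```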