[jira] [Resolved] (FLINK-2034) Add vision and roadmap for ML library to docs

2015-05-22 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2034.
--
Resolution: Fixed

Added via b602b2ee1c9d130e97e844572f9827b29fbd9cf8

 Add vision and roadmap for ML library to docs
 -

 Key: FLINK-2034
 URL: https://issues.apache.org/jira/browse/FLINK-2034
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
  Labels: ML
 Fix For: 0.9


 We should have a document describing the vision of the Machine Learning 
 library in Flink and an up-to-date roadmap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-22 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-1992.

Resolution: Fixed

Added via b3b6a9da0532d884f7d633530c11cda15aa6bc1b

 Add convergence criterion to SGD optimizer
 --

 Key: FLINK-1992
 URL: https://issues.apache.org/jira/browse/FLINK-1992
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.9


 Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
 would be good to support a dynamic convergence criterion, too.
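A dynamic convergence criterion of the kind proposed here can be sketched as follows. This is a hypothetical illustration; the class name, the relative-change criterion, and the tolerance handling are assumptions, not Flink's actual implementation:

```java
// Hypothetical sketch of a relative-change convergence check for SGD.
// The driver loop stops early once the loss improvement falls below a tolerance.
public class ConvergenceCheck {
    public static boolean hasConverged(double previousLoss, double currentLoss, double tolerance) {
        // Relative improvement of the objective between two iterations;
        // the small constant guards against division by zero.
        double relativeChange = Math.abs(previousLoss - currentLoss)
                / Math.max(Math.abs(previousLoss), 1e-12);
        return relativeChange < tolerance;
    }
}
```

An SGD driver would evaluate such a check after each iteration and terminate either on convergence or after the fixed maximum number of iterations.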





[jira] [Closed] (FLINK-2050) Add pipelining mechanism for chainable transformers and estimators

2015-05-22 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2050.

Resolution: Fixed

Added via fde0341fe16c7258e42f77e289a557157995830c

 Add pipelining mechanism for chainable transformers and estimators
 --

 Key: FLINK-2050
 URL: https://issues.apache.org/jira/browse/FLINK-2050
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Till Rohrmann
  Labels: ML
 Fix For: 0.9


 The key concept of an easy-to-use ML library is the quick and simple 
 construction of data analysis pipelines. Scikit-learn's approach of defining 
 transformers and estimators seems to be a really good solution to this 
 problem. I propose to follow a similar path, because it makes FlinkML 
 flexible in terms of code reuse and makes it easy for people coming from 
 Scikit-learn to use FlinkML.
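The scikit-learn-style chaining described above can be sketched as follows. FlinkML's actual mechanism is built on Scala implicits, so this Java fragment is only a hypothetical illustration of the chaining idea, with assumed names:

```java
// Hypothetical sketch of chainable transformers: chaining two transformers
// yields a new transformer that applies both in sequence.
interface Transformer<IN, OUT> {
    OUT transform(IN input);

    default <NEXT> Transformer<IN, NEXT> chain(Transformer<OUT, NEXT> next) {
        // The composed transformer first applies this, then the next stage.
        return input -> next.transform(this.transform(input));
    }
}
```

A pipeline is then just a chain of such stages, e.g. a scaler chained into a learner, each stage reusable on its own.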





[jira] [Commented] (FLINK-1999) TF-IDF transformer

2015-05-25 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558224#comment-14558224
 ] 

Till Rohrmann commented on FLINK-1999:
--

You can do it however you like, whichever is easier for you.

 TF-IDF transformer
 --

 Key: FLINK-1999
 URL: https://issues.apache.org/jira/browse/FLINK-1999
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Ronny Bräunlich
Assignee: Alexander Alexandrov
Priority: Minor
  Labels: ML

 Hello everybody,
 we are a group of three students from TU Berlin (I guess we're not the first 
 group creating an issue) and we want to/have to implement a TF-IDF transformer 
 for Flink.
 Our lecturer Alexander told us that we could get some guidance here and that 
 you could point us to an old version of a similar transformer.
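For reference, plain TF-IDF scoring (independent of any Flink APIs) can be sketched like this. The class name and the exact IDF formula (natural log, no smoothing) are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TF-IDF: tfidf(t, d) = tf(t, d) * ln(N / df(t)),
// where N is the corpus size and df(t) counts documents containing t.
// Assumes every scored document is part of the corpus, so df(t) >= 1.
public class TfIdf {
    public static Map<String, Double> score(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : doc) {
            long tf = doc.stream().filter(term::equals).count();
            long df = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log((double) corpus.size() / df);
            scores.put(term, tf * idf);
        }
        return scores;
    }
}
```

A distributed version would compute document frequencies with a reduce over the corpus instead of the per-term scan shown here.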





[jira] [Resolved] (FLINK-2056) Add guide to create a chainable predictor in docs

2015-05-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2056.
--
Resolution: Fixed

Added with 48e2cb5e8be7c4f305b947fb25ea7d312844e032

 Add guide to create a chainable predictor in docs
 -

 Key: FLINK-2056
 URL: https://issues.apache.org/jira/browse/FLINK-2056
 Project: Flink
  Issue Type: Sub-task
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.9


 The upcoming API for pipelines should have good documentation to guide and 
 encourage the implementation of more algorithms.
 For this task we will create a guide that shows how the pipeline mechanism 
 works through Scala implicits, and a full guide to implementing a chainable 
 predictor, using Generalized Linear Models as an example.





[jira] [Created] (FLINK-2090) toString of CollectionInputFormat takes long time when the collection is huge

2015-05-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2090:


 Summary: toString of CollectionInputFormat takes long time when 
the collection is huge
 Key: FLINK-2090
 URL: https://issues.apache.org/jira/browse/FLINK-2090
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Priority: Minor


The {{toString}} method of {{CollectionInputFormat}} calls {{toString}} on its 
underlying {{Collection}}. Thus, {{toString}} is called for each element of the 
collection. If the {{Collection}} contains many elements or the individual 
{{toString}} calls for each element take a long time, then the string 
generation can take a considerable amount of time. [~mikiobraun] noticed that 
when he inserted several jBLAS matrices into Flink.

The {{toString}} method is mainly used for logging statements in 
{{DataSourceNode}}'s {{computeOperatorSpecificDefaultEstimates}} method and in 
{{JobGraphGenerator.getDescriptionForUserCode}}. I'm wondering whether it is 
necessary to print the complete content of the underlying {{Collection}} or 
whether it would be enough to print only the first 3 elements in the {{toString}} method.
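Printing only a prefix of the collection could look like the following hypothetical sketch; the class name and the cutoff of 3 elements follow the suggestion in the issue and are otherwise assumptions:

```java
import java.util.Collection;
import java.util.Iterator;

// Hypothetical sketch: bound the string representation of a collection
// to its first few elements to keep logging cheap for huge inputs.
public class BoundedToString {
    static final int MAX_ELEMENTS = 3; // cutoff suggested in the issue

    public static String describe(Collection<?> data) {
        StringBuilder sb = new StringBuilder("[");
        Iterator<?> it = data.iterator();
        for (int i = 0; i < MAX_ELEMENTS && it.hasNext(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(it.next());
        }
        if (it.hasNext()) {
            sb.append(", ..."); // elide the rest instead of stringifying it
        }
        return sb.append("]").toString();
    }
}
```

This keeps the log output O(1) in the collection size instead of O(n).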





[jira] [Closed] (FLINK-2053) Preregister ML types for Kryo serialization

2015-05-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2053.

Resolution: Fixed

Added with ae446388b91ecc0f08887da19400395b96b32f6c

 Preregister ML types for Kryo serialization
 ---

 Key: FLINK-2053
 URL: https://issues.apache.org/jira/browse/FLINK-2053
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Till Rohrmann
  Labels: ML
 Fix For: 0.9


 Currently, FlinkML uses interfaces and abstract types to implement generic 
 algorithms. As a consequence, we have to use Kryo to serialize the effective 
 subtypes. To speed up the data transfer, it is necessary to preregister 
 these types so that they are assigned fixed IDs.





[jira] [Assigned] (FLINK-2073) Add contribution guide for FlinkML

2015-05-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2073:


Assignee: Till Rohrmann

 Add contribution guide for FlinkML
 --

 Key: FLINK-2073
 URL: https://issues.apache.org/jira/browse/FLINK-2073
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Till Rohrmann
 Fix For: 0.9


 We need a guide for contributions to FlinkML in order to encourage the 
 extension of the library, and provide guidelines for developers.
 One thing that should be included is a step-by-step guide to creating a 
 transformer or other Estimator.





[jira] [Created] (FLINK-2075) Shade akka and protobuf dependencies away

2015-05-21 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2075:


 Summary: Shade akka and protobuf dependencies away
 Key: FLINK-2075
 URL: https://issues.apache.org/jira/browse/FLINK-2075
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
 Fix For: 0.9


Lately, the Zeppelin project encountered the following problem: it includes 
flink-runtime, which depends on akka_remote:2.3.7, which in turn depends on 
protobuf-java:2.5.0. However, Zeppelin sets the protobuf-java version to 2.4.1 
to make it build with YARN 2.2. Due to this, akka_remote finds a wrong 
protobuf-java version and fails because of an incompatible change between these 
versions.

I propose to shade Flink's akka dependency and protobuf dependency away, so 
that user projects depending on Flink are not forced to use a special 
akka/protobuf version.





[jira] [Resolved] (FLINK-2035) Update 0.9 roadmap with ML issues

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2035.
--
Resolution: Fixed

Marked corresponding jira issues and added them to the google doc.

 Update 0.9 roadmap with ML issues
 -

 Key: FLINK-2035
 URL: https://issues.apache.org/jira/browse/FLINK-2035
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Till Rohrmann
 Fix For: 0.9


 The [current 
 list|https://issues.apache.org/jira/browse/FLINK-2001?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%200.9%20AND%20component%20%3D%20%22Machine%20Learning%20Library%22]
  of issues linked with the 0.9 release is quite limited.
 We should go through the current ML issues and assign fix versions, so that 
 we have a clear view of what we expect to have in 0.9.





[jira] [Resolved] (FLINK-2083) Ensure high quality docs for FlinkML in 0.9

2015-05-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2083.
--
Resolution: Fixed

Improved with b015a32f6126d759fc6dee90b78f90f7ff8dfbac

 Ensure high quality docs for FlinkML in 0.9
 ---

 Key: FLINK-2083
 URL: https://issues.apache.org/jira/browse/FLINK-2083
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
  Labels: ML
 Fix For: 0.9


 As defined in our vision for FlinkML, providing high-quality documentation is 
 a primary goal for us.
 This issue concerns the docs that will be included in 0.9, and will track 
 improvements and additions for the release.





[jira] [Updated] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-08-13 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1745:
-
Assignee: (was: Till Rohrmann)

 Add exact k-nearest-neighbours algorithm to machine learning library
 

 Key: FLINK-1745
 URL: https://issues.apache.org/jira/browse/FLINK-1745
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
  Labels: ML, Starter

 Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
 it is still used as a means to classify data and to do regression. This issue 
 focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
 proposed in [2].
 Could be a starter task.
 Resources:
 [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
 [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
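For orientation, the brute-force exact kNN that the H-BNLJ/H-BRJ schemes of [2] parallelize can be sketched in a few lines. This is a hypothetical, single-machine illustration, not a distributed operator:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical single-machine sketch of exact kNN: compute the distance from
// every training point to the query and keep the k closest points.
public class Knn {
    public static double[][] kNearest(double[][] points, double[] query, int k) {
        return Arrays.stream(points)
                .sorted(Comparator.comparingDouble((double[] p) -> squaredDistance(p, query)))
                .limit(k)
                .toArray(double[][]::new);
    }

    // Squared Euclidean distance; the square root is unnecessary for ranking.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

The distributed variants replace the full sort with block-partitioned joins over the cross product of test and training points.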





[jira] [Closed] (FLINK-2521) Add automatic test name logging for tests

2015-08-18 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2521.

Resolution: Fixed

Added via 2f0412f163f4d37605188c8cc763111e0b51f0dc

 Add automatic test name logging for tests
 -

 Key: FLINK-2521
 URL: https://issues.apache.org/jira/browse/FLINK-2521
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Priority: Minor

 When running tests on travis the Flink components log to a file. This is 
 helpful in case of a failed test to retrieve the error. However, the log does 
 not contain the test name and the reason for the failure. Therefore it is 
 difficult to find the log output which corresponds to the failed test.
 It would be nice to automatically add the test case information to the log. 
 This would ease the debugging process big time.





[jira] [Commented] (FLINK-1994) Add different gain calculation schemes to SGD

2015-08-18 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701225#comment-14701225
 ] 

Till Rohrmann commented on FLINK-1994:
--

Sounds great [~rawkintrevo]. When it's ready, you should open a PR against 
Flink's master branch on GitHub. 

 Add different gain calculation schemes to SGD
 -

 Key: FLINK-1994
 URL: https://issues.apache.org/jira/browse/FLINK-1994
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Trevor Grant
Priority: Minor
  Labels: ML, Starter

 The current SGD implementation uses the formula 
 {{stepsize/sqrt(iterationNumber)}} as the gain for the weight updates. It 
 would be good to make the gain calculation configurable and to provide 
 different strategies for that. For example:
 * stepsize/(1 + iterationNumber)
 * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4)
 See [1] for how to properly select the gains.
 Resources:
 [1] http://arxiv.org/pdf/1107.2490.pdf
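The schemes listed in the issue can be written out as follows; this is a hypothetical sketch and the method names are assumptions:

```java
// Hypothetical sketch of pluggable gain (learning-rate) schedules for SGD.
public class GainSchedules {
    // Current scheme: stepsize / sqrt(iterationNumber)
    public static double inverseSqrt(double stepsize, int iteration) {
        return stepsize / Math.sqrt(iteration);
    }

    // Proposed alternative: stepsize / (1 + iterationNumber)
    public static double inverseLinear(double stepsize, int iteration) {
        return stepsize / (1 + iteration);
    }

    // Proposed alternative: stepsize * (1 + regularization * stepsize * iterationNumber)^(-3/4)
    public static double regularized(double stepsize, double regularization, int iteration) {
        return stepsize * Math.pow(1 + regularization * stepsize * iteration, -0.75);
    }
}
```

Making the schedule a strategy object would let users plug in any of these without touching the SGD driver.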





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-08-19 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702892#comment-14702892
 ] 

Till Rohrmann commented on FLINK-2366:
--

Do you mind taking the lead for the alternative HA service implementations 
[~sirinath19...@gmail.com]?

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-08-19 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703052#comment-14703052
 ] 

Till Rohrmann commented on FLINK-2366:
--

Great to hear [~sirinath19...@gmail.com] :-) If you take a look at PR #1016 
on github (https://github.com/apache/flink/pull/1016) you'll find the 
definition of the SPI. Most notably, the {{LeaderElectionService}} and the 
{{LeaderRetrievalService}} interfaces will be of interest to you. Those are the 
services which you have to implement to add a new backend for HA. 
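As rough orientation, such a leader-election SPI can look like the following simplified sketch. The actual interfaces in PR #1016 differ in their exact signatures, so treat all names here as assumptions:

```java
import java.util.UUID;

// Simplified, hypothetical sketch of a pluggable leader-election SPI.
// A contender is notified when it gains or loses leadership.
interface LeaderContender {
    void grantLeadership(UUID leaderSessionId);
    void revokeLeadership();
}

// A backend (e.g. ZooKeeper-based) implements the election itself.
interface LeaderElectionService {
    void start(LeaderContender contender) throws Exception;
    void stop() throws Exception;
}

// Trivial backend for the non-HA case: the single contender is always the leader.
class StandaloneLeaderElection implements LeaderElectionService {
    @Override
    public void start(LeaderContender contender) {
        contender.grantLeadership(UUID.randomUUID());
    }

    @Override
    public void stop() {}
}
```

A new HA backend would replace the standalone class with one that runs a real election among several contenders.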

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Comment Edited] (FLINK-2549) Add topK operator for DataSet

2015-08-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704565#comment-14704565
 ] 

Till Rohrmann edited comment on FLINK-2549 at 8/20/15 9:17 AM:
---

I agree with [~StephanEwen]. Sorting the complete input with n elements has a 
complexity of O(n * log( n )) whereas keeping the k topmost elements in a 
priority queue gives you a worst case of O(n * log( k )). Assuming k << n, 
this is worth the effort.


was (Author: till.rohrmann):
I agree with [~StephanEwen]. Sorting the complete input with n elements has a 
complexity of O(n * log(n)) whereas keeping the k topmost elements in a 
priority queue gives you a worst case of O(n * log(k)). Assuming k << n, 
this is worth the effort.
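The priority-queue approach described in the comment can be sketched as follows; this is a hypothetical local illustration, not a DataSet operator:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of top-k via a bounded min-heap: keep only the k largest
// elements seen so far, giving O(n * log(k)) instead of O(n * log(n)) sorting.
public class TopK {
    public static List<Integer> topK(List<Integer> input, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap of size <= k
        for (int value : input) {
            heap.offer(value);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest, keeping the k largest
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

In a distributed setting, each partition computes its local top k and only the partial results are merged, so the full input is never sorted.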

 Add topK operator for DataSet
 -

 Key: FLINK-2549
 URL: https://issues.apache.org/jira/browse/FLINK-2549
 Project: Flink
  Issue Type: New Feature
  Components: Core, Java API, Scala API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor

 topK is a common operation for users; it would be great to have it in Flink. 





[jira] [Commented] (FLINK-2549) Add topK operator for DataSet

2015-08-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704565#comment-14704565
 ] 

Till Rohrmann commented on FLINK-2549:
--

I agree with [~StephanEwen]. Sorting the complete input with n elements has a 
complexity of O(n * log(n)) whereas keeping the k topmost elements in a 
priority queue gives you a worst case of O(n * log(k)). Assuming k << n, 
this is worth the effort.

 Add topK operator for DataSet
 -

 Key: FLINK-2549
 URL: https://issues.apache.org/jira/browse/FLINK-2549
 Project: Flink
  Issue Type: New Feature
  Components: Core, Java API, Scala API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor

 topK is a common operation for users; it would be great to have it in Flink. 





[jira] [Commented] (FLINK-2549) Add topK operator for DataSet

2015-08-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706744#comment-14706744
 ] 

Till Rohrmann commented on FLINK-2549:
--

This issue is required to implement a sample operator which works on Flink's managed memory.

 Add topK operator for DataSet
 -

 Key: FLINK-2549
 URL: https://issues.apache.org/jira/browse/FLINK-2549
 Project: Flink
  Issue Type: New Feature
  Components: Core, Java API, Scala API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
Priority: Minor

 topK is a common operation for users; it would be great to have it in Flink. 





[jira] [Resolved] (FLINK-1901) Create sample operator for Dataset

2015-08-21 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-1901.
--
Resolution: Fixed

Added via c9cfb17cb095def8b8ea0ed1b598fc78b890b874

A fixed size sample operator working on Flink's managed memory will be 
implemented with FLINK-2549.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative or exact size of the sample, set a seed for 
 reproducibility, and support sampling within iterations.
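A fixed-size sample without replacement with a reproducible seed is classically done with reservoir sampling; the following is a hypothetical local sketch, not the Flink operator itself:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of reservoir sampling: a fixed-size sample without
// replacement in one pass over the data, seeded for reproducibility.
public class ReservoirSample {
    public static List<Integer> sample(List<Integer> input, int k, long seed) {
        Random rnd = new Random(seed);
        List<Integer> reservoir = new ArrayList<>(k);
        for (int i = 0; i < input.size(); i++) {
            if (i < k) {
                reservoir.add(input.get(i)); // fill the reservoir first
            } else {
                // Replace a random slot with probability k / (i + 1).
                int j = rnd.nextInt(i + 1);
                if (j < k) {
                    reservoir.set(j, input.get(i));
                }
            }
        }
        return reservoir;
    }
}
```

Sampling with replacement or by relative fraction uses different distributions (e.g. Poisson per element), but follows the same single-pass shape.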





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-08-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706360#comment-14706360
 ] 

Till Rohrmann commented on FLINK-1745:
--

Sorry for the late reply, but I want to review the PR first.



 Add exact k-nearest-neighbours algorithm to machine learning library
 

 Key: FLINK-1745
 URL: https://issues.apache.org/jira/browse/FLINK-1745
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
  Labels: ML, Starter

 Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
 it is still used as a means to classify data and to do regression. This issue 
 focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
 proposed in [2].
 Could be a starter task.
 Resources:
 [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
 [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]





[jira] [Created] (FLINK-2544) Some test cases using PowerMock fail with Java 8u20

2015-08-18 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2544:


 Summary: Some test cases using PowerMock fail with Java 8u20
 Key: FLINK-2544
 URL: https://issues.apache.org/jira/browse/FLINK-2544
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann
Priority: Minor


I observed that some of the test cases using {{PowerMockRunner}} fail with Java 
8u20 with the following error:

{code}
java.lang.VerifyError: Bad init method call from inside of a branch
Exception Details:
  Location:
org/apache/flink/client/program/ClientTest$SuccessReturningActor.init()V 
@32: invokespecial
  Reason:
Error exists in the bytecode
  Bytecode:
0x000: 2a4c 1214 b800 1a03 bd00 0d12 1bb8 001f
0x010: b800 254e 2db2 0029 a500 0e2a 01c0 002b
0x020: b700 2ea7 0009 2bb7 0030 0157 2a01 4c01
0x030: 4d01 4e2b 01a5 0008 2b4e a700 0912 32b8
0x040: 001a 4e2d 1234 03bd 000d 1236 b800 1f12
0x050: 32b8 003a 3a04 1904 b200 29a6 000a b800
0x060: 3c4d a700 0919 04c0 0011 4d2c b800 42b5
0x070: 0046 b1
  Stackmap Table:
full_frame(@38,{UninitializedThis,UninitializedThis,Top,Object[#13]},{})
full_frame(@44,{Object[#2],Object[#2],Top,Object[#13]},{})
full_frame(@61,{Object[#2],Null,Null,Null},{Object[#2]})
full_frame(@67,{Object[#2],Null,Null,Object[#15]},{Object[#2]})
full_frame(@101,{Object[#2],Null,Null,Object[#15],Object[#13]},{Object[#2]})

full_frame(@107,{Object[#2],Null,Object[#17],Object[#15],Object[#13]},{Object[#2]})

at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2658)
at java.lang.Class.getConstructor0(Class.java:3062)
at java.lang.Class.getDeclaredConstructor(Class.java:2165)
at akka.util.Reflect$$anonfun$4.apply(Reflect.scala:86)
at akka.util.Reflect$$anonfun$4.apply(Reflect.scala:86)
at scala.util.Try$.apply(Try.scala:161)
at akka.util.Reflect$.findConstructor(Reflect.scala:86)
at akka.actor.NoArgsReflectConstructor.init(Props.scala:359)
at akka.actor.IndirectActorProducer$.apply(Props.scala:308)
at akka.actor.Props.producer(Props.scala:176)
at akka.actor.Props.init(Props.scala:189)
at akka.actor.Props$.create(Props.scala:99)
at akka.actor.Props$.create(Props.scala:99)
at akka.actor.Props.create(Props.scala)
at 
org.apache.flink.client.program.ClientTest.shouldSubmitToJobClient(ClientTest.java:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.internal.runners.TestMethod.invoke(TestMethod.java:68)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:310)
at org.junit.internal.runners.MethodRoadie$2.run(MethodRoadie.java:88)
at 
org.junit.internal.runners.MethodRoadie.runBeforesThenTestThenAfters(MethodRoadie.java:96)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.executeTest(PowerMockJUnit44RunnerDelegateImpl.java:294)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTestInSuper(PowerMockJUnit47RunnerDelegateImpl.java:127)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit47RunnerDelegateImpl$PowerMockJUnit47MethodRunner.executeTest(PowerMockJUnit47RunnerDelegateImpl.java:82)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$PowerMockJUnit44MethodRunner.runBeforesThenTestThenAfters(PowerMockJUnit44RunnerDelegateImpl.java:282)
at org.junit.internal.runners.MethodRoadie.runTest(MethodRoadie.java:86)
at org.junit.internal.runners.MethodRoadie.run(MethodRoadie.java:49)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.invokeTestMethod(PowerMockJUnit44RunnerDelegateImpl.java:207)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.runMethods(PowerMockJUnit44RunnerDelegateImpl.java:146)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl$1.run(PowerMockJUnit44RunnerDelegateImpl.java:120)
at 
org.junit.internal.runners.ClassRoadie.runUnprotected(ClassRoadie.java:33)
at 
org.junit.internal.runners.ClassRoadie.runProtected(ClassRoadie.java:45)
at 
org.powermock.modules.junit4.internal.impl.PowerMockJUnit44RunnerDelegateImpl.run(PowerMockJUnit44RunnerDelegateImpl.java:118)
at 

[jira] [Updated] (FLINK-2564) Failing Test: RandomSamplerTest

2015-08-24 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2564:
-
Labels: test-stability  (was: )

 Failing Test: RandomSamplerTest
 ---

 Key: FLINK-2564
 URL: https://issues.apache.org/jira/browse/FLINK-2564
 Project: Flink
  Issue Type: Bug
Reporter: Matthias J. Sax
  Labels: test-stability

 {noformat}
 Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 15.943 sec 
 <<< FAILURE! - in org.apache.flink.api.java.sampling.RandomSamplerTest
 testPoissonSamplerFraction(org.apache.flink.api.java.sampling.RandomSamplerTest)
  Time elapsed: 0.017 sec <<< FAILURE!
 java.lang.AssertionError: expected fraction: 0.01, result fraction: 
 0.011300
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.assertTrue(Assert.java:41)
 at 
 org.apache.flink.api.java.sampling.RandomSamplerTest.verifySamplerFraction(RandomSamplerTest.java:249)
 at 
 org.apache.flink.api.java.sampling.RandomSamplerTest.testPoissonSamplerFraction(RandomSamplerTest.java:116)
 Results :
 Failed tests:
 Successfully installed excon-0.33.0
 RandomSamplerTest.testPoissonSamplerFraction:116->verifySamplerFraction:249 
 expected fraction: 0.01, result fraction: 0.011300
 {noformat}
 Full log: https://travis-ci.org/apache/flink/jobs/76720572





[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-08-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704520#comment-14704520
 ] 

Till Rohrmann commented on FLINK-2366:
--

I'll let you know when this happened :-)

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.





[jira] [Updated] (FLINK-2066) Make delay between execution retries configurable

2015-06-29 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2066:
-
Assignee: Nuno Miguel Marques dos Santos

 Make delay between execution retries configurable
 -

 Key: FLINK-2066
 URL: https://issues.apache.org/jira/browse/FLINK-2066
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Nuno Miguel Marques dos Santos
  Labels: starter

 Flink allows specifying a delay between execution retries. This helps to let 
 some external failure causes fully manifest themselves before the restart is 
 attempted.
 The delay is currently defined only system-wide.
 We should add it to the {{ExecutionConfig}} of a job to allow per-job 
 specification.
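The per-job override with a system-wide fallback can be sketched like this; all names are hypothetical and this is not the actual {{ExecutionConfig}} API:

```java
// Hypothetical sketch: a per-job retry delay that falls back to the
// system-wide default when the job does not set its own value.
public class RetryDelayConfig {
    private final long systemDefaultMillis;
    private Long jobDelayMillis; // null means "use the system-wide value"

    public RetryDelayConfig(long systemDefaultMillis) {
        this.systemDefaultMillis = systemDefaultMillis;
    }

    public void setJobDelayMillis(long millis) {
        this.jobDelayMillis = millis;
    }

    public long effectiveDelayMillis() {
        return jobDelayMillis != null ? jobDelayMillis : systemDefaultMillis;
    }
}
```

The same fallback pattern applies to any setting that exists both cluster-wide and per job.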





[jira] [Created] (FLINK-2289) Make JobManager highly available

2015-06-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2289:


 Summary: Make JobManager highly available
 Key: FLINK-2289
 URL: https://issues.apache.org/jira/browse/FLINK-2289
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann


Currently, the {{JobManager}} is the single point of failure in the Flink 
system. If it fails, then your job cannot be recovered and the Flink cluster is 
no longer able to receive new jobs.

Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the 
Flink cluster can recover from a failed {{JobManager}}. As a first step towards 
this goal, I propose to make the {{JobManager}} highly available by starting 
multiple instances and using Apache ZooKeeper to elect a leader. The leader is 
responsible for the execution of the Flink job. 

In case the {{JobManager}} dies, one of the other running {{JobManager}} 
instances will be elected as the new leader and take over that role. The 
{{Client}} and the {{TaskManager}} will automatically detect the new 
{{JobManager}} by querying the ZooKeeper cluster.

Note that this does not achieve full fault tolerance for the {{JobManager}}, but 
it allows the cluster to recover from a failed {{JobManager}}. The design of 
high availability for the {{JobManager}} is tracked in the wiki [1].

Resources:
[1] 
[https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability]





[jira] [Assigned] (FLINK-2278) SparseVector created from Breeze Sparsevector has wrong size

2015-06-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2278:


Assignee: Till Rohrmann

 SparseVector created from Breeze Sparsevector has wrong size
 

 Key: FLINK-2278
 URL: https://issues.apache.org/jira/browse/FLINK-2278
 Project: Flink
  Issue Type: Bug
Reporter: Christoph Alt
Assignee: Till Rohrmann
  Labels: ML

 The following code doesn't return true when testing equality of two 
 SparseVectors, one converted from a Breeze SparseVector.
 {code}
 val flinkVector = SparseVector.fromCOO(3, (1, 1.0), (2, 2.0))
 val breezeVector = linalg.SparseVector(3)(1 -> 1.0, 2 -> 2.0)
 flinkVector.equalsVector(breezeVector.fromBreeze)
 {code}
 The reason is that *fromBreeze* takes the number of non-zero elements 
 *SparseVector.used* as size when creating a SparseVector instead of the 
 dimensionality *SparseVector.length*.





[jira] [Resolved] (FLINK-2278) SparseVector created from Breeze Sparsevector has wrong size

2015-06-26 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2278.
--
Resolution: Fixed

Fixed via bc3684e69aff1a73f7fb3a62b097e9fbb311cd71

 SparseVector created from Breeze Sparsevector has wrong size
 

 Key: FLINK-2278
 URL: https://issues.apache.org/jira/browse/FLINK-2278
 Project: Flink
  Issue Type: Bug
Reporter: Christoph Alt
  Labels: ML

 The following code doesn't return true when testing equality of two 
 SparseVectors, one converted from a Breeze SparseVector.
 {code}
 val flinkVector = SparseVector.fromCOO(3, (1, 1.0), (2, 2.0))
 val breezeVector = linalg.SparseVector(3)(1 -> 1.0, 2 -> 2.0)
 flinkVector.equalsVector(breezeVector.fromBreeze)
 {code}
 The reason is that *fromBreeze* takes the number of non-zero elements 
 *SparseVector.used* as size when creating a SparseVector instead of the 
 dimensionality *SparseVector.length*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1994) Add different gain calculation schemes to SGD

2015-07-31 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1994:
-
Assignee: Trevor Grant

 Add different gain calculation schemes to SGD
 -

 Key: FLINK-1994
 URL: https://issues.apache.org/jira/browse/FLINK-1994
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Trevor Grant
Priority: Minor
  Labels: ML, Starter

 The current SGD implementation uses the formula 
 {{stepsize/sqrt(iterationNumber)}} as the gain for the weight updates. It 
 would be good to make the gain calculation configurable and to provide 
 different strategies for that. For example:
 * stepsize/(1 + iterationNumber)
 * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4)
 See also how to properly select the gains [1].
 Resources:
 [1] http://arxiv.org/pdf/1107.2490.pdf
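The listed schedules can be sketched as plain functions. A hedged Python illustration (function names are ours, not a Flink API; the third schedule follows the ticket's formula):

```python
import math

def gain_inverse_sqrt(stepsize, iteration):
    # Current Flink behaviour: stepsize / sqrt(iterationNumber)
    return stepsize / math.sqrt(iteration)

def gain_inverse(stepsize, iteration):
    # Proposed alternative: stepsize / (1 + iterationNumber)
    return stepsize / (1 + iteration)

def gain_bottou(stepsize, regularization, iteration):
    # Proposed alternative: stepsize * (1 + regularization * stepsize * iteration)^(-3/4)
    return stepsize * (1 + regularization * stepsize * iteration) ** -0.75

print(gain_inverse_sqrt(0.1, 4))   # 0.05
```

Making the schedule a pluggable function keeps the SGD loop itself unchanged: it just asks the configured strategy for the gain at each iteration.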



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1994) Add different gain calculation schemes to SGD

2015-07-31 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649172#comment-14649172
 ] 

Till Rohrmann commented on FLINK-1994:
--

Great to hear [~rawkintrevo]. I assigned the issue to you.

 Add different gain calculation schemes to SGD
 -

 Key: FLINK-1994
 URL: https://issues.apache.org/jira/browse/FLINK-1994
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Trevor Grant
Priority: Minor
  Labels: ML, Starter

 The current SGD implementation uses as gain for the weight updates the 
 formula {{stepsize/sqrt(iterationNumber)}}. It would be good to make the gain 
 calculation configurable and to provide different strategies for that. For 
 example:
 * stepsize/(1 + iterationNumber)
 * stepsize*(1 + regularization * stepsize * iterationNumber)^(-3/4)
 See also how to properly select the gains [1].
 Resources:
 [1] http://arxiv.org/pdf/1107.2490.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link

2015-08-04 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653209#comment-14653209
 ] 

Till Rohrmann commented on FLINK-2478:
--

Thanks for reporting the broken link Slim. Will fix it immediately.

 The page “FlinkML - Machine Learning for Flink“  
 https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a 
 dead link
 -

 Key: FLINK-2478
 URL: https://issues.apache.org/jira/browse/FLINK-2478
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 0.10
Reporter: Slim Baltagi
Assignee: Till Rohrmann
Priority: Minor

 Note that FlinkML is currently not part of the binary distribution. See 
 linking with it for cluster execution here.
 'here' links to a dead link: 
 https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution
 The correct link is: 
 https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link

2015-08-04 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2478:


Assignee: Till Rohrmann

 The page “FlinkML - Machine Learning for Flink“  
 https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a 
 dead link
 -

 Key: FLINK-2478
 URL: https://issues.apache.org/jira/browse/FLINK-2478
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 0.10
Reporter: Slim Baltagi
Assignee: Till Rohrmann
Priority: Minor

 Note that FlinkML is currently not part of the binary distribution. See 
 linking with it for cluster execution here.
 'here' links to a dead link: 
 https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution
 The correct link is: 
 https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-2478) The page “FlinkML - Machine Learning for Flink“ https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a dead link

2015-08-04 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2478.
--
Resolution: Fixed

Fixed via 77b7471580ce9cada86e32c2b6919086ed2eb730

 The page “FlinkML - Machine Learning for Flink“  
 https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/ contains a 
 dead link
 -

 Key: FLINK-2478
 URL: https://issues.apache.org/jira/browse/FLINK-2478
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 0.10
Reporter: Slim Baltagi
Assignee: Till Rohrmann
Priority: Minor

 Note that FlinkML is currently not part of the binary distribution. See 
 linking with it for cluster execution here.
 'here' links to a dead link: 
 https://ci.apache.org/projects/flink/flink-docs-master/libs/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution
 The correct link is: 
 https://ci.apache.org/projects/flink/flink-docs-master/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2469) JobManager crashes on Cancel

2015-08-03 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2469.

Resolution: Duplicate

FLINK-2409 already covers the issue.

 JobManager crashes on Cancel
 

 Key: FLINK-2469
 URL: https://issues.apache.org/jira/browse/FLINK-2469
 Project: Flink
  Issue Type: Bug
Reporter: Matthias J. Sax

 In local mode, JobManager crashes if job is canceled via 
 JobManger-WebFrontend.
 The log shows the following error:
 13:19:34,722 ERROR akka.actor.OneForOneStrategy   
- Received a message CancelJob(948b32f3fb3f5cbd542123e9aff14013) without a 
 leader session ID, even though it requires to have one.
 java.lang.Exception: Received a message 
 CancelJob(948b32f3fb3f5cbd542123e9aff14013) without a leader session ID, even 
 though it requires to have one.
   at 
 org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
   at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
   at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
   at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
   at 
 org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
   at akka.dispatch.Mailbox.run(Mailbox.scala:221)
   at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Not sure if cluster mode is affected, too (could not try it out, but would 
 assume yes). CliFrontend is *not* affected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2401) Replace ActorRefs with ActorGateway in web server

2015-08-03 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2401.

Resolution: Duplicate

See FLINK-2409

 Replace ActorRefs with ActorGateway in web server
 -

 Key: FLINK-2401
 URL: https://issues.apache.org/jira/browse/FLINK-2401
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Priority: Minor

 The web server is the only remaining component which uses {{ActorRefs}} 
 directly to communicate with Flink actors. They should be replaced by 
 {{ActorGateways}} which allow the automatic decoration of messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2409) Old JM web interface is sending cancel messages w/o leader ID

2015-08-03 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2409.

Resolution: Fixed

Fixed via fab61a1954ff1554448e826e1d273689ed520fc3

 Old JM web interface is sending cancel messages w/o leader ID
 -

 Key: FLINK-2409
 URL: https://issues.apache.org/jira/browse/FLINK-2409
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Till Rohrmann

 {code}
 12:29:41,877 ERROR akka.actor.OneForOneStrategy   
- Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a 
 leader session ID, even though it requires to have one.
 java.lang.Exception: Received a message 
 CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even 
 though it requires to have one.
 at 
 org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 12:29:41,879 INFO  org.apache.flink.runtime.jobmanager.JobManager 
- Stopping JobManager akka://flink/user/jobmanager#-638215033.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2472) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive

2015-08-03 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2472:


 Summary: Make the JobClientActor check periodically if the 
submitted Job is still running and if the JobManager is still alive
 Key: FLINK-2472
 URL: https://issues.apache.org/jira/browse/FLINK-2472
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann


In case the {{JobManager}} dies without notifying possibly connected 
{{JobClientActors}}, or if the job execution finishes without sending the 
{{SerializedJobExecutionResult}} back to the {{JobClientActor}}, it might 
happen that {{JobClient.submitJobAndWait}} never returns.

I propose to let the {{JobClientActor}} periodically check whether the 
{{JobManager}} is still alive and whether the submitted job is still running. 
If not, then the {{JobClientActor}} should return an exception to complete the 
waiting future.
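The proposed behaviour amounts to a watchdog loop: poll the JobManager at a fixed interval and complete the pending future exceptionally once it stops responding. A hedged Python sketch (plain threading as a stand-in for the actor model; all names are illustrative, not the eventual Flink API):

```python
import threading

class JobClientWatchdog:
    """Periodically checks a liveness predicate and reports a failure
    to the waiting caller when it turns false."""
    def __init__(self, is_alive, interval, on_failure):
        self._is_alive = is_alive      # callable: ping JobManager / check job status
        self._interval = interval      # seconds between checks
        self._on_failure = on_failure  # completes the waiting future exceptionally
        self._stop = threading.Event()

    def run(self):
        # Event.wait doubles as the sleep and as the cancellation check.
        while not self._stop.wait(self._interval):
            if not self._is_alive():
                self._on_failure(RuntimeError("JobManager unreachable or job gone"))
                return

    def cancel(self):
        # Called when the result arrives normally.
        self._stop.set()

failures = []
alive_states = iter([True, True, False])           # JobManager dies on 3rd check
JobClientWatchdog(lambda: next(alive_states), 0.001, failures.append).run()
assert len(failures) == 1
```

In the actor setting, the loop would be a scheduled self-message and the failure callback would fail the future returned by {{submitJobAndWait}}.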



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2472) Make the JobClientActor check periodically if the submitted Job is still running and if the JobManager is still alive

2015-08-03 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2472:
-
Issue Type: Improvement  (was: Bug)

 Make the JobClientActor check periodically if the submitted Job is still 
 running and if the JobManager is still alive
 -

 Key: FLINK-2472
 URL: https://issues.apache.org/jira/browse/FLINK-2472
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann

 In case the {{JobManager}} dies without notifying possibly connected 
 {{JobClientActors}}, or if the job execution finishes without sending the 
 {{SerializedJobExecutionResult}} back to the {{JobClientActor}}, it might 
 happen that {{JobClient.submitJobAndWait}} never returns.
 I propose to let the {{JobClientActor}} periodically check whether the 
 {{JobManager}} is still alive and whether the submitted job is still running. 
 If not, then the {{JobClientActor}} should return an exception to complete 
 the waiting future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2504) ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously

2015-08-11 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681475#comment-14681475
 ] 

Till Rohrmann commented on FLINK-2504:
--

[~StephanEwen] the output of Travis is all I got. There is apparently a problem 
with the watchdog script for my repository which prevented the upload.

[~sachingoel0101] your stack trace looks similar to FLINK-1455.

 ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed 
 spuriously
 -

 Key: FLINK-2504
 URL: https://issues.apache.org/jira/browse/FLINK-2504
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann

 The test 
 {{ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed}} 
 failed in one of my Travis builds: 
 https://travis-ci.org/tillrohrmann/flink/jobs/74881883



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2504) ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed spuriously

2015-08-10 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2504:


 Summary: 
ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed failed 
spuriously
 Key: FLINK-2504
 URL: https://issues.apache.org/jira/browse/FLINK-2504
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann


The test 
{{ExternalSortLargeRecordsITCase.testSortWithLongAndShortRecordsMixed}} failed 
in one of my Travis builds: 
https://travis-ci.org/tillrohrmann/flink/jobs/74881883



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2521) Add automatic test name logging for tests

2015-08-14 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2521:


 Summary: Add automatic test name logging for tests
 Key: FLINK-2521
 URL: https://issues.apache.org/jira/browse/FLINK-2521
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Priority: Minor


When running tests on Travis, the Flink components log to a file. This is 
helpful for retrieving the error in case of a failed test. However, the log 
contains neither the test name nor the reason for the failure, so it is 
difficult to find the log output which corresponds to the failed test.

It would be nice to automatically add the test case information to the log. 
This would ease the debugging process considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2518) Avoid predetermination of ports for network services

2015-08-14 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2518:


 Summary: Avoid predetermination of ports for network services
 Key: FLINK-2518
 URL: https://issues.apache.org/jira/browse/FLINK-2518
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
 Fix For: 0.10


Some of Flink's network services use {{NetUtils.getAvailablePort()}} to 
predetermine an available port for a service which is started later. This can 
lead to a race condition where two services predetermine the same available 
port and one of them later fails to instantiate because the port is already in 
use.

This is, for example, the case for the {{NettyConnectionManager}}, which is 
started after the {{TaskManager}} has registered at the {{JobManager}}. It 
would be better to first start the network services, e.g. the 
{{NettyConnectionManager}}, on a random port and then send the bound port to 
the client. This would avoid such problems.
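The race disappears when the service binds to an ephemeral port itself and only then reports the actual port, instead of probing for a free port up front. The idea, illustrated with Python sockets (a sketch of the pattern, not Flink's Netty code):

```python
import socket

def start_service():
    """Bind to port 0 so the OS assigns a free port atomically; there is
    no window in which another process can grab a 'predetermined' port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))           # 0 = let the kernel pick
    server.listen(1)
    bound_port = server.getsockname()[1]    # known only after the bind succeeded
    return server, bound_port

server, port = start_service()
# `port` is what would be sent to the JobManager/client after startup.
server.close()
```

The pre-probing approach (`getAvailablePort()` followed by a later bind) has exactly the window between probe and bind that this pattern eliminates.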



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-2409) Old JM web interface is sending cancel messages w/o leader ID

2015-07-27 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2409:


Assignee: Till Rohrmann

 Old JM web interface is sending cancel messages w/o leader ID
 -

 Key: FLINK-2409
 URL: https://issues.apache.org/jira/browse/FLINK-2409
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Till Rohrmann

 {code}
 12:29:41,877 ERROR akka.actor.OneForOneStrategy   
- Received a message CancelJob(4b3631741c344881362ea46e29980ce4) without a 
 leader session ID, even though it requires to have one.
 java.lang.Exception: Received a message 
 CancelJob(4b3631741c344881362ea46e29980ce4) without a leader session ID, even 
 though it requires to have one.
 at 
 org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:49)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 12:29:41,879 INFO  org.apache.flink.runtime.jobmanager.JobManager 
- Stopping JobManager akka://flink/user/jobmanager#-638215033.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-22 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636451#comment-14636451
 ] 

Till Rohrmann commented on FLINK-1901:
--

If you use the sampling operator this way, it works. However, usually your 
iteration data set is something like the weight vector of your model, and you 
have another training data set from which you want to take a small sample to 
update your weight vector in each iteration (e.g. SGD). When you write a 
program like that, you'll see that the output of the sampling operator is 
always the same (for every iteration). The reason is that the sampling is no 
longer on the dynamic path of the iteration and thus it is calculated only 
once and then cached. This is not the intended behaviour, though.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.
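The stated requirements, sampling with or without replacement, a relative sample size, and a seed for reproducibility, can be sketched as a small helper (a hypothetical Python illustration, not the eventual {{DataSet}} operator):

```python
import random

def sample(data, fraction, with_replacement=False, seed=None):
    """Take a random sample of roughly `fraction * len(data)` elements."""
    rng = random.Random(seed)              # fixed seed => reproducible sample
    k = max(1, int(len(data) * fraction))  # relative sample size
    if with_replacement:
        # Elements may be drawn multiple times.
        return [rng.choice(data) for _ in range(k)]
    # Each element is drawn at most once.
    return rng.sample(data, k)

s1 = sample(list(range(100)), 0.1, seed=42)
s2 = sample(list(range(100)), 0.1, seed=42)
assert s1 == s2   # same seed, same sample
```

A distributed implementation additionally has to decide how the fraction and the seed are split across parallel partitions, which is what makes the operator non-trivial.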



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635206#comment-14635206
 ] 

Till Rohrmann commented on FLINK-1901:
--

I think this solution is indeed a little bit too hacky. It would be very 
unintuitive for the user to have to broadcast the iteration {{DataSet}} to the 
sampling operator. Furthermore, this would incur unnecessary network I/O.

I think we should try to solve this problem properly. This means having a 
single sampling operator which works inside and outside of iterations. This 
will also avoid code duplication.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages

2015-07-23 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2332.

Resolution: Fixed

Added via 45428518d0e1b843947a6184b4a803a78ad5

 Assign session IDs to JobManager and TaskManager messages
 -

 Key: FLINK-2332
 URL: https://issues.apache.org/jira/browse/FLINK-2332
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 0.10


 In order to support true high availability, {{TaskManager}} and {{JobManager}} 
 have to be able to distinguish whether a message was sent by the current 
 leader or by a former leader. Messages which come from a former leader have to 
 be discarded in order to guarantee a consistent state.
 A way to achieve this is to assign a leader session ID to a {{JobManager}} 
 once it is elected leader. This leader session ID is sent to the 
 {{TaskManager}} upon registration at the {{JobManager}}. All subsequent 
 messages should then be decorated with this leader session ID to mark them as 
 valid. On the {{TaskManager}} side, the leader session ID received in response 
 to the registration message can then be used to validate incoming messages.
 The same holds true for registration messages, which should carry a 
 registration session ID, too. That way, it is possible to distinguish invalid 
 registration messages from valid ones. The registration session ID can be 
 assigned once the {{TaskManager}} is notified about the new leader.
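The decorate-and-validate scheme can be sketched in a few lines (Python stand-ins for the actor messages; all identifiers are illustrative, not Flink's classes):

```python
import uuid

class LeaderSessionFilter:
    """Discards messages that do not carry the current leader session ID."""
    def __init__(self):
        self.session_id = None

    def on_leader_elected(self):
        # A fresh ID per leadership term invalidates every message that a
        # former leader decorated with its (now stale) ID.
        self.session_id = uuid.uuid4()
        return self.session_id

    def decorate(self, session_id, message):
        # The sender wraps each message with its session ID.
        return (session_id, message)

    def accept(self, decorated):
        # The receiver unwraps and drops messages with a stale ID.
        session_id, message = decorated
        return message if session_id == self.session_id else None

f = LeaderSessionFilter()
old = f.on_leader_elected()
new = f.on_leader_elected()                              # leadership changed hands
assert f.accept(f.decorate(new, "cancel-job")) == "cancel-job"
assert f.accept(f.decorate(old, "cancel-job")) is None   # former leader ignored
```

In Flink the decoration happens transparently in the actor layer, so user-facing message types stay unchanged.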



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2392) Instable test in flink-yarn-tests

2015-07-23 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638985#comment-14638985
 ] 

Till Rohrmann commented on FLINK-2392:
--

The same happened to me: 
https://s3.amazonaws.com/archive.travis-ci.org/jobs/72303347/log.txt

 Instable test in flink-yarn-tests
 -

 Key: FLINK-2392
 URL: https://issues.apache.org/jira/browse/FLINK-2392
 Project: Flink
  Issue Type: Bug
  Components: Tests
Reporter: Matthias J. Sax
Priority: Minor

 The test YARNSessionFIFOITCase fails from time to time on an irregular basis. 
 For example see: https://travis-ci.org/apache/flink/jobs/72019690
 Tests run: 12, Failures: 1, Errors: 0, Skipped: 2, Time elapsed: 205.163 sec 
 <<< FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase
 perJobYarnClusterWithParallelism(org.apache.flink.yarn.YARNSessionFIFOITCase) 
  Time elapsed: 60.651 sec <<< FAILURE!
 java.lang.AssertionError: During the timeout period of 60 seconds the 
 expected string did not show up
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:478)
   at 
 org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnClusterWithParallelism(YARNSessionFIFOITCase.java:435)
 Results :
 Failed tests: 
   
 YARNSessionFIFOITCase.perJobYarnClusterWithParallelism:435->YarnTestBase.runWithArgs:478
  During the timeout period of 60 seconds the expected string did not show up



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-23 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638429#comment-14638429
 ] 

Till Rohrmann commented on FLINK-1901:
--

That's a good idea to break down the task. Do you want to take the lead 
[~chengxiang li]?

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-23 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638448#comment-14638448
 ] 

Till Rohrmann commented on FLINK-1901:
--

Currently, what happens to decide whether an operator is on a dynamic path is 
to look at the inputs of the operator: if they are dynamic, so is the current 
operator. The iteration {{DataSets}} ({{WorksetPlaceHolder}}, 
{{SolutionSetPlaceHolder}} and {{PartialSolutionPlaceHolder}}) are always 
dynamic. One idea would be to allow other operators to be declared dynamic as 
well, so that they can also start a dynamic path. Afterwards, we have to make 
sure that not only the iteration {{DataSets}} get an {{IterationHead}} 
prepended, which kicks off the iterations, but also all the other operators 
which start a dynamic path.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2395) Check Scala catch blocks for Throwable

2015-07-23 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2395:


 Summary: Check Scala catch blocks for Throwable
 Key: FLINK-2395
 URL: https://issues.apache.org/jira/browse/FLINK-2395
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Priority: Minor


As described in [1], it is not good practice to catch {{Throwable}} in Scala 
catch blocks because Scala uses some special exceptions for control flow. 
Therefore we should check whether we can restrict ourselves to catching only 
subtypes of {{Throwable}}, such as {{Exception}}, instead.

[1] 
https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1737) Add statistical whitening transformation to machine learning library

2015-07-13 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624402#comment-14624402
 ] 

Till Rohrmann commented on FLINK-1737:
--

Great to hear [~Daniel Pape]. I assigned the ticket to you :-)

 Add statistical whitening transformation to machine learning library
 

 Key: FLINK-1737
 URL: https://issues.apache.org/jira/browse/FLINK-1737
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Daniel Pape
  Labels: ML, Starter

 The statistical whitening transformation [1] is a preprocessing step for 
 different ML algorithms. It decorrelates the individual dimensions and sets 
 their variance to 1.
 Statistical whitening should be implemented as a {{Transformer}}.
 Resources:
 [1] [http://en.wikipedia.org/wiki/Whitening_transformation]
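To make the "unit variance" part concrete: in one dimension whitening reduces to standardization. The sketch below shows only that piece; full whitening additionally decorrelates the dimensions via an eigendecomposition of the covariance matrix, which is omitted here.

```scala
// 1-D sketch only: shift to zero mean and scale to unit variance.
// Full whitening also applies a decorrelating rotation (eigendecomposition
// of the covariance matrix), which this sketch omits.
def standardize(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  val std = math.sqrt(variance)
  xs.map(x => (x - mean) / std)
}
```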



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1737) Add statistical whitening transformation to machine learning library

2015-07-13 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1737:
-
Assignee: Daniel Pape

 Add statistical whitening transformation to machine learning library
 

 Key: FLINK-1737
 URL: https://issues.apache.org/jira/browse/FLINK-1737
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Daniel Pape
  Labels: ML, Starter

 The statistical whitening transformation [1] is a preprocessing step for 
 different ML algorithms. It decorrelates the individual dimensions and sets 
 their variance to 1.
 Statistical whitening should be implemented as a {{Transformer}}.
 Resources:
 [1] [http://en.wikipedia.org/wiki/Whitening_transformation]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1748) Integrate PageRank implementation into machine learning library

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1748:
-
Priority: Minor  (was: Major)

 Integrate PageRank implementation into machine learning library
 ---

 Key: FLINK-1748
 URL: https://issues.apache.org/jira/browse/FLINK-1748
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Priority: Minor
  Labels: ML, Starter

 We already have an excellent approximate PageRank [1] implementation which 
 has been contributed by [~StephanEwen]. Making this implementation part of 
 the machine learning library would be a great addition.
 Resources:
 [1] [http://en.wikipedia.org/wiki/PageRank]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629505#comment-14629505
 ] 

Till Rohrmann commented on FLINK-2366:
--

But why can you not do the same with ZK? If you start a DB process next to your 
Flink cluster, then you can also start a ZK process, right?

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629507#comment-14629507
 ] 

Till Rohrmann commented on FLINK-2366:
--

But in the end, you will execute your jobs on a Flink cluster, right? Or what 
do you do with Flink as a library?

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629565#comment-14629565
 ] 

Till Rohrmann commented on FLINK-2366:
--

What do you mean exactly by embedded and distributed?

If you use Flink's embedded mode, then this would mean that every node works 
independently of the others. There is no possibility to make the embedded 
Flink instances work together.

If you want to run Flink in a distributed manner, then you have to start a 
Flink cluster. And then you can also start a ZK on the same nodes, if there is 
none available. But usually, you want to run these kinds of things on highly 
reliable nodes and not in YARN containers, for example.

At first glance, copycat seems to be usable for HA as well. It offers 
functionality similar to what we're currently using from ZK. It should not be a 
problem to implement a new {{LeaderElectionService}}/{{LeaderRetrievalService}} 
which uses copycat. However, copycat is still under heavy development and not 
recommended for production use. And you would have the problem that your 
fault-tolerant value store runs on the same nodes as Flink, which are not 
necessarily reliable.

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Improvement
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-1748) Integrate PageRank implementation into machine learning library

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-1748.

Resolution: Done

As part of Gelly.

 Integrate PageRank implementation into machine learning library
 ---

 Key: FLINK-1748
 URL: https://issues.apache.org/jira/browse/FLINK-1748
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Priority: Minor
  Labels: ML, Starter

 We already have an excellent approximate PageRank [1] implementation which 
 has been contributed by [~StephanEwen]. Making this implementation part of 
 the machine learning library would be a great addition.
 Resources:
 [1] [http://en.wikipedia.org/wiki/PageRank]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2366) HA Without ZooKeeper

2015-07-16 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2366:
-
Priority: Minor  (was: Blocker)

 HA Without ZooKeeper
 

 Key: FLINK-2366
 URL: https://issues.apache.org/jira/browse/FLINK-2366
 Project: Flink
  Issue Type: Bug
Reporter: Suminda Dharmasena
Priority: Minor

 Please provide a way to do HA without having to use ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634664#comment-14634664
 ] 

Till Rohrmann commented on FLINK-1901:
--

Hi Chengxiang,

good to hear that you want to work on this. I can assign you the ticket. 
However, it is not only about the sampling strategy but also about the 
integration within Flink. The reason is that we have to make sure that the 
sampling operator also works within iterations. This means that it has to be 
part of the dynamic path so that it is triggered again for every iteration. 
This will need a special operator type.

But you can start with the sampling strategies and then continue with the 
iteration integration.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634800#comment-14634800
 ] 

Till Rohrmann commented on FLINK-1901:
--

Oh I forgot. Sorry.

What about the iteration support?

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634826#comment-14634826
 ] 

Till Rohrmann commented on FLINK-1901:
--

To be honest, I doubt that the sampling is executed repeatedly if it's not the 
iteration data set from which you're sampling. If you use map and reduce 
operations which lie on the static path, then they will be executed once and 
their results cached. But it's best if you check the samples.

If it is possible to create a separate PR out of it, that would be great. It 
makes reviewing much easier.

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1901:
-
Assignee: Chengxiang Li

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1901) Create sample operator for Dataset

2015-07-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635096#comment-14635096
 ] 

Till Rohrmann commented on FLINK-1901:
--

The problem is that a sampling operator should also work within iterations. 
There is definitely a big need for this, e.g. for stochastic gradient descent.

I don't really understand what you mean by your question [~sachingoel0101].

 Create sample operator for Dataset
 --

 Key: FLINK-1901
 URL: https://issues.apache.org/jira/browse/FLINK-1901
 Project: Flink
  Issue Type: Improvement
  Components: Core
Reporter: Theodore Vasiloudis
Assignee: Chengxiang Li

 In order to be able to implement Stochastic Gradient Descent and a number of 
 other machine learning algorithms we need to have a way to take a random 
 sample from a Dataset.
 We need to be able to sample with or without replacement from the Dataset, 
 choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2189) NullPointerException in MutableHashTable

2015-08-25 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711620#comment-14711620
 ] 

Till Rohrmann commented on FLINK-2189:
--

[~JonathanH5] encountered this problem recently.

 NullPointerException in MutableHashTable
 

 Key: FLINK-2189
 URL: https://issues.apache.org/jira/browse/FLINK-2189
 Project: Flink
  Issue Type: Bug
  Components: Core
Reporter: Till Rohrmann

 [~Felix Neutatz] reported a {{NullPointerException}} in the 
 {{MutableHashTable}} when running the {{ALS}} algorithm. The stack trace is 
 the following:
 {code}
 Caused by: java.lang.NullPointerException
   at 
 org.apache.flink.runtime.operators.hash.HashPartition.spillPartition(HashPartition.java:310)
   at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.spillPartition(MutableHashTable.java:1094)
   at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.insertBucketEntry(MutableHashTable.java:927)
   at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:783)
   at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
   at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
   at 
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
   at 
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
   at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
   at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 He produced this error on his local machine with the following code:
 {code}
 implicit val env = ExecutionEnvironment.getExecutionEnvironment
 val links = MovieLensUtils.readLinks(movieLensDir + "links.csv")
 val movies = MovieLensUtils.readMovies(movieLensDir + "movies.csv")
 val ratings = MovieLensUtils.readRatings(movieLensDir + "ratings.csv")
 val tags = MovieLensUtils.readTags(movieLensDir + "tags.csv")
   
 val ratingMatrix = ratings.map { r => (r.userId.toInt, r.movieId.toInt, 
 r.rating) }
 val testMatrix = ratings.map { r => (r.userId.toInt, r.movieId.toInt) }
 val als = ALS()
.setIterations(10)
.setNumFactors(10)
.setBlocks(150) 
  
 als.fit(ratingMatrix)
 val result = als.predict(testMatrix)
  
 result.print
 val risk = als.empiricalRisk(ratingMatrix).collect().apply(0)
 println("Empirical risk: " + risk) 
 env.execute()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1195) Improvement of benchmarking infrastructure

2015-08-25 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711604#comment-14711604
 ] 

Till Rohrmann commented on FLINK-1195:
--

I cannot really tell to what extent this PR is subsumed by [~mxm]'s testing 
infrastructure. But if that's the case, then this issue can be closed.

 Improvement of benchmarking infrastructure
 --

 Key: FLINK-1195
 URL: https://issues.apache.org/jira/browse/FLINK-1195
 Project: Flink
  Issue Type: Wish
Reporter: Till Rohrmann
Assignee: Alexander Alexandrov

 I noticed while running my ALS benchmarks that we still have some potential 
 to improve our benchmarking infrastructure. The current state is that we 
 execute the benchmark jobs by writing a script with a single set of 
 parameters. The runtime is then manually retrieved from the web interface of 
 Flink and Spark, respectively.
 I think we need the following extensions:
 * Automatic runtime retrieval and storage in a file
 * Repeated execution of jobs to gather some advanced statistics such as 
 mean and standard deviation of the runtimes
 * Support for value sets for the individual parameters
 The automatic runtime retrieval would allow us to execute several benchmarks 
 consecutively without having to look up the runtimes in the logs or in the web 
 interface, which btw only stores the runtimes of the last 5 jobs.
 What I mean by value sets is that it would be nice to specify a set of 
 parameter values for which the benchmark is run without having to write a 
 benchmark script for every single parameter combination. I believe that this 
 feature would become very handy when we want to look at the runtime behaviour 
 of Flink for different input sizes or degrees of parallelism, for example. To 
 illustrate what I mean:
 {code}
 INPUTSIZE = 1000, 2000, 4000, 8000
 DOP = 1, 2, 4, 8
 OUTPUT=benchmarkResults
 repetitions=10
 command=benchmark.jar -p $DOP $INPUTSIZE 
 {code} 
 Something like that would execute the benchmark job with (DOP=1, 
 INPUTSIZE=1000), (DOP=2, INPUTSIZE=2000), 10 times each, calculate for 
 each parameter combination runtime statistics and store the results in the 
 file benchmarkResults.
 I believe that spending some effort now will pay off in the long run because 
 we will benchmark Flink continuously. What do you guys think?
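The expansion of the value sets above can be sketched in a few lines of Scala. This is a hypothetical reading of the proposed config (names mirror the example); it assumes the full cross product of all parameter values is wanted, each combination then run `repetitions` times:

```scala
// Hypothetical expansion of the proposed config: all combinations of the
// parameter value sets, each to be run `repetitions` times.
val inputSizes = Seq(1000, 2000, 4000, 8000)
val dops = Seq(1, 2, 4, 8)
val repetitions = 10

// one command line per (DOP, INPUTSIZE) combination
val runs = for {
  dop  <- dops
  size <- inputSizes
} yield s"benchmark.jar -p $dop $size"
```

Each resulting command would be executed `repetitions` times and its runtimes aggregated into mean and standard deviation before being written to the output file.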



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-2878) JobManager warns: Unexpected leader address pattern

2015-10-21 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2878.
--
Resolution: Fixed

Fixed via 3cad56d28d55025281873f53a28ec27ce1027992

> JobManager warns: Unexpected leader address pattern
> ---
>
> Key: FLINK-2878
> URL: https://issues.apache.org/jira/browse/FLINK-2878
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 0.10
>Reporter: Maximilian Michels
>Assignee: Till Rohrmann
>Priority: Minor
> Fix For: 0.10
>
>
> The JobManager log shows this multiple times when viewing the log through the 
> web frontend:
> {noformat}
> 16:58:37,201 WARN  
> org.apache.flink.runtime.webmonitor.handlers.HandlerRedirectUtils  - 
> Unexpected leader address pattern akka://flink/user/jobmanager. Cannot 
> extract host.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2921) Add online documentation of sample methods

2015-10-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2921:


 Summary: Add online documentation of sample methods
 Key: FLINK-2921
 URL: https://issues.apache.org/jira/browse/FLINK-2921
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.10
Reporter: Till Rohrmann
Priority: Minor


I couldn't find online documentation about Flink's sampling API (as part of the 
{{DataSetUtils}}/{{utils}} package object). We should add information about these 
methods to our online documentation so that people can use them more easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2894) Flink does not allow to specify default serializer for Kryo

2015-10-22 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2894:


 Summary: Flink does not allow to specify default serializer for 
Kryo
 Key: FLINK-2894
 URL: https://issues.apache.org/jira/browse/FLINK-2894
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.10
Reporter: Till Rohrmann


Currently, Flink only supports specifying a Kryo {{Serializer}} for specific 
types but not a default serializer for classes. A default serializer is used for 
the registered class and all its subclasses. That way one does not have to 
specify the serializer for each type individually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2894) Flink does not allow to specify default serializer for Kryo

2015-10-22 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2894.

Resolution: Not A Problem

> Flink does not allow to specify default serializer for Kryo
> ---
>
> Key: FLINK-2894
> URL: https://issues.apache.org/jira/browse/FLINK-2894
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.10
>Reporter: Till Rohrmann
>
> Currently, Flink only supports specifying a Kryo {{Serializer}} for specific 
> types but not a default serializer for classes. A default serializer is used 
> for the registered class and all its subclasses. That way one does not have 
> to specify the serializer for each type individually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-2800) kryo serialization problem

2015-10-22 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2800:


Assignee: Till Rohrmann

> kryo serialization problem
> --
>
> Key: FLINK-2800
> URL: https://issues.apache.org/jira/browse/FLINK-2800
> Project: Flink
>  Issue Type: Bug
>  Components: Type Serialization System
>Affects Versions: 0.10
> Environment: linux ubuntu 12.04 LTS, Java 7
>Reporter: Stefano Bortoli
>Assignee: Till Rohrmann
>
> Performing a cross of two dataset of POJOs I have got the exception below. 
> The first time I run the process, there was no problem. When I run it the 
> second time, I have got the exception. My guess is that it could be a race 
> condition related to the reuse of the Kryo serializer object. However, it 
> could also be "a bug where type registrations are not properly forwarded to 
> all Serializers", as suggested by Stephan.
> 
> 2015-10-01 18:18:21 INFO  JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at 
> main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 114
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
>   at 
> org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180)
>   at 
> org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3036) Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph

2015-11-17 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-3036:


 Summary: Gelly's Graph.fromCsvReader method returns wrongly 
parameterized Graph
 Key: FLINK-3036
 URL: https://issues.apache.org/jira/browse/FLINK-3036
 Project: Flink
  Issue Type: Bug
  Components: Gelly
Affects Versions: 0.10.0
Reporter: Till Rohrmann


The Scala method {{Graph.fromCsvReader}} of Gelly returns a wrongly typed 
{{Graph}} instance. The problem is that no return type has been explicitly 
defined for the method. Additionally, the method returns fundamentally 
incompatible types depending on the given parameters. For example, the method 
can return a {{Graph[Long, Long, Long]}} if a vertex and an edge file are 
specified (in this case with value type {{Long}}). If neither a vertex file 
nor a vertex value initializer is specified, then the return type is 
{{Graph[Long, NullValue, Long]}}. Since {{NullValue}} and {{Long}} have nothing 
in common, Scala's type inference infers that the {{fromCsvReader}} method must 
have the return type {{Graph[Long, t >: Long with NullValue, Long]}} with {{t}} 
being a supertype of {{Long with NullValue}}. This type is not useful at all, 
since there is no such type. As a consequence, the user has to cast the 
resulting {{Graph}} to either {{Graph[Long, NullValue, Long]}} or 
{{Graph[Long, Long, Long]}} if they want to do something more elaborate than 
just collecting the edges, for example. 

This can be especially confusing because one usually writes something like

{code}
val graph = Graph.fromCsvReader[Long, Double, Double](...)
graph.run(new PageRank(...))
{code}

and does not see that the type of {{graph}} is {{Graph[Long, t >: Double with 
NullValue, u >: Double with NullValue]}}.
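The inference behaviour can be reproduced without any Flink dependency; in this standalone sketch {{NullValue}} and {{Graph}} are simplified stand-ins for the Gelly types:

```scala
// Stand-ins for the Gelly/Flink types, to illustrate the inference problem.
class NullValue
case class Graph[K, VV, EV](vertices: Seq[(K, VV)])

// No explicit return type: the compiler infers the least upper bound of the
// two branches, an unusable type along the lines of
// Graph[Long, _ >: Long with NullValue, Long].
def fromCsv(hasVertexValues: Boolean) =
  if (hasVertexValues) Graph[Long, Long, Long](Seq(1L -> 1L))
  else Graph[Long, NullValue, Long](Seq(1L -> new NullValue))

// The fix is to declare the intended return type explicitly:
def fromCsvTyped(): Graph[Long, NullValue, Long] =
  Graph[Long, NullValue, Long](Seq(1L -> new NullValue))
```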



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-3036) Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph

2015-11-17 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-3036:


Assignee: Till Rohrmann

> Gelly's Graph.fromCsvReader method returns wrongly parameterized Graph
> --
>
> Key: FLINK-3036
> URL: https://issues.apache.org/jira/browse/FLINK-3036
> Project: Flink
>  Issue Type: Bug
>  Components: Gelly
>Affects Versions: 0.10.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>
> The Scala method {{Graph.fromCsvReader}} of Gelly returns a wrongly typed 
> {{Graph}} instance. The problem is that no return type has been explicitly 
> defined for the method. Additionally, the method returns fundamentally 
> incompatible types depending on the given parameters. For example, the 
> method can return a {{Graph[Long, Long, Long]}} if a vertex and an edge file 
> are specified (in this case with value type {{Long}}). If neither a vertex 
> file nor a vertex value initializer is specified, then the return type is 
> {{Graph[Long, NullValue, Long]}}. Since {{NullValue}} and {{Long}} have 
> nothing in common, Scala's type inference infers that the {{fromCsvReader}} 
> method must have the return type {{Graph[Long, t >: Long with NullValue, 
> Long]}} with {{t}} being a supertype of {{Long with NullValue}}. This type is 
> not useful at all, since there is no such type. As a consequence, the user 
> has to cast the resulting {{Graph}} to either {{Graph[Long, NullValue, 
> Long]}} or {{Graph[Long, Long, Long]}} if they want to do something more 
> elaborate than just collecting the edges, for example. 
> This can be especially confusing because one usually writes something like
> {code}
> val graph = Graph.fromCsvReader[Long, Double, Double](...)
> graph.run(new PageRank(...))
> {code}
> and does not see that the type of {{graph}} is {{Graph[Long, t >: Double with 
> NullValue, u >: Double with NullValue]}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-2979) RollingSink does not work with Hadoop 2.7.1

2015-11-05 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991725#comment-14991725
 ] 

Till Rohrmann edited comment on FLINK-2979 at 11/5/15 2:29 PM:
---

The failure might be caused by 

{code}
java.lang.Exception: Could not restore checkpointed state to operators and 
functions
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:414)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Failed to restore state to function: Could not 
invoke truncate.
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:406)
... 3 more
Caused by: java.lang.RuntimeException: Could not invoke truncate.
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:695)
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:120)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
... 4 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:678)
... 6 more
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
 Failed to TRUNCATE_FILE /string-non-rolling-out/part-2-2 for 
DFSClient_NONMAPREDUCE_-401178409_229 on 127.0.0.1 because 
DFSClient_NONMAPREDUCE_-401178409_229 is already the current lease holder.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInternal(FSNamesystem.java:2082)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInt(FSNamesystem.java:2028)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncate(FSNamesystem.java:1998)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.truncate(NameNodeRpcServer.java:926)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.truncate(ClientNamenodeProtocolServerSideTranslatorPB.java:599)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy23.truncate(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.truncate(ClientNamenodeProtocolTranslatorPB.java:313)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy24.truncate(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.truncate(DFSClient.java:2024)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:689)
at 

[jira] [Commented] (FLINK-2979) RollingSink does not work with Hadoop 2.7.1

2015-11-05 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991725#comment-14991725
 ] 

Till Rohrmann commented on FLINK-2979:
--

The failure might be caused by the following exception:

{code}
java.lang.Exception: Could not restore checkpointed state to operators and 
functions
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:414)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Failed to restore state to function: Could not 
invoke truncate.
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:406)
... 3 more
Caused by: java.lang.RuntimeException: Could not invoke truncate.
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:695)
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:120)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:162)
... 4 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:678)
... 6 more
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
 Failed to TRUNCATE_FILE /string-non-rolling-out/part-2-2 for 
DFSClient_NONMAPREDUCE_-401178409_229 on 127.0.0.1 because 
DFSClient_NONMAPREDUCE_-401178409_229 is already the current lease holder.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2885)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInternal(FSNamesystem.java:2082)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncateInt(FSNamesystem.java:2028)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.truncate(FSNamesystem.java:1998)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.truncate(NameNodeRpcServer.java:926)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.truncate(ClientNamenodeProtocolServerSideTranslatorPB.java:599)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy23.truncate(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.truncate(ClientNamenodeProtocolTranslatorPB.java:313)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy24.truncate(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.truncate(DFSClient.java:2024)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:689)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$13.doCall(DistributedFileSystem.java:685)
at 

[jira] [Created] (FLINK-2964) MutableHashTable fails when spilling partitions without overflow segments

2015-11-03 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2964:


 Summary: MutableHashTable fails when spilling partitions without 
overflow segments
 Key: FLINK-2964
 URL: https://issues.apache.org/jira/browse/FLINK-2964
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.10
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Priority: Critical


When a join operation processes many large records, it fails with the following 
exception when it tries to spill a {{HashPartition}}.

{code}
java.lang.RuntimeException: Bug in Hybrid Hash Join: Request to spill a 
partition with less than two buffers.
at 
org.apache.flink.runtime.operators.hash.HashPartition.spillPartition(HashPartition.java:302)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.spillPartition(MutableHashTable.java:1108)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.nextSegment(MutableHashTable.java:1277)
at 
org.apache.flink.runtime.operators.hash.HashPartition$BuildSideBuffer.nextSegment(HashPartition.java:524)
at 
org.apache.flink.runtime.memory.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
at 
org.apache.flink.runtime.memory.AbstractPagedOutputView.write(AbstractPagedOutputView.java:201)
at 
org.apache.flink.runtime.memory.AbstractPagedOutputView.write(AbstractPagedOutputView.java:178)
at 
org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.serialize(BytePrimitiveArraySerializer.java:74)
at 
org.apache.flink.api.common.typeutils.base.array.BytePrimitiveArraySerializer.serialize(BytePrimitiveArraySerializer.java:30)
at 
org.apache.flink.runtime.operators.hash.HashPartition.insertIntoBuildBuffer(HashPartition.java:257)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:856)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.buildInitialTable(MutableHashTable.java:685)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.open(MutableHashTable.java:443)
at 
org.apache.flink.runtime.operators.hash.HashTableTest.testSpillingWhenBuildingTableWithoutOverflow(HashTableTest.java:234)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:78)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:212)
at 
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
{code}

The reason is that the {{HashPartition}} does not include the memory segments 
used by the {{BuildSideBuffer}} when it counts the currently occupied memory 
segments.
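The accounting gap can be illustrated with a minimal sketch. The names below are hypothetical stand-ins, not the actual Flink classes: a partition tracks some segments directly, while its build-side buffer holds further segments. If the count ignores the buffer's segments, the spill precondition ("at least two buffers") can fail even though enough memory is actually in use.

```java
/**
 * Toy model of the segment-accounting bug (hypothetical names, not the
 * real HashPartition/BuildSideBuffer classes).
 */
public class SegmentCountSketch {

    static final int IN_PARTITION = 1;    // segments counted directly by the partition
    static final int IN_BUILD_BUFFER = 2; // segments held by the build-side buffer

    /** Buggy count: ignores the build-side buffer's segments. */
    static int buggyCount() {
        return IN_PARTITION;
    }

    /** Fixed count: also includes the build-side buffer's segments. */
    static int fixedCount() {
        return IN_PARTITION + IN_BUILD_BUFFER;
    }

    public static void main(String[] args) {
        // With the buggy count, a partition that actually occupies three
        // segments appears to own only one, tripping the "less than two
        // buffers" check when spilling.
        System.out.println("buggy=" + buggyCount() + ", fixed=" + fixedCount());
    }
}
```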



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts

2015-11-02 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985022#comment-14985022
 ] 

Till Rohrmann commented on FLINK-2929:
--

Sure, but to me it would make more sense to add the option for the less likely 
case, and that's probably the upgrade case.

> Recovery of jobs on cluster restarts
> 
>
> Key: FLINK-2929
> URL: https://issues.apache.org/jira/browse/FLINK-2929
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>
> Recovery information is stored in ZooKeeper under a static root like 
> {{/flink}}. In case of a cluster restart without canceling running jobs old 
> jobs will be recovered from ZooKeeper.
> This can be confusing or helpful depending on the use case.
> I suspect that the confusing case will be more common.
> We can change the default cluster start up (e.g. new YARN session or new 
> ./start-cluster call) to purge all existing data in ZooKeeper and add a flag 
> to not do this if needed.
> [~trohrm...@apache.org], [~aljoscha], [~StephanEwen] what's your opinion?





[jira] [Commented] (FLINK-2959) Remove number of execution retries configuration from Environment

2015-11-04 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989289#comment-14989289
 ] 

Till Rohrmann commented on FLINK-2959:
--

I agree that we have to come up with a consistent way of deciding which methods 
belong in the {{ExecutionConfig}} and which in the {{ExecutionEnvironment}}. But 
for {{0.10}} it should be fine.
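The consolidation under discussion can be sketched as follows. This is a toy stand-in for the proposed API shape, not the real Flink classes; the method and class names are assumptions for illustration only:

```java
/**
 * Toy stand-in for the proposal: execution retries live on the execution
 * config, next to the other execution settings, instead of on the
 * environment (hypothetical class, not Flink's ExecutionConfig).
 */
public class ExecutionConfigSketch {

    // -1 conventionally means "fall back to the cluster default"
    private int numberOfExecutionRetries = -1;

    public ExecutionConfigSketch setNumberOfExecutionRetries(int retries) {
        if (retries < -1) {
            throw new IllegalArgumentException("retries must be >= -1");
        }
        this.numberOfExecutionRetries = retries;
        return this; // fluent style, as in the real ExecutionConfig setters
    }

    public int getNumberOfExecutionRetries() {
        return numberOfExecutionRetries;
    }

    public static void main(String[] args) {
        // Under the proposal, env.getConfig().setNumberOfExecutionRetries(3)
        // would replace env.setNumberOfExecutionRetries(3).
        ExecutionConfigSketch config =
                new ExecutionConfigSketch().setNumberOfExecutionRetries(3);
        System.out.println(config.getNumberOfExecutionRetries());
    }
}
```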

> Remove number of execution retries configuration from Environment
> -
>
> Key: FLINK-2959
> URL: https://issues.apache.org/jira/browse/FLINK-2959
> Project: Flink
>  Issue Type: Improvement
>  Components: Java API, Scala API
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>Priority: Minor
>  Labels: api-breaking
> Fix For: 0.10
>
>
> The number of execution retries is configured via the Environment, but all 
> other execution configuration happens exclusively via ExecutionConfig.
> I think it will be more consistent to have it in ExecutionConfig only.
> This will be an API breaking change.
> What's your opinion? Should we move it to the ExecutionConfig?





[jira] [Resolved] (FLINK-2763) Bug in Hybrid Hash Join: Request to spill a partition with less than two buffers.

2015-11-04 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2763.
--
Resolution: Fixed

I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check 
whether this fix solves your problem?

> Bug in Hybrid Hash Join: Request to spill a partition with less than two 
> buffers.
> -
>
> Key: FLINK-2763
> URL: https://issues.apache.org/jira/browse/FLINK-2763
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.10
>Reporter: Greg Hogan
>Assignee: Stephan Ewen
> Fix For: 0.10
>
>
> The following exception is thrown when running the example triangle listing 
> with an unmodified master build (4cadc3d6).
> {noformat}
> ./bin/flink run 
> ~/flink-examples/flink-java-examples/target/flink-java-examples-0.10-SNAPSHOT-EnumTrianglesOpt.jar
>  ~/rmat/undirected/s19_e8.ssv output
> {noformat}
> The only changes to {{flink-conf.yaml}} are {{taskmanager.numberOfTaskSlots: 
> 8}} and {{parallelism.default: 8}}.
> I have confirmed with input files 
> [s19_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxR2lnMHR4amdyTnM/view?usp=sharing]
>  (40 MB) and 
> [s20_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxNi1HbmptU29MTm8/view?usp=sharing]
>  (83 MB). On a second machine only the larger file caused the exception.
> {noformat}
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Job execution failed.
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:407)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:386)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:353)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:64)
>   at 
> org.apache.flink.examples.java.graph.EnumTrianglesOpt.main(EnumTrianglesOpt.java:125)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:434)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:350)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:290)
>   at 
> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:675)
>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:324)
>   at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977)
>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1027)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:425)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> 

[jira] [Comment Edited] (FLINK-2763) Bug in Hybrid Hash Join: Request to spill a partition with less than two buffers.

2015-11-04 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989297#comment-14989297
 ] 

Till Rohrmann edited comment on FLINK-2763 at 11/4/15 10:43 AM:


I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check 
whether the fix 76bebd4236cd9cff19e6442e9ab3d6113665924a solves your problem?


was (Author: till.rohrmann):
I think the problem was solved by FLINK-2964. [~f.pompermaier] could you check 
whether this fix solves your problem?

> Bug in Hybrid Hash Join: Request to spill a partition with less than two 
> buffers.
> -
>
> Key: FLINK-2763
> URL: https://issues.apache.org/jira/browse/FLINK-2763
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.10
>Reporter: Greg Hogan
>Assignee: Stephan Ewen
> Fix For: 0.10
>
>
> The following exception is thrown when running the example triangle listing 
> with an unmodified master build (4cadc3d6).
> {noformat}
> ./bin/flink run 
> ~/flink-examples/flink-java-examples/target/flink-java-examples-0.10-SNAPSHOT-EnumTrianglesOpt.jar
>  ~/rmat/undirected/s19_e8.ssv output
> {noformat}
> The only changes to {{flink-conf.yaml}} are {{taskmanager.numberOfTaskSlots: 
> 8}} and {{parallelism.default: 8}}.
> I have confirmed with input files 
> [s19_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxR2lnMHR4amdyTnM/view?usp=sharing]
>  (40 MB) and 
> [s20_e8.ssv|https://drive.google.com/file/d/0B6TrSsnHj2HxNi1HbmptU29MTm8/view?usp=sharing]
>  (83 MB). On a second machine only the larger file caused the exception.
> {noformat}
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Job execution failed.
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:407)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:386)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:353)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:64)
>   at 
> org.apache.flink.examples.java.graph.EnumTrianglesOpt.main(EnumTrianglesOpt.java:125)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:434)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:350)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:290)
>   at 
> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:675)
>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:324)
>   at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977)
>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1027)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:425)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at 

[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts

2015-11-03 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987374#comment-14987374
 ] 

Till Rohrmann commented on FLINK-2929:
--

We could auto-generate a random ZNode path for each cluster start. On a clean 
shutdown this path would be removed unless it is explicitly set to be kept. When 
starting a new cluster, we could then add an option to start from a specific 
ZNode path in order to recover old jobs, e.g. in case of an upgrade. However, 
this has the disadvantage that the user would be responsible for cleaning up 
the state data when it is no longer needed.
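The proposal can be sketched without any ZooKeeper dependency. The class, method names, and path layout below are all assumptions for illustration, not Flink or ZooKeeper API:

```java
import java.util.UUID;

/**
 * Sketch of the per-cluster recovery-root idea (hypothetical names):
 * each cluster start gets a fresh random root under /flink, and a clean
 * shutdown purges it unless the user asked to keep it for later recovery.
 */
public class RecoveryPathSketch {

    /** A fresh ZNode root for a new cluster, e.g. /flink/recovery-<uuid>. */
    static String newRecoveryRoot() {
        return "/flink/recovery-" + UUID.randomUUID();
    }

    /** On clean shutdown: purge unless explicitly kept for later recovery. */
    static boolean shouldPurgeOnShutdown(boolean keepForRecovery) {
        return !keepForRecovery;
    }

    public static void main(String[] args) {
        String root = newRecoveryRoot();
        System.out.println("cluster recovery root: " + root);
        // A later cluster start passed this exact root would recover the old
        // jobs; a start without one gets a fresh, empty root instead.
        System.out.println("purge on shutdown: " + shouldPurgeOnShutdown(false));
    }
}
```

The trade-off mentioned above shows up here: if the path is kept, nothing ever deletes it automatically, so cleanup becomes the user's job.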

> Recovery of jobs on cluster restarts
> 
>
> Key: FLINK-2929
> URL: https://issues.apache.org/jira/browse/FLINK-2929
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>
> Recovery information is stored in ZooKeeper under a static root like 
> {{/flink}}. In case of a cluster restart without canceling running jobs old 
> jobs will be recovered from ZooKeeper.
> This can be confusing or helpful depending on the use case.
> I suspect that the confusing case will be more common.
> We can change the default cluster start up (e.g. new YARN session or new 
> ./start-cluster call) to purge all existing data in ZooKeeper and add a flag 
> to not do this if needed.
> [~trohrm...@apache.org], [~aljoscha], [~StephanEwen] what's your opinion?





[jira] [Created] (FLINK-2852) Fix flaky ScalaShellITSuite

2015-10-14 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2852:


 Summary: Fix flaky ScalaShellITSuite
 Key: FLINK-2852
 URL: https://issues.apache.org/jira/browse/FLINK-2852
 Project: Flink
  Issue Type: Bug
  Components: Scala Shell
Affects Versions: 0.10
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Priority: Critical
 Fix For: 0.10


The {{ScalaShellITSuite}} checks the log output to determine whether a job has 
completed successfully. For that, it looks for the string {{Job execution 
switched to status FINISHED}} in the log output. However, if the 
{{FlinkClientActor}} receives a {{JobResultSuccess}} message before the 
{{JobStatusChanged}} message, it sends the execution result back to the 
{{Client}} and terminates itself. As a consequence, the output will never 
contain the above-mentioned string.

I propose to use a different means to check whether a job has finished 
successfully.
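The race can be modeled deterministically. This is a toy model of the message ordering described above, not the real actor code; all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the log-scraping race: if the result message is handled
 * before the status-change message, the actor terminates first and the
 * "FINISHED" log line is never written, so log scraping reports failure
 * even though the job succeeded.
 */
public class LogScrapeRaceSketch {

    static List<String> runClientActor(boolean resultArrivesFirst) {
        List<String> log = new ArrayList<>();
        boolean terminated = false;
        String[] messages = resultArrivesFirst
                ? new String[]{"JobResultSuccess", "JobStatusChanged"}
                : new String[]{"JobStatusChanged", "JobResultSuccess"};
        for (String msg : messages) {
            if (terminated) {
                break; // actor already shut down; message is never processed
            }
            if (msg.equals("JobStatusChanged")) {
                log.add("Job execution switched to status FINISHED.");
            } else {
                terminated = true; // result forwarded to the client, actor stops
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runClientActor(false)); // status line present
        System.out.println(runClientActor(true));  // status line missing
    }
}
```

Checking the execution result returned to the client, rather than scraping the log, is immune to this ordering.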





[jira] [Commented] (FLINK-2735) KafkaProducerITCase.testCustomPartitioning sporadically fails

2015-10-15 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958775#comment-14958775
 ] 

Till Rohrmann commented on FLINK-2735:
--

Here is another instance of the problem: 
https://s3.amazonaws.com/archive.travis-ci.org/jobs/85472091/log.txt

> KafkaProducerITCase.testCustomPartitioning sporadically fails
> -
>
> Key: FLINK-2735
> URL: https://issues.apache.org/jira/browse/FLINK-2735
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 0.10
>Reporter: Robert Metzger
>  Labels: test-stability
>
> In the following test run: 
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/8158/log.txt
> there was the following failure
> {code}
> Caused by: java.lang.Exception: Unable to get last offset for topic 
> customPartitioningTestTopic and partitions [FetchPartition {partition=2, 
> offset=-915623761776}]. 
> Exception for partition 2: kafka.common.UnknownException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at java.lang.Class.newInstance(Class.java:438)
>   at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
>   at kafka.common.ErrorMapping.exceptionFor(ErrorMapping.scala)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:521)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:242)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.run(FlinkKafkaConsumer.java:382)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:168)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Unable to get last offset for topic 
> customPartitioningTestTopic and partitions [FetchPartition {partition=2, 
> offset=-915623761776}]. 
> Exception for partition 2: kafka.common.UnknownException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>   at java.lang.Class.newInstance(Class.java:438)
>   at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
>   at kafka.common.ErrorMapping.exceptionFor(ErrorMapping.scala)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:521)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.getLastOffset(LegacyFetcher.java:524)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:370)
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.455 sec 
> <<< FAILURE! - in 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase
> testCustomPartitioning(org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase)
>   Time elapsed: 7.809 sec  <<< FAILURE!
> java.lang.AssertionError: Test failed: The program execution failed: Job 
> execution failed.
>   at org.junit.Assert.fail(Assert.java:88)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaTestBase.tryExecute(KafkaTestBase.java:313)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerITCase.testCustomPartitioning(KafkaProducerITCase.java:155)
> {code}
> From the broker logs it seems to be an issue in the Kafka broker
> {code}
> 14:43:03,328 INFO  kafka.network.Processor
>- Closing socket connection to /127.0.0.1.
> 14:43:03,334 WARN  kafka.server.KafkaApis 
>- [KafkaApi-0] 

[jira] [Closed] (FLINK-2770) KafkaITCase.testConcurrentProducerConsumerTopology fails

2015-10-15 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2770.

Resolution: Duplicate

> KafkaITCase.testConcurrentProducerConsumerTopology fails
> 
>
> Key: FLINK-2770
> URL: https://issues.apache.org/jira/browse/FLINK-2770
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.10
>Reporter: Matthias J. Sax
>Priority: Critical
> Fix For: 0.10
>
>
> https://travis-ci.org/mjsax/flink/jobs/82308003
> {noformat}
> Running org.apache.flink.streaming.connectors.kafka.KafkaITCase
> 09/26/2015 17:52:50   Job execution switched to status RUNNING.
> 09/26/2015 17:52:50   Source: Custom Source -> Sink: Unnamed(1/1) switched to 
> SCHEDULED 
> 09/26/2015 17:52:50   Source: Custom Source -> Sink: Unnamed(1/1) switched to 
> DEPLOYING 
> 09/26/2015 17:52:50   Source: Custom Source -> Sink: Unnamed(1/1) switched to 
> RUNNING 
> 09/26/2015 17:52:50   Source: Custom Source -> Sink: Unnamed(1/1) switched to 
> FINISHED 
> 09/26/2015 17:52:50   Job execution switched to status FINISHED.
> 09/26/2015 17:52:50   Job execution switched to status RUNNING.
> 09/26/2015 17:52:50   Source: Custom Source -> Map -> Flat Map(1/1) switched 
> to SCHEDULED 
> 09/26/2015 17:52:50   Source: Custom Source -> Map -> Flat Map(1/1) switched 
> to DEPLOYING 
> 09/26/2015 17:52:50   Source: Custom Source -> Map -> Flat Map(1/1) switched 
> to RUNNING 
> 09/26/2015 17:52:51   Source: Custom Source -> Map -> Flat Map(1/1) switched 
> to FAILED 
> java.lang.Exception: Could not forward element to next operator
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher.run(LegacyFetcher.java:242)
>   at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.run(FlinkKafkaConsumer.java:382)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:57)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:198)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:580)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Could not forward element to next 
> operator
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:332)
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:316)
>   at 
> org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:50)
>   at 
> org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:30)
>   at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$SourceOutput.collect(SourceStreamTask.java:106)
>   at 
> org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:92)
>   at 
> org.apache.flink.streaming.connectors.kafka.internals.LegacyFetcher$SimpleConsumerThread.run(LegacyFetcher.java:449)
> Caused by: java.lang.RuntimeException: Could not forward element to next 
> operator
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:332)
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:316)
>   at 
> org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:50)
>   at 
> org.apache.flink.streaming.runtime.io.CollectorWrapper.collect(CollectorWrapper.java:30)
>   at 
> org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:37)
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:329)
>   ... 6 more
> Caused by: 
> org.apache.flink.streaming.connectors.kafka.testutils.SuccessException
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase$7.flatMap(KafkaConsumerTestBase.java:931)
>   at 
> org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase$7.flatMap(KafkaConsumerTestBase.java:911)
>   at 
> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:47)
>   at 
> org.apache.flink.streaming.runtime.tasks.OutputHandler$CopyingChainingOutput.collect(OutputHandler.java:329)
>   ... 11 more
> 09/26/2015 17:52:51   Job execution switched to status FAILING.
> 09/26/2015 17:52:51   Job execution switched to status FAILED.
> Tests run: 12, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 80.981 sec 
> <<< FAILURE! - in org.apache.flink.streaming.connectors.kafka.KafkaITCase
> 

[jira] [Created] (FLINK-2854) KafkaITCase.testOneSourceMultiplePartitions failed on Travis

2015-10-15 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2854:


 Summary: KafkaITCase.testOneSourceMultiplePartitions failed on 
Travis
 Key: FLINK-2854
 URL: https://issues.apache.org/jira/browse/FLINK-2854
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector
Affects Versions: 0.10
Reporter: Till Rohrmann
Priority: Critical


The {{KafkaITCase.testOneSourceMultiplePartitions}} failed on Travis with no 
output for 300s.

https://s3.amazonaws.com/archive.travis-ci.org/jobs/85472083/log.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2858) Cannot build Flink Scala 2.11 with IntelliJ

2015-10-15 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2858:


 Summary: Cannot build Flink Scala 2.11 with IntelliJ
 Key: FLINK-2858
 URL: https://issues.apache.org/jira/browse/FLINK-2858
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.10
Reporter: Till Rohrmann


If I activate the scala-2.11 profile from within IntelliJ (and thus deactivate 
the scala-2.10 profile) in order to build Flink with Scala 2.11, then Flink 
cannot be built. The problem is that some Scala macros cannot be expanded 
because they were compiled with the wrong version (I assume 2.10).

This makes debugging tests with Scala 2.11 in IntelliJ impossible.





[jira] [Commented] (FLINK-2852) Fix flaky ScalaShellITSuite

2015-10-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964757#comment-14964757
 ] 

Till Rohrmann commented on FLINK-2852:
--

Hi [~sachingoel0101], you're right that the test is not robust. I've reworked 
the test and pushed a commit which should hopefully solve the problem.

> Fix flaky ScalaShellITSuite
> ---
>
> Key: FLINK-2852
> URL: https://issues.apache.org/jira/browse/FLINK-2852
> Project: Flink
>  Issue Type: Bug
>  Components: Scala Shell
>Affects Versions: 0.10
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 0.10
>
>
> The {{ScalaShellITSuite}} checks the log output to determine whether a job 
> has completed successfully. To do so, it looks for the string {{Job execution 
> switched to status FINISHED}} in the log output. However, if the 
> {{FlinkClientActor}} receives a {{JobResultSuccess}} message before receiving 
> the {{JobStatusChanged}} message, then it will send the execution result back 
> to the {{Client}} and terminate itself. As a consequence, the output will 
> never contain the above-mentioned string.
> I propose to use a different means of checking whether a job has finished 
> successfully.
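
The proposed fix, checking the returned job result rather than scanning log 
output, can be sketched as follows. This is an illustrative Python sketch, 
not the actual Scala test code; the function name and the submission callable 
are invented for the example:

```python
import concurrent.futures

def job_finished_successfully(submit_job, timeout=10.0):
    """Wait on the job's result future instead of grepping the log for
    'Job execution switched to status FINISHED', which may never appear
    if JobResultSuccess arrives before JobStatusChanged."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(submit_job)
        try:
            future.result(timeout=timeout)  # raises if the job failed
            return True
        except Exception:
            return False

# Stand-in for a blocking job submission call.
ok = job_finished_successfully(lambda: sum(range(100)))
```

Asserting on the future's outcome is robust against the message-ordering race 
described above, because it does not depend on which status message arrives 
first.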





[jira] [Closed] (FLINK-2852) Fix flaky ScalaShellITSuite

2015-10-20 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2852.

Resolution: Fixed

Fixed in 630798d

> Fix flaky ScalaShellITSuite
> ---
>
> Key: FLINK-2852
> URL: https://issues.apache.org/jira/browse/FLINK-2852
> Project: Flink
>  Issue Type: Bug
>  Components: Scala Shell
>Affects Versions: 0.10
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Critical
>  Labels: test-stability
> Fix For: 0.10
>
>
> The {{ScalaShellITSuite}} checks the log output to determine whether a job 
> has completed successfully. To do so, it looks for the string {{Job execution 
> switched to status FINISHED}} in the log output. However, if the 
> {{FlinkClientActor}} receives a {{JobResultSuccess}} message before receiving 
> the {{JobStatusChanged}} message, then it will send the execution result back 
> to the {{Client}} and terminate itself. As a consequence, the output will 
> never contain the above-mentioned string.
> I propose to use a different means of checking whether a job has finished 
> successfully.





[jira] [Resolved] (FLINK-2652) Failing PartitionRequestClientFactoryTest

2015-10-20 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2652.
--
Resolution: Fixed

Fixed in b2339464

> Failing PartitionRequestClientFactoryTest
> -
>
> Key: FLINK-2652
> URL: https://issues.apache.org/jira/browse/FLINK-2652
> Project: Flink
>  Issue Type: Bug
>Reporter: Ufuk Celebi
>Priority: Minor
>  Labels: test-stability
>
> PartitionRequestClientFactoryTest fails when running {{mvn 
> -Dhadoop.version=2.6.0 clean verify}}.





[jira] [Closed] (FLINK-2804) Support blocking job submission with Job Manager recovery

2015-10-20 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-2804.

Resolution: Fixed

Added in d18f580

> Support blocking job submission with Job Manager recovery
> -
>
> Key: FLINK-2804
> URL: https://issues.apache.org/jira/browse/FLINK-2804
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>Assignee: Till Rohrmann
>Priority: Minor
>
> Submitting a job in a blocking fashion with JobManager recovery enabled fails 
> on the client side (the one submitting the job) when the JobManager fails, 
> even though the job continues to be recovered.
> I propose to add simple support for re-retrieving the leading job manager, 
> updating the client actor with it, and then waiting for the result as before.
> As of the current standing in PR #1153 
> (https://github.com/apache/flink/pull/1153) the job manager assumes that the 
> same actor is running and just keeps on sending execution state updates etc. 
> (if the listening behaviour is not detached).
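
A minimal sketch of the proposed client-side behaviour, in Python for 
illustration only (the names {{get_leader}} and {{wait_for_result}} are 
hypothetical, not Flink API):

```python
def await_result(get_leader, wait_for_result, max_attempts=3):
    """Re-retrieve the leading JobManager whenever waiting fails, then
    resume waiting for the job result (sketch under stated assumptions)."""
    for _ in range(max_attempts):
        leader = get_leader()  # e.g. ask the leader retrieval service
        try:
            return wait_for_result(leader)
        except ConnectionError:
            continue  # leader failed over; look up the new leader and retry
    raise RuntimeError("no leading JobManager became available")

# Simulate a failover: the first leader dies, the second answers.
leaders = iter(["jm-1", "jm-2"])

def wait_on(leader):
    if leader == "jm-1":
        raise ConnectionError("leader failed over")
    return "job result"

result = await_result(lambda: next(leaders), wait_on)
```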





[jira] [Resolved] (FLINK-2792) Set log level of actor messages to TRACE

2015-10-20 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2792.
--
Resolution: Fixed

Fixed in 3aaee1e

> Set log level of actor messages to TRACE
> 
>
> Key: FLINK-2792
> URL: https://issues.apache.org/jira/browse/FLINK-2792
> Project: Flink
>  Issue Type: Wish
>  Components: JobManager
>Reporter: Ufuk Celebi
>Priority: Trivial
>
> Logging of received job manager actor messages happens at log level DEBUG 
> right now. The used logger is that of the JobManager/TaskManager 
> respectively. This means that as soon as you debug something related to the 
> JobManager/TaskManager you are always flooded with a lot of debug messages.
> Therefore, I would like to set the log level to TRACE for these messages.





[jira] [Resolved] (FLINK-2354) Recover running jobs on JobManager failure

2015-10-20 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2354.
--
Resolution: Fixed

Added in a6890b2

> Recover running jobs on JobManager failure
> --
>
> Key: FLINK-2354
> URL: https://issues.apache.org/jira/browse/FLINK-2354
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>Assignee: Ufuk Celebi
> Fix For: 0.10
>
>
> tl;dr Persist JobGraphs in state backend and coordinate reference to state 
> handle via ZooKeeper.
> Problem: When running multiple JobManagers in high availability mode, the 
> leading job manager loses all running jobs when it fails. After a new 
> leading job manager is elected, it is not possible to recover any previously 
> running jobs.
> Solution: The leading job manager, which receives the job graph, writes 1) the 
> job graph to a state backend, and 2) a reference to the respective state 
> handle to ZooKeeper. In general, job graphs can become large (multiple MBs, 
> because they include closures etc.). ZooKeeper is not designed for data of 
> this size. The level of indirection via the reference to the state backend 
> keeps the data in ZooKeeper small.
> Proposed ZooKeeper layout:
> /flink (default)
>   +- currentJobs
>      +- job id i
>         +- state handle reference of job graph i
> The 'currentJobs' node needs to be persistent to allow recovery of jobs 
> between job managers. The currentJobs node needs to satisfy the following 
> invariant: There is a reference to a job graph with id i IFF the respective 
> job graph needs to be recovered by a newly elected job manager leader.
> With this in place, jobs will be recovered from their initial state (as if 
> resubmitted). The next step is to back up the runtime state handles of 
> checkpoints in a similar manner.
> ---
> This work will be based on [~trohrm...@apache.org]'s implementation of 
> FLINK-2291. The leader election service notifies the job manager about 
> granted/revoked leadership. This notification happens via Akka and thus 
> serially *per* job manager, but results in eventually consistent state 
> between job managers. For brief periods of time it is possible for a new 
> leader to be granted leadership before the old one has had its leadership 
> revoked.
> [~trohrm...@apache.org], can you confirm that leadership does not guarantee 
> mutually exclusive access to the shared 'currentJobs' state?
> For example, the following can happen:
> - JM 1 is leader, JM 2 is standby
> - JOB i is running (and hence /flink/currentJobs/i exists)
> - ZK notifies leader election service (LES) of JM 1 and JM 2
> - LES 2 immediately notifies JM 2 about granted leadership, but LES 1 
> notification revoking leadership takes longer
> - JOB i finishes (TMs don't notice leadership change yet) and JM 1 receives 
> final JobStatusChange
> - JM 2 resubmits the job /flink/currentJobs/i
> - JM 1 removes /flink/currentJobs/i, because it is now finished
> => inconsistent state (wrt the specified invariant above)
> If it is indeed a problem, we can circumvent this with a Curator recipe for 
> [shared locks|http://curator.apache.org/curator-recipes/shared-lock.html] to 
> coordinate the access to currentJobs. The lock needs to be acquired on 
> leadership.
> ---
> Minimum required tests:
> - Unit tests for job graph serialization and writing to state backend and 
> ZooKeeper with expected nodes
> - Unit tests for job submission to job manager in leader/non-leader state
> - Unit tests for leadership granting/revoking and job submission/restarting 
> interleavings
> - Process failure integration tests with single and multiple running jobs
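
The invariant and the suggested shared lock can be modelled in a few lines; 
this is a toy in-memory sketch, not the ZooKeeper/Curator implementation, and 
all names are illustrative:

```python
import threading

class CurrentJobs:
    """Toy model of the /flink/currentJobs registry. A shared lock
    serializes removal (old leader, on job completion) and resubmission
    (new leader, on recovery), mirroring the Curator shared-lock recipe."""

    def __init__(self):
        self._lock = threading.Lock()
        self.jobs = {}        # job id -> state handle reference
        self.finished = set()

    def mark_finished(self, job_id):
        with self._lock:
            self.finished.add(job_id)
            self.jobs.pop(job_id, None)

    def recover(self, job_id, handle):
        with self._lock:
            # Invariant: a reference exists IFF the job still needs recovery.
            if job_id not in self.finished:
                self.jobs[job_id] = handle

reg = CurrentJobs()
reg.jobs["i"] = "handle-i"
reg.mark_finished("i")        # JM 1 sees the final JobStatusChange
reg.recover("i", "handle-i")  # JM 2 tries to resubmit: correctly rejected
```

With the lock, either ordering of the two operations leaves the registry 
consistent with the invariant.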





[jira] [Commented] (FLINK-2883) Combinable reduce produces wrong result

2015-10-21 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966537#comment-14966537
 ] 

Till Rohrmann commented on FLINK-2883:
--

Yes, documenting it sounds like a reasonable solution. This is probably also 
more of a corner case. 

> Combinable reduce produces wrong result
> ---
>
> Key: FLINK-2883
> URL: https://issues.apache.org/jira/browse/FLINK-2883
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.10
>Reporter: Till Rohrmann
>
> If one uses a combinable reduce operation which also changes the key value of 
> the underlying data element, then the results of the reduce operation can 
> become wrong. The reason is that after the combine phase, another reduce 
> operator is executed which will then reduce the elements based on the new key 
> values. This might not be so surprising if one has explicitly defined one's 
> {{GroupReduceOperation}} as combinable. However, the {{ReduceFunction}} 
> conceals the fact that a combiner is used implicitly. Furthermore, the API 
> does not prevent the user from changing the key fields, although such a 
> restriction could solve the problem.
> The following example program illustrates the problem
> {code}
> val env = ExecutionEnvironment.getExecutionEnvironment
> env.setParallelism(1)
> val input = env.fromElements((1,2), (1,3), (2,3), (3,3), (3,4))
> val result = input.groupBy(0).reduce{
>   (left, right) =>
> (left._1 + right._1, left._2 + right._2)
> }
> result.output(new PrintingOutputFormat[Int]())
> env.execute()
> {code}
> The expected output is 
> {code}
> (2, 5)
> (2, 3)
> (6, 7)
> {code}
> However, the actual output is
> {code}
> (4, 8)
> (6, 7)
> {code}
> I think that the underlying problem is that associativity and commutativity 
> are not sufficient for a combinable reduce operation. Additionally, we also 
> need to make sure that the key stays the same.
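
The wrong result can be reproduced without Flink by modelling the combine 
phase as a first group-reduce pass whose output is then grouped again by the 
(now changed) keys. The helper below is a sketch of that behaviour, not 
Flink's actual runtime:

```python
from functools import reduce as fold
from itertools import groupby
from operator import itemgetter

def group_reduce(pairs, fn):
    """Group tuples by their first field, then fold each group with fn."""
    out = []
    for _, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        out.append(fold(fn, group))
    return out

def add(left, right):
    # The key-changing reduce function from the example above.
    return (left[0] + right[0], left[1] + right[1])

data = [(1, 2), (1, 3), (2, 3), (3, 3), (3, 4)]

direct = group_reduce(data, add)                       # what the user expects
combined = group_reduce(group_reduce(data, add), add)  # combine, then reduce
```

The second pass merges {{(2, 5)}} and {{(2, 3)}} because both now carry the 
key 2, which yields the observed {{(4, 8)}}.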





[jira] [Commented] (FLINK-2858) Cannot build Flink Scala 2.11 with IntelliJ

2015-10-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965087#comment-14965087
 ] 

Till Rohrmann commented on FLINK-2858:
--

Thanks for your help [~aalexandrov] :-) I did not know how to properly change 
the Scala version. I'm ok with changing the version via the shell script and 
documenting it on the web site.

> Cannot build Flink Scala 2.11 with IntelliJ
> ---
>
> Key: FLINK-2858
> URL: https://issues.apache.org/jira/browse/FLINK-2858
> Project: Flink
>  Issue Type: Bug
>  Components: Build System
>Affects Versions: 0.10
>Reporter: Till Rohrmann
>
> If I activate the scala-2.11 profile from within IntelliJ (and thus 
> deactivate the scala-2.10 profile) in order to build Flink with Scala 2.11, 
> then Flink cannot be built. The problem is that some Scala macros cannot be 
> expanded because they were compiled with the wrong version (I assume 2.10).
> This makes debugging tests with Scala 2.11 in IntelliJ impossible.





[jira] [Assigned] (FLINK-2804) Support blocking job submission with Job Manager recovery

2015-10-07 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann reassigned FLINK-2804:


Assignee: Till Rohrmann  (was: Ufuk Celebi)

> Support blocking job submission with Job Manager recovery
> -
>
> Key: FLINK-2804
> URL: https://issues.apache.org/jira/browse/FLINK-2804
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 0.10
>Reporter: Ufuk Celebi
>Assignee: Till Rohrmann
>Priority: Minor
>
> Submitting a job in a blocking fashion with JobManager recovery enabled fails 
> on the client side (the one submitting the job) when the JobManager fails, 
> even though the job continues to be recovered.
> I propose to add simple support for re-retrieving the leading job manager, 
> updating the client actor with it, and then waiting for the result as before.
> As of the current standing in PR #1153 
> (https://github.com/apache/flink/pull/1153) the job manager assumes that the 
> same actor is running and just keeps on sending execution state updates etc. 
> (if the listening behaviour is not detached).





[jira] [Commented] (FLINK-2800) kryo serialization problem

2015-10-06 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944631#comment-14944631
 ] 

Till Rohrmann commented on FLINK-2800:
--

Hi [~stefano.bortoli], do you have a small code example that reproduces the 
problem? Or does it happen with any cross operation?

> kryo serialization problem
> --
>
> Key: FLINK-2800
> URL: https://issues.apache.org/jira/browse/FLINK-2800
> Project: Flink
>  Issue Type: Bug
>  Components: Type Serialization System
>Affects Versions: master
> Environment: linux ubuntu 12.04 LTS, Java 7
>Reporter: Stefano Bortoli
>
> Performing a cross of two datasets of POJOs, I got the exception below. 
> The first time I ran the process, there was no problem. When I ran it the 
> second time, I got the exception. My guess is that it could be a race 
> condition related to the reuse of the Kryo serializer object. However, it 
> could also be "a bug where type registrations are not properly forwarded to 
> all Serializers", as suggested by Stephan.
> 
> 2015-10-01 18:18:21 INFO  JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at 
> main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 114
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
>   at 
> org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180)
>   at 
> org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
>   at java.lang.Thread.run(Thread.java:745)





[jira] [Commented] (FLINK-2800) kryo serialization problem

2015-10-06 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944659#comment-14944659
 ] 

Till Rohrmann commented on FLINK-2800:
--

If I'm not mistaken, this code example is not complete. It would be great if 
you could fill in the gaps like {{POI2CDACombineGroupFunction}}, 
{{AddProxy2POIReduceGroupFunction}} and {{GetEntitonForClass}}, or if you could 
distill the whole example down to something like

{code}
DataSet<...> input1 = env.createFromElements(...)
DataSet<...> input2 = env.createFromElements(...)

input1.cross(input2).print()
{code}

if this reproduces the problem for you. Thanks for your help.

> kryo serialization problem
> --
>
> Key: FLINK-2800
> URL: https://issues.apache.org/jira/browse/FLINK-2800
> Project: Flink
>  Issue Type: Bug
>  Components: Type Serialization System
>Affects Versions: master
> Environment: linux ubuntu 12.04 LTS, Java 7
>Reporter: Stefano Bortoli
>
> Performing a cross of two datasets of POJOs, I got the exception below. 
> The first time I ran the process, there was no problem. When I ran it the 
> second time, I got the exception. My guess is that it could be a race 
> condition related to the reuse of the Kryo serializer object. However, it 
> could also be "a bug where type registrations are not properly forwarded to 
> all Serializers", as suggested by Stephan.
> 
> 2015-10-01 18:18:21 INFO  JobClient:161 - 10/01/2015 18:18:21 Cross(Cross at 
> main(FlinkMongoHadoop2LinkPOI2CDA.java:160))(3/4) switched to FAILED 
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 114
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:210)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:127)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.deserialize(TupleSerializer.java:30)
>   at 
> org.apache.flink.runtime.operators.resettable.AbstractBlockResettableIterator.getNextRecord(AbstractBlockResettableIterator.java:180)
>   at 
> org.apache.flink.runtime.operators.resettable.BlockResettableMutableObjectIterator.next(BlockResettableMutableObjectIterator.java:111)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.runBlockedOuterSecond(CrossDriver.java:309)
>   at 
> org.apache.flink.runtime.operators.CrossDriver.run(CrossDriver.java:162)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:489)
>   at 
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:354)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
>   at java.lang.Thread.run(Thread.java:745)





[jira] [Commented] (FLINK-2790) Add high availability support for Yarn

2015-10-06 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944675#comment-14944675
 ] 

Till Rohrmann commented on FLINK-2790:
--

What did the logs say?

I'll try to reproduce it.




> Add high availability support for Yarn
> --
>
> Key: FLINK-2790
> URL: https://issues.apache.org/jira/browse/FLINK-2790
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Reporter: Till Rohrmann
> Fix For: 0.10
>
>
> Add master high availability support for Yarn. The idea is to let Yarn 
> restart a failed application master in a new container. For that, we set the 
> number of application retries to something greater than 1. 
> From version 2.4.0 onwards, it is possible to reuse already started 
> containers for the TaskManagers, thus, avoiding unnecessary restart delays.
> From version 2.6.0 onwards, it is possible to specify an interval in which 
> the number of application attempts have to be exceeded in order to fail the 
> job. This will prevent long running jobs from eventually depleting all 
> available application attempts.





[jira] [Created] (FLINK-2329) Refactor RPCs from within the ExecutionGraph

2015-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2329:


 Summary: Refactor RPCs from within the ExecutionGraph
 Key: FLINK-2329
 URL: https://issues.apache.org/jira/browse/FLINK-2329
 Project: Flink
  Issue Type: Sub-task
Reporter: Till Rohrmann
Assignee: Till Rohrmann


Currently, we store an {{ActorRef}} of the TaskManager into an {{Instance}} 
object. This {{ActorRef}} is used from within {{Executions}} to interact with 
the {{TaskManager}}. This is not a nice abstraction since it does not hide 
implementation details. 

Since we need to add a leader session ID to messages sent by the {{Executions}} 
in order to support high availability, we would need to make the leader session 
ID available to the {{Execution}}. A better solution seems to be to replace the 
direct {{ActorRef}} interaction with an instance gateway abstraction which 
encapsulates the communication logic. Having such an abstraction, it will be 
easy to decorate messages transparently with a leader session ID. Therefore, I 
propose to refactor the current {{Instance}} communication and to introduce an 
{{InstanceGateway}} abstraction.
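
The proposed abstraction can be sketched as follows; the names mirror the 
issue text but are illustrative, not the final Flink API:

```python
class InstanceGateway:
    """Encapsulates communication with a TaskManager so that callers
    (e.g. Executions) never touch the underlying ActorRef directly."""

    def __init__(self, send, leader_session_id):
        self._send = send                          # underlying transport
        self._leader_session_id = leader_session_id

    def tell(self, message):
        # Transparently decorate every message with the leader session ID.
        self._send({"leaderSessionID": self._leader_session_id,
                    "payload": message})

sent = []
gateway = InstanceGateway(sent.append, leader_session_id="session-1")
gateway.tell("SubmitTask")
```

Because the decoration happens inside the gateway, the {{Execution}} code 
never needs to know the current leader session ID.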





[jira] [Created] (FLINK-2332) Assign session IDs to JobManager and TaskManager messages

2015-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2332:


 Summary: Assign session IDs to JobManager and TaskManager messages
 Key: FLINK-2332
 URL: https://issues.apache.org/jira/browse/FLINK-2332
 Project: Flink
  Issue Type: Sub-task
Reporter: Till Rohrmann
Assignee: Till Rohrmann


In order to support true high availability, {{TaskManager}} and {{JobManager}} 
have to be able to distinguish whether a message was sent by the current 
leader or by a former leader. Messages which come from a former leader have to 
be discarded in order to guarantee a consistent state.

A way to achieve this is to assign a leader session ID to a {{JobManager}} 
once it is elected as leader. This leader session ID is sent to the 
{{TaskManager}} upon registration at the {{JobManager}}. All subsequent 
messages should then be decorated with this leader session ID to mark them as 
valid. On the {{TaskManager}} side, the leader session ID received in response 
to the registration message can then be used to validate incoming messages.

The same holds true for registration messages which should have a registration 
session ID, too. That way, it is possible to distinguish invalid registration 
messages from valid ones. The registration session ID can be assigned once the 
TaskManager is notified about the new leader.
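
The validation on the receiving side can be sketched like this (a hedged 
Python illustration of the idea, not Flink's Akka-based implementation):

```python
class SessionAwareReceiver:
    """Discards messages that do not carry the current leader session ID."""

    def __init__(self, leader_session_id):
        self.leader_session_id = leader_session_id
        self.accepted = []

    def receive(self, message):
        if message.get("leaderSessionID") != self.leader_session_id:
            return False  # message from a former leader: discard it
        self.accepted.append(message["payload"])
        return True

tm = SessionAwareReceiver(leader_session_id="session-2")
stale = tm.receive({"leaderSessionID": "session-1", "payload": "CancelTask"})
valid = tm.receive({"leaderSessionID": "session-2", "payload": "SubmitTask"})
```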





[jira] [Updated] (FLINK-2162) Implement adaptive learning rate strategies for SGD

2015-07-09 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2162:
-
Assignee: Ventura Del Monte

 Implement adaptive learning rate strategies for SGD
 ---

 Key: FLINK-2162
 URL: https://issues.apache.org/jira/browse/FLINK-2162
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Ventura Del Monte
Priority: Minor
  Labels: ML

 At the moment, the SGD implementation uses a simple adaptive learning rate 
 strategy, {{adaptedLearningRate = 
 initialLearningRate/sqrt(iterationNumber)}}, which makes the optimization 
 algorithm sensitive to the setting of the {{initialLearningRate}}. If this 
 value is chosen wrongly, then the SGD might become unstable.
 There are better ways to calculate the learning rate [1] such as Adagrad [3], 
 Adadelta [4], SGD with momentum [5] others [2]. They promise to result in 
 more stable optimization algorithms which don't require so much 
 hyperparameter tweaking. It might be worthwhile to investigate these 
 approaches.
 It might also be interesting to look at the implementation of vowpal wabbit 
 [6].
 Resources:
 [1] [http://imgur.com/a/Hqolp]
 [2] [http://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html]
 [3] [http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf]
 [4] [http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf]
 [5] [http://www.willamette.edu/~gorr/classes/cs449/momrate.html]
 [6] [https://github.com/JohnLangford/vowpal_wabbit]
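
For comparison, the current decay rule and an Adagrad-style per-coordinate 
rule can be sketched in a few lines (a simplified illustration of [3], not 
FlinkML code):

```python
import math

def simple_adaptive_rate(initial_rate, iteration):
    # Current FlinkML strategy: initialLearningRate / sqrt(iterationNumber)
    return initial_rate / math.sqrt(iteration)

def adagrad_step(weights, gradient, accum, rate=0.1, eps=1e-8):
    """One Adagrad update: each coordinate gets its own effective rate,
    which shrinks where gradients have historically been large."""
    new_weights, new_accum = [], []
    for w, g, a in zip(weights, gradient, accum):
        a = a + g * g
        new_weights.append(w - rate * g / (math.sqrt(a) + eps))
        new_accum.append(a)
    return new_weights, new_accum

w, acc = adagrad_step([0.0, 0.0], [1.0, 0.5], [0.0, 0.0])
```

Because the effective rate is normalized per coordinate, the first step moves 
every coordinate by roughly the same amount regardless of gradient scale, 
which reduces the sensitivity to the initial learning rate.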




