[jira] [Commented] (FLINK-1848) Paths containing a Windows drive letter cannot be used in FileOutputFormats

2015-05-19 Thread Laurent Tardif (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551889#comment-14551889
 ] 

Laurent Tardif commented on FLINK-1848:
---

Going a bit deeper into the investigation shows that the Path constructor
normalizes the path, so the Windows path "/c:/" becomes "/c:". Fed to the URI
constructor, this yields "file:/c:", which is a relative path, not an absolute
one.

The two Path constructors handle Windows directories differently, which is a
bad smell. Also, in the path normalization function of the Path object, the
removal of the trailing "/" does not handle Windows paths correctly.

Also, in the code, Path / File handling of Windows paths is inconsistent and
scattered over several places (the constructors, several methods, ...).
Windows paths are also only considered in the "c:\...\" form; what about
"\\host\Path\" paths, which are also common?

Maybe managing only URI objects internally would be an idea, to avoid complex
exception handling.
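
A minimal sketch with plain {{java.net.URI}} (outside Flink's Path class) that
reproduces the failure: once normalization turns "/c:/" into "c:", the URI
constructor rejects the path as relative.

{code}
import java.net.URI;
import java.net.URISyntaxException;

public class DriveLetterUriDemo {
    public static void main(String[] args) throws URISyntaxException {
        // Fine: the path component starts with '/', so it is absolute.
        System.out.println(new URI("file", null, "/c:/my/directory", null, null));

        // Throws "Relative path in absolute URI: file:c:" -- the
        // over-normalized path "c:" no longer starts with '/'.
        new URI("file", null, "c:", null, null);
    }
}
{code}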


> Paths containing a Windows drive letter cannot be used in FileOutputFormats
> ---
>
> Key: FLINK-1848
> URL: https://issues.apache.org/jira/browse/FLINK-1848
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.9
> Environment: Windows (Cygwin and native)
>Reporter: Fabian Hueske
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 0.9
>
>
> Paths that contain a Windows drive letter such as {{file:///c:/my/directory}} 
> cannot be used as output path for {{FileOutputFormat}}.
> If done, the following exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: file:c:
> at org.apache.flink.core.fs.Path.initialize(Path.java:242)
> at org.apache.flink.core.fs.Path.<init>(Path.java:225)
> at org.apache.flink.core.fs.Path.<init>(Path.java:138)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.pathToFile(LocalFileSystem.java:147)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:232)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.FileSystem.initOutPathLocalFS(FileSystem.java:603)
> at 
> org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:233)
> at 
> org.apache.flink.api.java.io.CsvOutputFormat.open(CsvOutputFormat.java:158)
> at 
> org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:183)
> at 
> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:c:
> at java.net.URI.checkPath(Unknown Source)
> at java.net.URI.<init>(Unknown Source)
> at org.apache.flink.core.fs.Path.initialize(Path.java:240)
> ... 14 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Raghav Chalapathy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551753#comment-14551753
 ] 

Raghav Chalapathy commented on FLINK-1733:
--

Hi Till,
Let me go through the paper; I shall then present my analysis about the same.
With regards,
Raghav


> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>Priority: Minor
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication [2] describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Description: Implementation of Hits Algorithm in Gelly API using Java. the 
feature branch can be found here: 
(https://github.com/JavidMayar/flink/commits/HITS)  (was: Implementation of 
Hits Algorithm in Gelly API using Java. [the feature branch can be found here] 
(https://github.com/JavidMayar/flink/commits/HITS))

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. the feature branch 
> can be found here: (https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Description: Implementation of Hits Algorithm in Gelly API using Java. [the 
feature branch can be found here] 
(https://github.com/JavidMayar/flink/commits/HITS)  (was: Implementation of 
Hits Algorithm in Gelly API using Java. [the feature branch can be found 
here](https://github.com/JavidMayar/flink/commits/HITS))

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. [the feature branch 
> can be found here] (https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Description: Implementation of Hits Algorithm in Gelly API using Java. [the 
feature branch can be found 
here](https://github.com/JavidMayar/flink/commits/HITS)  (was: Implementation 
of Hits Algorithm in Gelly API using Java.)

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. [the feature branch 
> can be found here](https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Description: Implementation of Hits Algorithm in Gelly API using Java.  
(was: Implementation of Hits Algorithm in Gelly API using Java. The [feature 
branch can be found here.](https://github.com/JavidMayar/flink/commits/HITS))

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Description: Implementation of Hits Algorithm in Gelly API using Java. The 
[feature branch can be found 
here.](https://github.com/JavidMayar/flink/commits/HITS)  (was: Implementation 
of Hits Algorithm in Gelly API using Java )

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java. The [feature branch 
> can be found here.](https://github.com/JavidMayar/flink/commits/HITS)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Attachment: (was: HitsMain.java)

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Attachment: (was: Hits_Class.java)

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
>
> Implementation of Hits Algorithm in Gelly API using Java 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-1523) Vertex-centric iteration extensions

2015-05-19 Thread Vasia Kalavri (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasia Kalavri resolved FLINK-1523.
--
   Resolution: Implemented
Fix Version/s: 0.9

> Vertex-centric iteration extensions
> ---
>
> Key: FLINK-1523
> URL: https://issues.apache.org/jira/browse/FLINK-1523
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Andra Lungu
> Fix For: 0.9
>
>
> We would like to make the following extensions to the vertex-centric 
> iterations of Gelly:
> - allow vertices to access their in/out degrees and the total number of 
> vertices of the graph, inside the iteration.
> - allow choosing the neighborhood type (in/out/all) over which to run the 
> vertex-centric iteration. Now, the model uses the updates of the in-neighbors 
> to calculate state and send messages to out-neighbors. We could add a 
> parameter with value "in/out/all" to the {{VertexUpdateFunction}} and 
> {{MessagingFunction}}, that would indicate the type of neighborhood.
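
A sketch of what the proposed knob could look like (enum and setter names are
illustrative assumptions, not the merged Gelly API):

{code}
// Illustrative only: the neighborhood over which updates and messages flow.
enum EdgeDirection { IN, OUT, ALL }

// Hypothetical use on the iteration configuration:
// iteration.setDirection(EdgeDirection.ALL);
{code}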



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1523) Vertex-centric iteration extensions

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551220#comment-14551220
 ] 

ASF GitHub Bot commented on FLINK-1523:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/680


> Vertex-centric iteration extensions
> ---
>
> Key: FLINK-1523
> URL: https://issues.apache.org/jira/browse/FLINK-1523
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Andra Lungu
>
> We would like to make the following extensions to the vertex-centric 
> iterations of Gelly:
> - allow vertices to access their in/out degrees and the total number of 
> vertices of the graph, inside the iteration.
> - allow choosing the neighborhood type (in/out/all) over which to run the 
> vertex-centric iteration. Now, the model uses the updates of the in-neighbors 
> to calculate state and send messages to out-neighbors. We could add a 
> parameter with value "in/out/all" to the {{VertexUpdateFunction}} and 
> {{MessagingFunction}}, that would indicate the type of neighborhood.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [gelly][Refactoring] Removed example string

2015-05-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/625


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-1523] vertex-centric iteration extensio...

2015-05-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/680


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Vasia Kalavri (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551203#comment-14551203
 ] 

Vasia Kalavri commented on FLINK-2044:
--

Hello [~JavidMayar]! Thank you for your efforts in contributing to Gelly :)
A HITS implementation would be a great addition to the library.

As a first step, I would suggest you look into the existing Gelly library 
algorithms under {{org.apache.flink.graph.library}}. You will see that library 
methods implement a simple interface. It would be great if you could change 
your implementation to follow the same format.

Taking a quick look at your working branch, I see that you have implemented the 
iteration with an explicit for-loop. I understand how this is a nice and 
convenient way to program, but this is not the most efficient way to do things 
at the moment. In fact, I would expect this to fail with a moderately big 
input. Please, see [this recent discussion | 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-hanging-between-job-executions-All-Pairs-Shortest-Paths-td1238.html]
 in our mailing list for details.

Instead of using a for-loop, I would suggest you look into Gelly's 
[vertex-centric iterations| 
http://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html#vertex-centric-iterations],
 which use [Flink's native iteration 
operators|http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators]
 internally.
Since HITS has two phases, it might not be entirely straightforward to fit it
into this model, but it shouldn't be too hard if you track which phase you are
in at each superstep.
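
A rough sketch of the phase-tracking idea (plain Java, not the Gelly API; all
names are assumptions). A vertex keeps both scores, and the superstep parity
decides which one the incoming messages update:

{code}
class HitsScores {
    double hub = 1.0;
    double authority = 1.0;

    void update(int superstep, Iterable<Double> messages) {
        double sum = 0.0;
        for (double m : messages) {
            sum += m;
        }
        if (superstep % 2 == 1) {
            authority = sum; // odd step: in-neighbors sent hub scores
        } else {
            hub = sum;       // even step: out-neighbors sent authority scores
        }
        // Global normalization of the scores is omitted in this sketch.
    }
}
{code}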

Let us know here or in the mailing lists if you have any questions!

-Vasia.

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
> Attachments: HitsMain.java, Hits_Class.java
>
>
> Implementation of Hits Algorithm in Gelly API using Java 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Pluggable state backend support for checkpoint...

2015-05-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/676


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Assigned] (FLINK-2052) Clear up Serializable warnings in streaming operators

2015-05-19 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek reassigned FLINK-2052:
---

Assignee: Aljoscha Krettek

> Clear up Serializable warnings in streaming operators
> -
>
> Key: FLINK-2052
> URL: https://issues.apache.org/jira/browse/FLINK-2052
> Project: Flink
>  Issue Type: Task
>  Components: Streaming
>Reporter: Gyula Fora
>Assignee: Aljoscha Krettek
>Priority: Minor
>
> The recent stream operator rework removed the "serialVersionUID" fields from 
> operators which causes compiler warnings.
> These should be re-added to all serializable classes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2051) Fix disabled test in ComplexIntegrationTest

2015-05-19 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551139#comment-14551139
 ] 

Aljoscha Krettek commented on FLINK-2051:
-

Thanks for spotting this, I forgot to add an issue.

> Fix disabled test in ComplexIntegrationTest
> ---
>
> Key: FLINK-2051
> URL: https://issues.apache.org/jira/browse/FLINK-2051
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming, Tests
>Affects Versions: 0.9
>Reporter: Márton Balassi
>
> One of the tests was marked as undesirable by [~aljoscha] in f59d6a1. Let us 
> make sure to either fix it or have it removed and the functionality tested in 
> another way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Pluggable state backend support for checkpoint...

2015-05-19 Thread senorcarbone
Github user senorcarbone commented on a diff in the pull request:

https://github.com/apache/flink/pull/676#discussion_r30637229
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/ByteStreamStateHandle.java
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.state;
+
+import java.io.InputStream;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.OutputStream;
+import java.io.Serializable;
+
+/**
+ * Statehandle that writes/reads the contents of the serializable checkpointed
+ * state to the provided input and outputstreams using default java
+ * serialization.
+ *
+ */
+public abstract class ByteStreamStateHandle implements StateHandle<Serializable> {
+
+   private static final long serialVersionUID = -962025800339325828L;
+
+   private transient Serializable state;
--- End diff --

oh found it, indeed. It's the SerializedValue constructor actually. That's 
nice, good job  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Pluggable state backend support for checkpoint...

2015-05-19 Thread gyfora
Github user gyfora commented on a diff in the pull request:

https://github.com/apache/flink/pull/676#discussion_r30637002
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/ByteStreamStateHandle.java
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.state;
+
+import java.io.InputStream;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.OutputStream;
+import java.io.Serializable;
+
+/**
+ * Statehandle that writes/reads the contents of the serializable checkpointed
+ * state to the provided input and outputstreams using default java
+ * serialization.
+ *
+ */
+public abstract class ByteStreamStateHandle implements StateHandle<Serializable> {
+
+   private static final long serialVersionUID = -962025800339325828L;
+
+   private transient Serializable state;
--- End diff --

We manually serialize all the time in the RuntimeEnvironment 
acknowledgeCheckpoint method into a StateForTask object :)
but good catch


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [gelly][Refactoring] Removed example string

2015-05-19 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/625#issuecomment-103641540
  
I take it your update means you agreed with my suggestion :)
I'll merge this one as well!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Pluggable state backend support for checkpoint...

2015-05-19 Thread senorcarbone
Github user senorcarbone commented on the pull request:

https://github.com/apache/flink/pull/676#issuecomment-103640808
  
Apart from my comment on the transient field I think this is ok to merge. 
Perhaps we can look into standardising the task manager backend configuration 
afterwards. So it's :+1: from me :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Pluggable state backend support for checkpoint...

2015-05-19 Thread senorcarbone
Github user senorcarbone commented on a diff in the pull request:

https://github.com/apache/flink/pull/676#discussion_r30636159
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/state/ByteStreamStateHandle.java
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.state;
+
+import java.io.InputStream;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.OutputStream;
+import java.io.Serializable;
+
+/**
+ * Statehandle that writes/reads the contents of the serializable checkpointed
+ * state to the provided input and outputstreams using default java
+ * serialization.
+ *
+ */
+public abstract class ByteStreamStateHandle implements StateHandle<Serializable> {
+
+   private static final long serialVersionUID = -962025800339325828L;
+
+   private transient Serializable state;
--- End diff --

The state here might persist in the StateHandle as-is (and get transferred to
the JobManager) in addition to being written to disk, if it happens to be in
the same JVM. There is no guaranteed serialization with Akka...
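
A minimal, Flink-independent demo of the concern: a transient field survives
same-JVM reference passing (as with local Akka delivery), but is lost once the
object actually goes through Java serialization.

{code}
import java.io.*;

public class TransientDemo {
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        transient String state = "in-memory only";
    }

    public static void main(String[] args) throws Exception {
        Holder h = new Holder();
        // Same JVM: the reference is passed as-is, the state is still there.
        System.out.println(h.state); // "in-memory only"

        // Round-trip through Java serialization: the transient state is dropped.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(h);
        Holder copy = (Holder) new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(copy.state); // null
    }
}
{code}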


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1907) Scala Interactive Shell

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551050#comment-14551050
 ] 

ASF GitHub Bot commented on FLINK-1907:
---

Github user nikste commented on the pull request:

https://github.com/apache/flink/pull/672#issuecomment-103638341
  
Fixed according to the comments and the Linux terminal issue; it should now 
pass all tests.


> Scala Interactive Shell
> ---
>
> Key: FLINK-1907
> URL: https://issues.apache.org/jira/browse/FLINK-1907
> Project: Flink
>  Issue Type: New Feature
>  Components: Scala API
>Reporter: Nikolaas Steenbergen
>Assignee: Nikolaas Steenbergen
>Priority: Minor
>
> Build an interactive Shell for the Scala api.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1907] Scala shell

2015-05-19 Thread nikste
Github user nikste commented on the pull request:

https://github.com/apache/flink/pull/672#issuecomment-103638341
  
Fixed according to the comments and the Linux terminal issue; it should now 
pass all tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance

2015-05-19 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550892#comment-14550892
 ] 

Stephan Ewen commented on FLINK-1952:
-

Next step in the diagnosis: There is apparently something wrong in the 
hierarchy of shared slots. The root slot (that is actually allocated from the 
TaskManager) is released, while the slot associated with the CoLocation group 
of the tasks is still alive.

> Cannot run ConnectedComponents example: Could not allocate a slot on instance
> -
>
> Key: FLINK-1952
> URL: https://issues.apache.org/jira/browse/FLINK-1952
> Project: Flink
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Priority: Blocker
>
> Steps to reproduce
> {code}
> ./bin/yarn-session.sh -n 350 
> {code}
> ... wait until they are connected ...
> {code}
> Number of connected TaskManagers changed to 266. Slots available: 266
> Number of connected TaskManagers changed to 323. Slots available: 323
> Number of connected TaskManagers changed to 334. Slots available: 334
> Number of connected TaskManagers changed to 343. Slots available: 343
> Number of connected TaskManagers changed to 350. Slots available: 350
> {code}
> Start CC
> {code}
> ./bin/flink run -p 350 
> ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
> {code}
> ---> it runs
> Run KMeans, let it fail with 
> {code}
> Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - 
> execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 
> - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network 
> buffers: required 350, but only 254 available. The total number of network 
> buffers is currently set to 2048. You can increase this number by setting the 
> configuration key 'taskmanager.network.numberOfBuffers'.
> {code}
> ... as expected.
> (I've waited for 10 minutes between the two submissions)
> Starting CC now will fail:
> {code}
> ./bin/flink run -p 350 
> ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar 
> {code}
> Error message(s):
> {code}
> Caused by: java.lang.IllegalStateException: Could not schedule consumer 
> vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469)
>   at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>   ... 4 more
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ 
> cloud-19 - 1 slots - URL: 
> akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the 
> co-location constraint.
>   at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247)
>   at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110)
>   at 
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262)
>   at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475)
>   ... 9 more
> {code}
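
For reference, the remedy named in the first error message is a one-line
configuration change; the value below is only an illustrative example and must
be sized to the job's parallelism and shuffle pattern:

{code}
# flink-conf.yaml -- key taken from the error message; value is an example only
taskmanager.network.numberOfBuffers: 8192
{code}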



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2052) Clear up Serializable warnings in streaming operators

2015-05-19 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2052:
-

 Summary: Clear up Serializable warnings in streaming operators
 Key: FLINK-2052
 URL: https://issues.apache.org/jira/browse/FLINK-2052
 Project: Flink
  Issue Type: Task
  Components: Streaming
Reporter: Gyula Fora
Priority: Minor


The recent stream operator rework removed the "serialVersionUID" fields from 
operators, which causes compiler warnings.

These should be re-added to all serializable classes.
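
What re-adding the field looks like (a sketch; the class name is illustrative):

{code}
import java.io.Serializable;

public class ExampleStreamOperator implements Serializable {
    // Pins the serialized form and silences the
    // "serializable class has no serialVersionUID" warning.
    private static final long serialVersionUID = 1L;
}
{code}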



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on a diff in the pull request:

https://github.com/apache/flink/pull/695#discussion_r30622734
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-connectors/src/test/java/org/apache/flink/streaming/connectors/twitter/TwitterFilterSourceTest.java
 ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.connectors.twitter;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.connectors.json.JSONParseFlatMap;
+import org.apache.flink.util.Collector;
+
+public class TwitterFilterSourceTest {
+
+   private static final String PATH_TO_AUTH_FILE = "/twitter.properties";
+   
+   public static void main(String[] args) {
--- End diff --

But it would be possible to implement this as JUnit tests that are skipped by 
default. Someone who wants to execute the tests has to use their own access 
keys, e.g. as sketched below.
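
A sketch of the skipped-by-default pattern (JUnit 4 Assume; the class and
property names are assumptions):

{code}
import org.junit.Assume;
import org.junit.Test;

public class TwitterFilterSourceITCase {

    @Test
    public void testFilterSource() throws Exception {
        String authFile = System.getProperty("twitter.auth.file");
        // Skip (not fail) when no credentials are supplied.
        Assume.assumeTrue(authFile != null);
        // ... open the stream with the supplied access keys ...
    }
}
{code}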


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on a diff in the pull request:

https://github.com/apache/flink/pull/695#discussion_r30622577
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-connectors/src/test/java/org/apache/flink/streaming/connectors/twitter/TwitterFilterSourceTest.java
 ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.connectors.twitter;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.connectors.json.JSONParseFlatMap;
+import org.apache.flink.util.Collector;
+
+public class TwitterFilterSourceTest {
+
+   private static final String PATH_TO_AUTH_FILE = "/twitter.properties";
+   
+   public static void main(String[] args) {
--- End diff --

Unfortunately, it is not possible to implement this as JUnit tests. The 
stream connection needs the access keys of a Twitter account, and the maximum 
number of parallel streams per Twitter account is restricted to 2. If more 
than 2 people execute the tests at the same time, the tests fail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on the pull request:

https://github.com/apache/flink/pull/695#issuecomment-103597087
  
Unfortunately, it is not possible to implement this as JUnit tests. The 
stream connection needs the access keys of a Twitter account, and the maximum 
number of parallel streams per Twitter account is restricted to 2. If more 
than 2 people execute the test at the same time, the tests fail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (FLINK-2051) Fix disabled test in ComplexIntegrationTest

2015-05-19 Thread JIRA
Márton Balassi created FLINK-2051:
-

 Summary: Fix disabled test in ComplexIntegrationTest
 Key: FLINK-2051
 URL: https://issues.apache.org/jira/browse/FLINK-2051
 Project: Flink
  Issue Type: Bug
  Components: Streaming, Tests
Affects Versions: 0.9
Reporter: Márton Balassi


One of the tests was marked as undesirable by [~aljoscha] in f59d6a1. Let us 
make sure to either fix it or have it removed and the functionality tested in 
another way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on a diff in the pull request:

https://github.com/apache/flink/pull/695#discussion_r30621383
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-connectors/src/main/java/org/apache/flink/streaming/connectors/twitter/TwitterFilterSource.java
 ---
@@ -0,0 +1,284 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.connectors.twitter;
+
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.concurrent.LinkedBlockingQueue;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.twitter.hbc.core.endpoint.Location;
+import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
+import com.twitter.hbc.httpclient.auth.Authentication;
+
+/**
+ * 
+ * An extension of {@link TwitterSource} by filter parameters. This extension
+ * enables to filter the twitter stream by user defined parameters.
+ * 
+ * @author Hilmi Yildirim
--- End diff --

Yes, I can remove the annotation.

There is a problem with the JUnit tests: for them we need a Twitter account, 
and the access keys for the stream connection would have to be in the resource 
folder. This means that other people could use the access keys, which can 
influence the tests. I think there is no secure way to implement the JUnit 
tests.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/695#discussion_r30620688
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-connectors/src/test/java/org/apache/flink/streaming/connectors/twitter/TwitterFilterSourceTest.java
 ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.connectors.twitter;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.connectors.json.JSONParseFlatMap;
+import org.apache.flink.util.Collector;
+
+public class TwitterFilterSourceTest {
+
+   private static final String PATH_TO_AUTH_FILE = "/twitter.properties";
+   
+   public static void main(String[] args) {
--- End diff --

We are using JUnit for our tests so it would be great to convert this to 
JUnit style of test class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/695#discussion_r30620574
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-connectors/src/main/java/org/apache/flink/streaming/connectors/twitter/TwitterFilterSource.java
 ---
@@ -0,0 +1,284 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.streaming.connectors.twitter;
+
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.concurrent.LinkedBlockingQueue;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.twitter.hbc.core.endpoint.Location;
+import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
+import com.twitter.hbc.httpclient.auth.Authentication;
+
+/**
+ * 
+ * An extension of {@link TwitterSource} by filter parameters. This extension
+ * enables to filter the twitter stream by user defined parameters.
+ * 
+ * @author Hilmi Yildirim
--- End diff --

Could you remove @author javadoc annotation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/695#issuecomment-103579307
  
That would be really great, thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2012][gelly] Added methods to remove/ad...

2015-05-19 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/678#issuecomment-103574952
  
Thanks for this PR @andralungu!

There are a few things we need to consider here I think:
- when to have a `DataSet` argument and when to have a `List`. For example, 
`addVertices` receives a `DataSet` of vertices to add, but a `List` of edges, 
`addEdges` receives two datasets...
- do we really need two `addEdges` methods? If I understand correctly, the 
second `addEdges` method is to be used when the new edges connect existing 
vertices? What will (should) happen if there are edges with invalid ids (not in 
the vertex set)?
- could we maybe have only `addVertices`, `removeVertices`, `addEdges` etc. 
methods to cover also the single addition/deletion cases?
- the remove methods' implementations can be simplified by using a single 
coGroup: if only the first group (existing vertices) contains an element you 
keep it, otherwise if both groups contain it you remove it (you are actually 
emulating this with the tag+union+reduce :)); see the sketch below.
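
A fragment sketching the single-coGroup removal ({{Tuple2<Long, Double>}}
stands in for the vertex type, and the dataset names are assumptions; not the
eventual Gelly code):

{code}
// Keep a vertex only if it exists and is NOT in the removal set:
// a set difference in one coGroup.
vertices.coGroup(verticesToRemove).where(0).equalTo(0)
    .with(new CoGroupFunction<Tuple2<Long, Double>, Tuple2<Long, Double>,
                              Tuple2<Long, Double>>() {
        @Override
        public void coGroup(Iterable<Tuple2<Long, Double>> existing,
                            Iterable<Tuple2<Long, Double>> toRemove,
                            Collector<Tuple2<Long, Double>> out) {
            // Emit the vertex only when nothing matched in the removal set.
            if (!toRemove.iterator().hasNext()) {
                for (Tuple2<Long, Double> v : existing) {
                    out.collect(v);
                }
            }
        }
    });
{code}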

What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2012) addVertices, addEdges, removeVertices, removeEdges methods

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550693#comment-14550693
 ] 

ASF GitHub Bot commented on FLINK-2012:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/678#issuecomment-103574952
  
Thanks for this PR @andralungu!

There are a few things we need to consider here I think:
- when to have a `DataSet` argument and when to have a `List`. For example, 
`addVertices` receives a `DataSet` of vertices to add, but a `List` of edges, 
`addEdges` receives two datasets...
- do we really need two `addEdges` methods? If I understand correctly, the 
second `addEdges` method is to be used when the new edges connect existing 
vertices? What will (should) happen if there are edges with invalid ids (not in 
the vertex set)?
- could we maybe have only `addVertices`, `removeVertices`, `addEdges` etc. 
methods to cover also the single addition/deletion cases?
- the remove methods' implementations can be simplified by using a single 
coGroup: if only the first group (existing vertices) contains an element you 
keep it, otherwise if both groups contain it you remove it (you are actually 
emulating this with the tag+union+reduce :)).

What do you think?


> addVertices, addEdges, removeVertices, removeEdges methods
> --
>
> Key: FLINK-2012
> URL: https://issues.apache.org/jira/browse/FLINK-2012
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 0.9
>Reporter: Andra Lungu
>Assignee: Andra Lungu
>Priority: Minor
>
> Currently, Gelly only allows the addition/deletion of one vertex/edge at a 
> time. If a user would want to add two (or more) vertices, he/she would need 
> to add a vertex-> create a new graph; then add another vertex -> another 
> graph etc.  
> It would be nice to also have addVertices, addEdges, removeVertices, 
> removeEdges methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (FLINK-2035) Update 0.9 roadmap with ML issues

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann resolved FLINK-2035.
--
Resolution: Fixed

Marked corresponding jira issues and added them to the google doc.

> Update 0.9 roadmap with ML issues
> -
>
> Key: FLINK-2035
> URL: https://issues.apache.org/jira/browse/FLINK-2035
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Theodore Vasiloudis
>Assignee: Till Rohrmann
> Fix For: 0.9
>
>
> The [current 
> list|https://issues.apache.org/jira/browse/FLINK-2001?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%200.9%20AND%20component%20%3D%20%22Machine%20Learning%20Library%22]
>  of issues linked with the 0.9 release is quite limited.
> We should go through the current ML issues and assign fix versions, so that 
> we have a clear view of what we expect to have in 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2050) Add pipelining mechanism for chainable transformers and estimators

2015-05-19 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-2050:


 Summary: Add pipelining mechanism for chainable transformers and 
estimators
 Key: FLINK-2050
 URL: https://issues.apache.org/jira/browse/FLINK-2050
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 0.9


The key concept of an easy-to-use ML library is the quick and simple 
construction of data analysis pipelines. Scikit-learn's approach of defining 
transformers and estimators seems to be a really good solution to this 
problem. I propose to follow a similar path, because it makes FlinkML flexible 
in terms of code reuse and easy to use for people coming from Scikit-learn.
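
One way the chaining could look, as an illustrative Java sketch (the interface
and method names are assumptions, not the FlinkML design):

{code}
// Illustrative only: a transformer whose output feeds the next stage.
interface Transformer<IN, OUT> {
    OUT transform(IN input);

    // Chaining yields a new transformer that runs both stages in order.
    default <OUT2> Transformer<IN, OUT2> chain(Transformer<OUT, OUT2> next) {
        return input -> next.transform(this.transform(input));
    }
}
{code}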



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2026) Error message in count() only jobs

2015-05-19 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550585#comment-14550585
 ] 

Sebastian Schelter commented on FLINK-2026:
---

In that case, it should be fine to show an error message, as there are pending 
operations. In my case there are none, so there is no need for an error message.

> Error message in count() only jobs
> --
>
> Key: FLINK-2026
> URL: https://issues.apache.org/jira/browse/FLINK-2026
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>Priority: Minor
>
> If I run a job that only calls count() on a dataset (which is a valid data 
> flow IMHO), Flink executes the job but complains that no sinks are defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1992:
-
Fix Version/s: 0.9

> Add convergence criterion to SGD optimizer
> --
>
> Key: FLINK-1992
> URL: https://issues.apache.org/jira/browse/FLINK-1992
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>Priority: Minor
>  Labels: ML
> Fix For: 0.9
>
>
> Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
> would be good to support a dynamic convergence criterion, too.
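
One common dynamic criterion, sketched for illustration (the general idea
only, not FlinkML's eventual implementation): stop when the relative change of
the weight vector between iterations falls below a tolerance.

{code}
// Sketch: relative-change convergence test between two weight vectors.
static boolean converged(double[] prev, double[] curr, double tol) {
    double diffSq = 0.0, normSq = 0.0;
    for (int i = 0; i < prev.length; i++) {
        double d = curr[i] - prev[i];
        diffSq += d * d;
        normSq += prev[i] * prev[i];
    }
    return Math.sqrt(diffSq) <= tol * Math.max(Math.sqrt(normSq), 1e-9);
}
{code}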



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1993) Replace MultipleLinearRegression's custom SGD with optimization framework's SGD

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1993:
-
Fix Version/s: 0.9

> Replace MultipleLinearRegression's custom SGD with optimization framework's 
> SGD
> ---
>
> Key: FLINK-1993
> URL: https://issues.apache.org/jira/browse/FLINK-1993
> Project: Flink
>  Issue Type: Task
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>Priority: Minor
>  Labels: ML
> Fix For: 0.9
>
>
> The current implementation of MultipleLinearRegression uses a custom SGD 
> implementation. Flink's optimization framework also contains a SGD optimizer 
> which should replace the custom implementation once the framework is merged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2030:
-
Assignee: Sachin Goel

> Implement an online histogram with Merging and equalization features
> 
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
>  Issue Type: Sub-task
>  Components: Machine Learning Library
>Reporter: Sachin Goel
>Assignee: Sachin Goel
>Priority: Minor
>  Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging, and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
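
The core update-and-merge step from [1], sketched in Java (bin bookkeeping
simplified; not the implementation this issue will produce): insert the point
as a unit bin, then merge the two closest bins whenever the budget is exceeded.

{code}
import java.util.ArrayList;
import java.util.List;

class OnlineHistogram {
    final int maxBins;
    final List<double[]> bins = new ArrayList<>(); // {centroid, count}, sorted

    OnlineHistogram(int maxBins) { this.maxBins = maxBins; }

    void add(double x) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < x) i++;
        bins.add(i, new double[]{x, 1});             // insert as a unit bin
        if (bins.size() > maxBins) {
            int j = 0;                               // pair with smallest gap
            for (int k = 1; k < bins.size() - 1; k++) {
                if (bins.get(k + 1)[0] - bins.get(k)[0]
                        < bins.get(j + 1)[0] - bins.get(j)[0]) j = k;
            }
            double[] a = bins.get(j), b = bins.remove(j + 1);
            a[0] = (a[0] * a[1] + b[0] * b[1]) / (a[1] + b[1]); // merged centroid
            a[1] += b[1];                                       // merged count
        }
    }
}
{code}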



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2044) Implementation of Gelly HITS Algorithm

2015-05-19 Thread Ahamd Javid (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahamd Javid updated FLINK-2044:
---
Summary: Implementation of Gelly HITS Algorithm  (was: Implementation of 
Gelly Algorithm HITS)

> Implementation of Gelly HITS Algorithm
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
> Attachments: HitsMain.java, Hits_Class.java
>
>
> Implementation of Hits Algorithm in Gelly API using Java 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2044) Implementation of Gelly Algorithm HITS

2015-05-19 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550568#comment-14550568
 ] 

Alexander Alexandrov commented on FLINK-2044:
-

The [feature branch can be found 
here|https://github.com/JavidMayar/flink/commits/HITS].

> Implementation of Gelly Algorithm HITS
> --
>
> Key: FLINK-2044
> URL: https://issues.apache.org/jira/browse/FLINK-2044
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Reporter: Ahamd Javid
>Priority: Minor
> Attachments: HitsMain.java, Hits_Class.java
>
>
> Implementation of the HITS algorithm in the Gelly API using Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-1971) Add user function type to StreamOperator

2015-05-19 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek closed FLINK-1971.
---
   Resolution: Fixed
Fix Version/s: 0.9

Resolved in 
https://github.com/apache/flink/commit/58865ff378720149134a93c650f2765e25bd1fb3

> Add user function type to StreamOperator
> 
>
> Key: FLINK-1971
> URL: https://issues.apache.org/jira/browse/FLINK-1971
> Project: Flink
>  Issue Type: Improvement
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>Priority: Minor
> Fix For: 0.9
>
>
> Right now, StreamOperator has a field userFunction of type Function. If we 
> specialise this using an additional generic parameter then the derived stream 
> operators don't need to use manual casting every time they access the user 
> function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-1977) Rework Stream Operators to always be push based

2015-05-19 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek closed FLINK-1977.
---
   Resolution: Fixed
Fix Version/s: 0.9

Fixed in 
https://github.com/apache/flink/commit/58865ff378720149134a93c650f2765e25bd1fb3

> Rework Stream Operators to always be push based
> ---
>
> Key: FLINK-1977
> URL: https://issues.apache.org/jira/browse/FLINK-1977
> Project: Flink
>  Issue Type: Improvement
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
> Fix For: 0.9
>
>
> This is a result of the discussion on the mailing list. The following excerpt 
> gives the basic idea of the change:
> I propose to change all streaming operators to be push based, with a
> slightly improved interface: In addition to collect(), which I would
> call receiveElement() I would add receivePunctuation() and
> receiveBarrier(). The first operator in the chain would also get data
> from the outside invokable that reads from the input iterator and
> calls receiveElement() for the first operator in a chain.
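For illustration, the proposed interface could look roughly like this (a Scala sketch; the method names come from the excerpt above, the parameter types are assumptions):

{code}
// Sketch of a push-based operator: elements, punctuations and barriers
// are all pushed into the operator instead of being pulled by it.
trait PushBasedOperator[IN] {
  def receiveElement(element: IN): Unit        // replaces collect()
  def receivePunctuation(punctuation: Long): Unit
  def receiveBarrier(barrierId: Long): Unit
}

// The head of a chain is driven from the outside: an invokable reads
// the input iterator and pushes every element into the first operator.
def drive[IN](input: Iterator[IN], head: PushBasedOperator[IN]): Unit =
  input.foreach(head.receiveElement)
{code}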



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-2009) Time-Based Windows fail with Chaining Disabled

2015-05-19 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek closed FLINK-2009.
---
Resolution: Fixed

Fixed in 
https://github.com/apache/flink/commit/a04f091217cedccb00bb21a4686350e7709adea9

> Time-Based Windows fail with Chaining Disabled
> --
>
> Key: FLINK-2009
> URL: https://issues.apache.org/jira/browse/FLINK-2009
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> When disabling chaining a streaming job with a time-based window fails right 
> away with this:
> {code}
> 14:17:50,917 ERROR org.apache.flink.streaming.api.collector.StreamOutput  
>- Emit failed due to: org.apache.flink.types.NullFieldException: Field 0 
> is null, but expected to hold a value.
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:118)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:30)
>   at 
> org.apache.flink.streaming.runtime.streamrecord.StreamRecordSerializer.serialize(StreamRecordSerializer.java:89)
>   at 
> org.apache.flink.streaming.runtime.streamrecord.StreamRecordSerializer.serialize(StreamRecordSerializer.java:29)
>   at 
> org.apache.flink.runtime.plugable.SerializationDelegate.write(SerializationDelegate.java:51)
>   at 
> org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer.addRecord(SpanningRecordSerializer.java:76)
>   at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:83)
>   at 
> org.apache.flink.streaming.api.collector.StreamOutput.collect(StreamOutput.java:63)
>   at 
> org.apache.flink.streaming.api.collector.CollectorWrapper.collect(CollectorWrapper.java:39)
>   at 
> org.apache.flink.streaming.api.operators.windowing.StreamDiscretizer.emitWindow(StreamDiscretizer.java:143)
>   at 
> org.apache.flink.streaming.api.operators.windowing.StreamDiscretizer.triggerOnFakeElement(StreamDiscretizer.java:131)
>   at 
> org.apache.flink.streaming.api.operators.windowing.StreamDiscretizer$WindowingCallback.sendFakeElement(StreamDiscretizer.java:194)
>   at 
> org.apache.flink.streaming.api.windowing.policy.TimeTriggerPolicy.activeFakeElementEmission(TimeTriggerPolicy.java:121)
>   at 
> org.apache.flink.streaming.api.windowing.policy.TimeTriggerPolicy$TimeCheck.run(TimeTriggerPolicy.java:148)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I assume it is because the window events that get emitted for fake elements 
> are not properly filled. This does not surface if chaining is enabled because 
> then the window events don't pass through the serialisation stack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1977) Rework Stream Operators to always be push based

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550549#comment-14550549
 ] 

ASF GitHub Bot commented on FLINK-1977:
---

Github user aljoscha closed the pull request at:

https://github.com/apache/flink/pull/659


> Rework Stream Operators to always be push based
> ---
>
> Key: FLINK-1977
> URL: https://issues.apache.org/jira/browse/FLINK-1977
> Project: Flink
>  Issue Type: Improvement
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> This is a result of the discussion on the mailing list. The following excerpt 
> gives the basic idea of the change:
> I propose to change all streaming operators to be push based, with a
> slightly improved interface: In addition to collect(), which I would
> call receiveElement() I would add receivePunctuation() and
> receiveBarrier(). The first operator in the chain would also get data
> from the outside invokable that reads from the input iterator and
> calls receiveElement() for the first operator in a chain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1977) Rework Stream Operators to always be push based

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550548#comment-14550548
 ] 

ASF GitHub Bot commented on FLINK-1977:
---

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/659#issuecomment-103530579
  
Manually merged


> Rework Stream Operators to always be push based
> ---
>
> Key: FLINK-1977
> URL: https://issues.apache.org/jira/browse/FLINK-1977
> Project: Flink
>  Issue Type: Improvement
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> This is a result of the discussion on the mailing list. The following excerpt 
> gives the basic idea of the change:
> I propose to change all streaming operators to be push based, with a
> slightly improved interface: In addition to collect(), which I would
> call receiveElement() I would add receivePunctuation() and
> receiveBarrier(). The first operator in the chain would also get data
> from the outside invokable that reads from the input iterator and
> calls receiveElement() for the first operator in a chain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1977] Rework Stream Operators to always...

2015-05-19 Thread aljoscha
Github user aljoscha closed the pull request at:

https://github.com/apache/flink/pull/659


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-1977] Rework Stream Operators to always...

2015-05-19 Thread aljoscha
Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/659#issuecomment-103530579
  
Manually merged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (FLINK-2034) Add vision and roadmap for ML library to docs

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2034:
-
Fix Version/s: 0.9

> Add vision and roadmap for ML library to docs
> -
>
> Key: FLINK-2034
> URL: https://issues.apache.org/jira/browse/FLINK-2034
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Theodore Vasiloudis
>Assignee: Theodore Vasiloudis
>  Labels: ML
> Fix For: 0.9
>
>
> We should have a document describing the vision of the Machine Learning 
> library in Flink and an up-to-date roadmap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2035) Update 0.9 roadmap with ML issues

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2035:
-
Fix Version/s: 0.9

> Update 0.9 roadmap with ML issues
> -
>
> Key: FLINK-2035
> URL: https://issues.apache.org/jira/browse/FLINK-2035
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Theodore Vasiloudis
>Assignee: Till Rohrmann
> Fix For: 0.9
>
>
> The [current 
> list|https://issues.apache.org/jira/browse/FLINK-2001?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%200.9%20AND%20component%20%3D%20%22Machine%20Learning%20Library%22]
>  of issues linked with the 0.9 release is quite limited.
> We should go through the current ML issues and assign fix versions, so that 
> we have a clear view of what we expect to have in 0.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2049) KafkaSink sporadically fails to send message

2015-05-19 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-2049:
-

 Summary: KafkaSink sporadically fails to send message
 Key: FLINK-2049
 URL: https://issues.apache.org/jira/browse/FLINK-2049
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector, Streaming
Affects Versions: 0.9
Reporter: Robert Metzger


This test https://travis-ci.org/StephanEwen/incubator-flink/jobs/63147661 
failed with:
{code}
10:38:22,415 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask  
 - StreamOperator failed due to: java.lang.RuntimeException: 
java.lang.RuntimeException: kafka.common.FailedToSendMessageException: Failed 
to send messages after 10 tries.
at 
org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:142)
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:34)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:139)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: 
kafka.common.FailedToSendMessageException: Failed to send messages after 10 
tries.
at 
org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:142)
at 
org.apache.flink.streaming.api.operators.ChainableStreamOperator.collect(ChainableStreamOperator.java:54)
at 
org.apache.flink.streaming.api.collector.CollectorWrapper.collect(CollectorWrapper.java:39)
at 
org.apache.flink.streaming.connectors.kafka.KafkaITCase$3.run(KafkaITCase.java:326)
at 
org.apache.flink.streaming.api.operators.StreamSource.callUserFunction(StreamSource.java:40)
at 
org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:137)
... 4 more
Caused by: kafka.common.FailedToSendMessageException: Failed to send messages 
after 10 tries.
at 
kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
at kafka.producer.Producer.send(Producer.scala:77)
at kafka.javaapi.producer.Producer.send(Producer.scala:33)
at 
org.apache.flink.streaming.connectors.kafka.api.KafkaSink.invoke(KafkaSink.java:183)
at 
org.apache.flink.streaming.api.operators.StreamSink.callUserFunction(StreamSink.java:39)
at 
org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:137)
... 9 more
{code}

I've extracted the relevant logs: 
https://gist.github.com/rmetzger/ddbb0fead5efdd58a539.

The error comes from Kafka's producer code. We are not doing much in our Kafka 
Sink, so I don't think this is really a Flink issue.

When the issue occurs again, I'll write to the Kafka list to seek help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread Theodore Vasiloudis (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Theodore Vasiloudis updated FLINK-1992:
---
Priority: Minor  (was: Major)

> Add convergence criterion to SGD optimizer
> --
>
> Key: FLINK-1992
> URL: https://issues.apache.org/jira/browse/FLINK-1992
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>Priority: Minor
>  Labels: ML
>
> Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
> would be good to support a dynamic convergence criterion, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2007) Initial data point in Delta function needs to be serializable

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550482#comment-14550482
 ] 

ASF GitHub Bot commented on FLINK-2007:
---

GitHub user mbalassi opened a pull request:

https://github.com/apache/flink/pull/697

[FLINK-2007] [streaming] Proper Delta policy serialization

The initial value passed to the Delta policy has to be serialized to be 
transferred to the cluster. This change adds the standard Flink serializers for 
that job. The API is preserved on the level of the `Delta`, but the 
`DeltaPolicy` itself explicitly needs the serializer.

Test4 in `ComplexIntegrationTest` demonstrates and tests my change in action.
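For illustration, the core idea could be sketched as follows (plain Scala; the Serializer trait here is a stand-in, not Flink's actual TypeSerializer API):

{code}
// Ship the initial value as serializer-produced bytes instead of
// requiring the value itself to be java.io.Serializable.
trait Serializer[T] extends Serializable {
  def serialize(value: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}

class DeltaPolicySketch[T](initialBytes: Array[Byte], serializer: Serializer[T])
    extends Serializable {
  // Reconstructed lazily on the worker, after the policy was shipped.
  @transient private lazy val initialValue: T = serializer.deserialize(initialBytes)
  def startValue: T = initialValue
}
{code}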

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbalassi/flink delta-serialize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/697.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #697


commit 9d31a3a6ca4512d3031a38154bf1475c52375b64
Author: mbalassi 
Date:   2015-05-19T13:51:34Z

[FLINK-2007] [streaming] Proper Delta policy serialization




> Initial data point in Delta function needs to be serializable
> -
>
> Key: FLINK-2007
> URL: https://issues.apache.org/jira/browse/FLINK-2007
> Project: Flink
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 0.9
>Reporter: Márton Balassi
>Assignee: Márton Balassi
>  Labels: serialization, windowing
> Fix For: 0.9
>
>
> Currently we expect the data point passed to the delta function to be 
> serializable, which breaks the serialization assumptions provided in other 
> parts of the code.
> This information should be properly serialized by Flink serializers instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2007] [streaming] Proper Delta policy s...

2015-05-19 Thread mbalassi
GitHub user mbalassi opened a pull request:

https://github.com/apache/flink/pull/697

[FLINK-2007] [streaming] Proper Delta policy serialization

The initial value passed to the Delta policy has to be serialized to be 
transferred to the cluster. This change adds the standard Flink serializers for 
that job. The API is preserved on the level of the `Delta`, but the 
`DeltaPolicy` itself explicitly needs the serializer.

Test4 in `ComplexIntegrationTest` demonstrates and tests my change in action.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbalassi/flink delta-serialize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/697.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #697


commit 9d31a3a6ca4512d3031a38154bf1475c52375b64
Author: mbalassi 
Date:   2015-05-19T13:51:34Z

[FLINK-2007] [streaming] Proper Delta policy serialization




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1733:
-
Description: 
Dimension reduction is a crucial prerequisite for many data analysis tasks. 
Therefore, Flink's machine learning library should contain a principal 
components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
propose a distributed PCA. A more recent publication describes another 
scalable PCA implementation.

Resources:
[1] [http://arxiv.org/pdf/1408.5823v5.pdf]
[2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]

  was:
Dimension reduction is a crucial prerequisite for many data analysis tasks. 
Therefore, Flink's machine learning library should contain a principal 
components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
propose a distributed PCA.

Resources:
[1] [http://arxiv.org/pdf/1408.5823v5.pdf]


> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]
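As a reminder of the core computation involved, here is a hedged power-iteration sketch for the leading principal component of a small, already-centered data matrix (plain Scala; real distributed PCA, e.g. [2], is considerably more involved):

{code}
// Power iteration: repeatedly apply X^T X to a unit vector; the vector
// converges to the top eigenvector of the covariance, i.e. the leading
// principal component (assuming the data rows are mean-centered).
def leadingComponent(data: Array[Array[Double]], iters: Int = 100): Array[Double] = {
  val d = data.head.length
  var v = Array.fill(d)(1.0 / math.sqrt(d.toDouble))
  for (_ <- 1 to iters) {
    val w = new Array[Double](d)
    for (row <- data) {
      // w += row * (row . v), i.e. (X^T X) v computed row by row
      val proj = row.zip(v).map { case (x, y) => x * y }.sum
      for (j <- 0 until d) w(j) += row(j) * proj
    }
    val norm = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / norm)
  }
  v
}
{code}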



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1733:
-
Description: 
Dimension reduction is a crucial prerequisite for many data analysis tasks. 
Therefore, Flink's machine learning library should contain a principal 
components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
propose a distributed PCA. A more recent publication [2] describes another 
scalable PCA implementation.

Resources:
[1] [http://arxiv.org/pdf/1408.5823v5.pdf]
[2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]

  was:
Dimension reduction is a crucial prerequisite for many data analysis tasks. 
Therefore, Flink's machine learning library should contain a principal 
components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
propose a distributed PCA. A more recent publication describes another 
scalable PCA implementation.

Resources:
[1] [http://arxiv.org/pdf/1408.5823v5.pdf]
[2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]


> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>Priority: Minor
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication [2] describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on the pull request:

https://github.com/apache/flink/pull/695#issuecomment-103499923
  
It is possible to implement an integration test. For example, I can 
track a term and then check whether all tweets contain the term. If 
you want, I can implement such tests.

Best Regards,
Hilmi

On 19.05.2015 at 15:39, Robert Metzger wrote:
>
> Thank you for working on the issue. I'll soon review your changes in 
> detail.
> From what I saw by scrolling over it, it looks very good (with log 
> messages and documentation).
> One thing that caught my attention was that the included test is not 
> really a unit or integration test. It's more like an example.
> Do you think there is a way to turn this into a real JUnit test?
>





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550463#comment-14550463
 ] 

Till Rohrmann commented on FLINK-1733:
--

I found a new publication on scalable PCA computation 
[http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf].

> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>Priority: Minor
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication [2] describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-1733:
-
Priority: Minor  (was: Major)

> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>Priority: Minor
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> propose a distributed PCA. A more recent publication describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1979) Implement Loss Functions

2015-05-19 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550452#comment-14550452
 ] 

Till Rohrmann commented on FLINK-1979:
--

The easiest way would probably be to simply cherry-pick your commits onto the 
current master branch and update the pull request.

> Implement Loss Functions
> 
>
> Key: FLINK-1979
> URL: https://issues.apache.org/jira/browse/FLINK-1979
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Johannes Günther
>Assignee: Johannes Günther
>Priority: Minor
>  Labels: ML
>
> For convex optimization problems, optimizer methods like SGD rely on a 
> pluggable implementation of a loss function and its first derivative.
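For illustration, a sketch of such a pluggable loss (plain Scala; the trait shape is an assumption, only the SquaredLoss name appears elsewhere in these threads):

{code}
// A loss contributes its value and its first derivative w.r.t. the
// prediction; SGD only needs the derivative to compute gradients.
trait LossFunction {
  def loss(prediction: Double, label: Double): Double
  def derivative(prediction: Double, label: Double): Double
}

object SquaredLossSketch extends LossFunction {
  def loss(p: Double, y: Double): Double = 0.5 * (p - y) * (p - y)
  def derivative(p: Double, y: Double): Double = p - y
}
{code}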



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2046) Quickstarts are missing on the new Website

2015-05-19 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-2046:
--
Component/s: Quickstarts

> Quickstarts are missing on the new Website
> --
>
> Key: FLINK-2046
> URL: https://issues.apache.org/jira/browse/FLINK-2046
> Project: Flink
>  Issue Type: Bug
>  Components: Project Website, Quickstarts
>Reporter: Stephan Ewen
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1999) TF-IDF transformer

2015-05-19 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550448#comment-14550448
 ] 

Till Rohrmann commented on FLINK-1999:
--

You probably have to apply a trick similar to the {{FeatureHasher}} to map a 
word from a dictionary of unknown size to a fixed-length vector. Alternatively, 
you first compute the dictionary of all known words and the corresponding 
mapping.
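For illustration, a minimal sketch of the hashing trick mentioned above (plain Scala, not the {{FeatureHasher}} implementation):

{code}
// Map words into a fixed-length vector without knowing the dictionary
// up front: each word is hashed into one of numFeatures buckets.
def hashFeatures(words: Seq[String], numFeatures: Int): Array[Double] = {
  val vec = new Array[Double](numFeatures)
  for (w <- words) {
    // keep the index non-negative even for Int.MinValue hash codes
    val idx = ((w.hashCode % numFeatures) + numFeatures) % numFeatures
    vec(idx) += 1.0
  }
  vec
}
{code}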

> TF-IDF transformer
> --
>
> Key: FLINK-1999
> URL: https://issues.apache.org/jira/browse/FLINK-1999
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Ronny Bräunlich
>Assignee: Alexander Alexandrov
>Priority: Minor
>  Labels: ML
>
> Hello everybody,
> we are a group of three students from TU Berlin (I guess we're not the first 
> group creating an issue) and we want to/have to implement a TF-IDF transformer 
> for Flink.
> Our lecturer Alexander told us that we could get some guidance here and that 
> you could point us to an old version of a similar transformer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/695#issuecomment-103498471
  
Thank you for working on the issue. I'll soon review your changes in detail.
From what I saw by scrolling over it, it looks very good (with log messages 
and documentation).
One thing that caught my attention was that the included test is not really 
a unit or integration test. It's more like an example.
Do you think there is a way to turn this into a real JUnit test?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2048) Enhance Twitter Stream support

2015-05-19 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550446#comment-14550446
 ] 

Robert Metzger commented on FLINK-2048:
---

Hi [~HilmiYildirim], I gave you "Contributor" permissions in our JIRA to assign 
you to this issue.

Thank you for improving the TwitterSource. I recently came across the code as 
well and found it in pretty bad shape. I opened FLINK-1964 back then, but I 
agree that we need this second issue as well.
I'll have a look at your pull request soon...

> Enhance Twitter Stream support
> --
>
> Key: FLINK-2048
> URL: https://issues.apache.org/jira/browse/FLINK-2048
> Project: Flink
>  Issue Type: Task
>  Components: Streaming
>Affects Versions: master
>Reporter: Hilmi Yildirim
>Assignee: Hilmi Yildirim
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Flink does not have real Twitter support. It only has a TwitterSource, which 
> uses a sample stream that cannot be used properly for analysis. It is possible 
> to use external tools (e.g., Kafka) to create streams, but it is beneficial to 
> create a proper Twitter stream in Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2048) Enhance Twitter Stream support

2015-05-19 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-2048:
--
Assignee: Hilmi Yildirim

> Enhance Twitter Stream support
> --
>
> Key: FLINK-2048
> URL: https://issues.apache.org/jira/browse/FLINK-2048
> Project: Flink
>  Issue Type: Task
>  Components: Streaming
>Affects Versions: master
>Reporter: Hilmi Yildirim
>Assignee: Hilmi Yildirim
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Flink does not have real Twitter support. It only has a TwitterSource, which 
> uses a sample stream that cannot be used properly for analysis. It is possible 
> to use external tools (e.g., Kafka) to create streams, but it is beneficial to 
> create a proper Twitter stream in Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550425#comment-14550425
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

GitHub user chiwanpark opened a pull request:

https://github.com/apache/flink/pull/696

[FLINK-1745] [ml] [WIP] Add exact k-nearest-neighbours algorithm to machine 
learning library

This PR is not final but work in progress. You can find a detailed description 
in [JIRA](https://issues.apache.org/jira/browse/FLINK-1745).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chiwanpark/flink FLINK-1745

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/696.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #696


commit 804abe12c9503265171291a2841332341fe972be
Author: Chiwan Park 
Date:   2015-05-15T05:45:50Z

kNN join first draft




> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Chiwan Park
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a mean to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
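For reference, the exact computation at the core of the issue, stripped of the blocking/partitioning that H-BNLJ and H-BRJ add on top (a plain-Scala sketch, not the PR's code):

{code}
// Brute-force exact kNN: rank all training points by Euclidean
// distance to the query and return the labels of the k closest ones.
def knn(train: Seq[(Array[Double], String)], query: Array[Double], k: Int): Seq[String] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  train.sortBy { case (p, _) => dist(p, query) }.take(k).map(_._2)
}
{code}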



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...

2015-05-19 Thread chiwanpark
GitHub user chiwanpark opened a pull request:

https://github.com/apache/flink/pull/696

[FLINK-1745] [ml] [WIP] Add exact k-nearest-neighbours algorithm to machine 
learning library

This PR is not final but work in progress. You can find a detailed description 
in [JIRA](https://issues.apache.org/jira/browse/FLINK-1745).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chiwanpark/flink FLINK-1745

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/696.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #696


commit 804abe12c9503265171291a2841332341fe972be
Author: Chiwan Park 
Date:   2015-05-15T05:45:50Z

kNN join first draft




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550385#comment-14550385
 ] 

ASF GitHub Bot commented on FLINK-1992:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30595561
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/optimization/GradientDescentITSuite.scala
 ---
@@ -205,6 +190,61 @@ class GradientDescentITSuite extends FlatSpec with 
Matchers with FlinkTestBase {
 
   }
 
-  // TODO: Need more corner cases
+  it should "terminate early if the convergence criterion is reached" in {
+// TODO(tvas): We need a better way to check the convergence of the 
weights.
+// Ideally we want to have a Breeze-like system, where the optimizers 
carry a history and that
+// can tell us whether we have converged and at which iteration
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+env.setParallelism(2)
+
+val sgdEarlyTerminate = GradientDescent()
+  .setConvergenceThreshold(1e2)
+  .setStepsize(1.0)
+  .setIterations(800)
+  .setLossFunction(SquaredLoss())
+  .setRegularizationType(NoRegularization())
+  .setRegularizationParameter(0.0)
+
+val inputDS = env.fromCollection(data)
+
+val weightDSEarlyTerminate = sgdEarlyTerminate.optimize(inputDS, None)
+
+val weightListEarly: Seq[WeightVector] = 
weightDSEarlyTerminate.collect()
+
+weightListEarly.size should equal(1)
+
+val weightVectorEarly: WeightVector = weightListEarly.head
+val weightsEarly = 
weightVectorEarly.weights.asInstanceOf[DenseVector].data
+val weight0Early = weightVectorEarly.intercept
+
+val sgdNoConvergence = GradientDescent()
+  .setStepsize(1.0)
--- End diff --

One solution is for the set*Parameter methods to return this.type instead 
of Solver, IterativeSolver etc.

Any objections to using this approach?
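For illustration, what returning this.type buys (a Scala sketch with hypothetical class names, not the actual Solver hierarchy):

{code}
// Setters declared on the base class return the concrete subtype, so
// chained calls keep compiling regardless of the call order.
abstract class IterativeSolverSketch {
  protected var iterations: Int = 10
  def setIterations(n: Int): this.type = { iterations = n; this }
}

class GradientDescentSketch extends IterativeSolverSketch {
  private var stepsize: Double = 0.1
  def setStepsize(s: Double): this.type = { stepsize = s; this }
}

object UsageSketch {
  // setIterations returns GradientDescentSketch here, not the base class:
  val sgd = new GradientDescentSketch().setIterations(800).setStepsize(1.0)
}
{code}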


> Add convergence criterion to SGD optimizer
> --
>
> Key: FLINK-1992
> URL: https://issues.apache.org/jira/browse/FLINK-1992
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>  Labels: ML
>
> Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
> would be good to support a dynamic convergence criterion, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1992] [ml] Add convergence criterion to...

2015-05-19 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30595561
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/optimization/GradientDescentITSuite.scala
 ---
@@ -205,6 +190,61 @@ class GradientDescentITSuite extends FlatSpec with 
Matchers with FlinkTestBase {
 
   }
 
-  // TODO: Need more corner cases
+  it should "terminate early if the convergence criterion is reached" in {
+// TODO(tvas): We need a better way to check the convergence of the 
weights.
+// Ideally we want to have a Breeze-like system, where the optimizers 
carry a history and that
+// can tell us whether we have converged and at which iteration
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+env.setParallelism(2)
+
+val sgdEarlyTerminate = GradientDescent()
+  .setConvergenceThreshold(1e2)
+  .setStepsize(1.0)
+  .setIterations(800)
+  .setLossFunction(SquaredLoss())
+  .setRegularizationType(NoRegularization())
+  .setRegularizationParameter(0.0)
+
+val inputDS = env.fromCollection(data)
+
+val weightDSEarlyTerminate = sgdEarlyTerminate.optimize(inputDS, None)
+
+val weightListEarly: Seq[WeightVector] = 
weightDSEarlyTerminate.collect()
+
+weightListEarly.size should equal(1)
+
+val weightVectorEarly: WeightVector = weightListEarly.head
+val weightsEarly = 
weightVectorEarly.weights.asInstanceOf[DenseVector].data
+val weight0Early = weightVectorEarly.intercept
+
+val sgdNoConvergence = GradientDescent()
+  .setStepsize(1.0)
--- End diff --

One solution is for the set*Parameter methods to return this.type instead 
of Solver, IterativeSolver etc.

Any objections to using this approach?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1848) Paths containing a Windows drive letter cannot be used in FileOutputFormats

2015-05-19 Thread Laurent Tardif (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550382#comment-14550382
 ] 

Laurent Tardif commented on FLINK-1848:
---

There is also a small issue in the tests that validate the FileOutputFormat class.
For example, the method testCreateNonParallelLocalFS uses:
tmpOutFile = new File(tmpOutPath.getAbsolutePath() + "/1");
instead of:
tmpOutFile = new File(tmpOutPath.getAbsolutePath() + File.separatorChar + "1");

As for the bug: the issue happens (for me) in the mkdirs function of the 
LocalFileSystem class, when it does not detect that file:/C/ is the root of the 
filesystem.
It also looks strange not to verify, at the beginning of the function, that the 
parameter is a directory, so that we do not need to recurse up to the root of 
the filesystem to validate the path:
public boolean mkdirs(final Path f) throws IOException {
    if (pathToFile(f).isDirectory()) { return true; }



> Paths containing a Windows drive letter cannot be used in FileOutputFormats
> ---
>
> Key: FLINK-1848
> URL: https://issues.apache.org/jira/browse/FLINK-1848
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.9
> Environment: Windows (Cygwin and native)
>Reporter: Fabian Hueske
>Assignee: Fabian Hueske
>Priority: Critical
> Fix For: 0.9
>
>
> Paths that contain a Windows drive letter such as {{file:///c:/my/directory}} 
> cannot be used as output path for {{FileOutputFormat}}.
> If done, the following exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
> Relative path in absolute URI: file:c:
> at org.apache.flink.core.fs.Path.initialize(Path.java:242)
> at org.apache.flink.core.fs.Path.(Path.java:225)
> at org.apache.flink.core.fs.Path.(Path.java:138)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.pathToFile(LocalFileSystem.java:147)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:232)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.local.LocalFileSystem.mkdirs(LocalFileSystem.java:233)
> at 
> org.apache.flink.core.fs.FileSystem.initOutPathLocalFS(FileSystem.java:603)
> at 
> org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:233)
> at 
> org.apache.flink.api.java.io.CsvOutputFormat.open(CsvOutputFormat.java:158)
> at 
> org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:183)
> at 
> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:c:
> at java.net.URI.checkPath(Unknown Source)
> at java.net.URI.(Unknown Source)
> at org.apache.flink.core.fs.Path.initialize(Path.java:240)
> ... 14 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2030:
-
Priority: Minor  (was: Major)

> Implement an online histogram with Merging and equalization features
> 
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
>  Issue Type: Sub-task
>  Components: Machine Learning Library
>Reporter: Sachin Goel
>Priority: Minor
>  Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging, and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2030) Implement an online histogram with Merging and equalization features

2015-05-19 Thread Till Rohrmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-2030:
-
Labels: ML  (was: )

> Implement an online histogram with Merging and equalization features
> 
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
>  Issue Type: Sub-task
>  Components: Machine Learning Library
>Reporter: Sachin Goel
>  Labels: ML
>
> For the implementation of the decision tree in 
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a 
> histogram with online updates, merging, and equalization features. A reference 
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2048) Enhance Twitter Stream support

2015-05-19 Thread Hilmi Yildirim (JIRA)
Hilmi Yildirim created FLINK-2048:
-

 Summary: Enhance Twitter Stream support
 Key: FLINK-2048
 URL: https://issues.apache.org/jira/browse/FLINK-2048
 Project: Flink
  Issue Type: Task
  Components: Streaming
Affects Versions: master
Reporter: Hilmi Yildirim


Flink does not have real Twitter support. It only has a TwitterSource, which 
uses a sample stream that cannot be used properly for analysis. It is possible 
to use external tools (e.g., Kafka) to create streams, but it is beneficial to 
create a proper Twitter stream in Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1952) Cannot run ConnectedComponents example: Could not allocate a slot on instance

2015-05-19 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550364#comment-14550364
 ] 

Stephan Ewen commented on FLINK-1952:
-

Okay, I am able to reproduce it locally with a local mini cluster using 100 
taskmanagers.

Not so mini any more, that minicluster ;-)

> Cannot run ConnectedComponents example: Could not allocate a slot on instance
> -
>
> Key: FLINK-1952
> URL: https://issues.apache.org/jira/browse/FLINK-1952
> Project: Flink
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Priority: Blocker
>
> Steps to reproduce
> {code}
> ./bin/yarn-session.sh -n 350 
> {code}
> ... wait until they are connected ...
> {code}
> Number of connected TaskManagers changed to 266. Slots available: 266
> Number of connected TaskManagers changed to 323. Slots available: 323
> Number of connected TaskManagers changed to 334. Slots available: 334
> Number of connected TaskManagers changed to 343. Slots available: 343
> Number of connected TaskManagers changed to 350. Slots available: 350
> {code}
> Start CC
> {code}
> ./bin/flink run -p 350 
> ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar
> {code}
> ---> it runs
> Run KMeans, let it fail with 
> {code}
> Failed to deploy the task Map (Map at main(KMeans.java:100)) (1/350) - 
> execution #0 to slot SimpleSlot (2)(2)(0) - 182b7661ca9547a84591de940c47a200 
> - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network 
> buffers: required 350, but only 254 available. The total number of network 
> buffers is currently set to 2048. You can increase this number by setting the 
> configuration key 'taskmanager.network.numberOfBuffers'.
> {code}
> ... as expected.
> (I've waited for 10 minutes between the two submissions)
> Starting CC now will fail:
> {code}
> ./bin/flink run -p 350 
> ./examples/flink-java-examples-0.9-SNAPSHOT-ConnectedComponents.jar 
> {code}
> Error message(s):
> {code}
> Caused by: java.lang.IllegalStateException: Could not schedule consumer 
> vertex IterationHead(WorksetIteration (Unnamed Delta Iteration)) (19/350)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:479)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:469)
>   at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>   ... 4 more
> Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate a slot on instance 4a6d761cb084c32310ece1f849556faf @ 
> cloud-19 - 1 slots - URL: 
> akka.tcp://flink@130.149.21.23:51400/user/taskmanager, as required by the 
> co-location constraint.
>   at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:247)
>   at 
> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:110)
>   at 
> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:262)
>   at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:436)
>   at 
> org.apache.flink.runtime.executiongraph.Execution$3.call(Execution.java:475)
>   ... 9 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2037) Unable to start Python API using ./bin/pyflink*.sh

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550361#comment-14550361
 ] 

ASF GitHub Bot commented on FLINK-2037:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/691#issuecomment-103471022
  
I think this is a hacky fix.

Sounds like the right approach is to have a Python CLI frontend that does 
not expect a jar file.


> Unable to start Python API using ./bin/pyflink*.sh
> --
>
> Key: FLINK-2037
> URL: https://issues.apache.org/jira/browse/FLINK-2037
> Project: Flink
>  Issue Type: Bug
>  Components: Python API
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
>
> Calling {{./bin/pyflink3.sh}} will lead to
> {code}
> ./bin/pyflink3.sh
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> JAR file does not exist: 
> /home/robert/incubator-flink/build-target/bin/../lib/flink-python-0.9-SNAPSHOT.jar
> Use the help option (-h or --help) to get help on the command.
> {code}
> This is due to the script expecting a {{flink-python-0.9-SNAPSHOT.jar}} file 
> to exist in {{lib}} (it's wrong anyway that the version name is included 
> here; it should be replaced by a {{*}}).
> I'll look into the issue ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: Implemented TwitterSourceFilter and adapted Tw...

2015-05-19 Thread HilmiYildirim
GitHub user HilmiYildirim opened a pull request:

https://github.com/apache/flink/pull/695

Implemented TwitterSourceFilter and adapted TwitterSource

I implemented the class TwitterFilterSource, which inherits from 
TwitterSource. It makes it possible to filter Twitter streams, for example by 
terms, languages, and locations. Furthermore, you can add "followings" to the 
stream by adding user IDs to the config; the stream then includes the tweets of 
those users. It is also possible to define query and post parameters.

I had to make some changes in the TwitterSource file: some variables and 
methods changed from private to protected, and TwitterSource now inherits from 
RichSourceFunction instead of RichParallelSourceFunction. 
RichParallelSourceFunction does not make sense here, because with a single 
Twitter handle it is only possible to open two Twitter streams in parallel. 
With TwitterSource inheriting from RichSourceFunction, only one stream is 
opened and everything works fine.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HilmiYildirim/flink TwitterFilterSource

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/695.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #695


commit a89d3aec62a2f5ed3ec8246148ba289f777b9dbb
Author: yildirim 
Date:   2015-05-19T12:27:27Z

Implemented TwitterSourceFilter and adapted TwitterSource




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2037] Provide flink-python.jar in lib/

2015-05-19 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/691#issuecomment-103471022
  
I think this is a hacky fix.

Sounds like the right approach is to have a Python CLI frontend that does 
not expect a jar file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2026) Error message in count() only jobs

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550359#comment-14550359
 ] 

ASF GitHub Bot commented on FLINK-2026:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/686#issuecomment-103470201
  
Looks good, modulo the commas.

+1 to merge after fixing this.


> Error message in count() only jobs
> --
>
> Key: FLINK-2026
> URL: https://issues.apache.org/jira/browse/FLINK-2026
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>Priority: Minor
>
> If I run a job that only calls count() on a dataset (which is a valid data 
> flow IMHO), Flink executes the job but complains that no sinks are defined.
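For illustration, a minimal job that triggers the message (a sketch against the 0.9-era Scala API; count() executes eagerly, so no explicit sink exists when the program plan is finalized):

{code}
import org.apache.flink.api.scala._

object CountOnlyJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // count() runs the job on its own -- a valid data flow without a sink
    val n = env.fromElements(1, 2, 3, 4, 5).count()
    println(s"count = $n")
  }
}
{code}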



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2026] add a flag to indicate previous e...

2015-05-19 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/686#issuecomment-103470201
  
Looks good, modulo the commas.

+1 to merge after fixing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2026) Error message in count() only jobs

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550340#comment-14550340
 ] 

ASF GitHub Bot commented on FLINK-2026:
---

Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/686#discussion_r30592879
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/ExecutionEnvironment.java ---
@@ -914,7 +917,15 @@ public JavaPlan createProgramPlan(String jobName) {
 */
public JavaPlan createProgramPlan(String jobName, boolean clearSinks) {
if (this.sinks.isEmpty()) {
-   throw new RuntimeException("No data sinks have been 
created yet. A program needs at least one sink that consumes data. Examples are 
writing the data set or printing it.");
+   if (wasExecuted) {
+   throw new RuntimeException("No new data sinks 
have been defined since the " +
+   "last execution. The last 
execution refers to the latest call to " +
+   "'execute()', 'count()' 
'collect()' or 'print()'.");
--- End diff --

missing commas


> Error message in count() only jobs
> --
>
> Key: FLINK-2026
> URL: https://issues.apache.org/jira/browse/FLINK-2026
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>Priority: Minor
>
> If I run a job that only calls count() on a dataset (which is a valid data 
> flow IMHO), Flink executes the job but complains that no sinks are defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2026] add a flag to indicate previous e...

2015-05-19 Thread uce
Github user uce commented on a diff in the pull request:

https://github.com/apache/flink/pull/686#discussion_r30592879
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/ExecutionEnvironment.java ---
@@ -914,7 +917,15 @@ public JavaPlan createProgramPlan(String jobName) {
 */
public JavaPlan createProgramPlan(String jobName, boolean clearSinks) {
if (this.sinks.isEmpty()) {
-   throw new RuntimeException("No data sinks have been 
created yet. A program needs at least one sink that consumes data. Examples are 
writing the data set or printing it.");
+   if (wasExecuted) {
+   throw new RuntimeException("No new data sinks 
have been defined since the " +
+   "last execution. The last 
execution refers to the latest call to " +
+   "'execute()', 'count()' 
'collect()' or 'print()'.");
--- End diff --

missing commas


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
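
For context, a compact, self-contained sketch of the flag-based check the 
diff above introduces, with the missing commas added to the message. The 
surrounding class is a simplified, hypothetical stand-in for 
ExecutionEnvironment; only the branching on wasExecuted mirrors the pull 
request.

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in: only the sink list and the wasExecuted flag
// from the diff are modeled here.
class EnvSketch {
    private final List<Object> sinks = new ArrayList<>();
    private boolean wasExecuted = false;

    void checkSinks() {
        if (sinks.isEmpty()) {
            if (wasExecuted) {
                throw new RuntimeException("No new data sinks have been defined since the "
                    + "last execution. The last execution refers to the latest call to "
                    + "'execute()', 'count()', 'collect()' or 'print()'.");
            } else {
                throw new RuntimeException("No data sinks have been created yet. A program "
                    + "needs at least one sink that consumes data.");
            }
        }
    }

    // Called after execute()/count()/collect()/print(): the eager sinks
    // have been consumed, so the list is cleared and the flag is set.
    void markExecuted() {
        wasExecuted = true;
        sinks.clear();
    }
}
{code}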


[jira] [Commented] (FLINK-2026) Error message in count() only jobs

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550337#comment-14550337
 ] 

ASF GitHub Bot commented on FLINK-2026:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/686#issuecomment-103463683
  
Pull request updated.


> Error message in count() only jobs
> --
>
> Key: FLINK-2026
> URL: https://issues.apache.org/jira/browse/FLINK-2026
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>Priority: Minor
>
> If I run a job that only calls count() on a dataset (which is a valid data 
> flow IMHO), Flink executes the job but complains that no sinks are defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550335#comment-14550335
 ] 

ASF GitHub Bot commented on FLINK-1992:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30592697
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/optimization/GradientDescentITSuite.scala
 ---
@@ -205,6 +190,61 @@ class GradientDescentITSuite extends FlatSpec with 
Matchers with FlinkTestBase {
 
   }
 
-  // TODO: Need more corner cases
+  it should "terminate early if the convergence criterion is reached" in {
+// TODO(tvas): We need a better way to check the convergence of the 
weights.
+// Ideally we want to have a Breeze-like system, where the optimizers 
carry a history and that
+// can tell us whether we have converged and at which iteration
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+env.setParallelism(2)
+
+val sgdEarlyTerminate = GradientDescent()
+  .setConvergenceThreshold(1e2)
+  .setStepsize(1.0)
+  .setIterations(800)
+  .setLossFunction(SquaredLoss())
+  .setRegularizationType(NoRegularization())
+  .setRegularizationParameter(0.0)
+
+val inputDS = env.fromCollection(data)
+
+val weightDSEarlyTerminate = sgdEarlyTerminate.optimize(inputDS, None)
+
+val weightListEarly: Seq[WeightVector] = 
weightDSEarlyTerminate.collect()
+
+weightListEarly.size should equal(1)
+
+val weightVectorEarly: WeightVector = weightListEarly.head
+val weightsEarly = 
weightVectorEarly.weights.asInstanceOf[DenseVector].data
+val weight0Early = weightVectorEarly.intercept
+
+val sgdNoConvergence = GradientDescent()
+  .setStepsize(1.0)
--- End diff --

Here we run into a problem with the return type: we should be returning the 
runtime type. If we call setLossFunction first, we get back a Solver, which 
means we can no longer call the IterativeSolver methods.


> Add convergence criterion to SGD optimizer
> --
>
> Key: FLINK-1992
> URL: https://issues.apache.org/jira/browse/FLINK-1992
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>  Labels: ML
>
> Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
> would be good to support a dynamic convergence criterion, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2026] add a flag to indicate previous e...

2015-05-19 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/686#issuecomment-103463683
  
Pull request updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-1992] [ml] Add convergence criterion to...

2015-05-19 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30592697
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/optimization/GradientDescentITSuite.scala
 ---
@@ -205,6 +190,61 @@ class GradientDescentITSuite extends FlatSpec with 
Matchers with FlinkTestBase {
 
   }
 
-  // TODO: Need more corner cases
+  it should "terminate early if the convergence criterion is reached" in {
+// TODO(tvas): We need a better way to check the convergence of the 
weights.
+// Ideally we want to have a Breeze-like system, where the optimizers 
carry a history and that
+// can tell us whether we have converged and at which iteration
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+env.setParallelism(2)
+
+val sgdEarlyTerminate = GradientDescent()
+  .setConvergenceThreshold(1e2)
+  .setStepsize(1.0)
+  .setIterations(800)
+  .setLossFunction(SquaredLoss())
+  .setRegularizationType(NoRegularization())
+  .setRegularizationParameter(0.0)
+
+val inputDS = env.fromCollection(data)
+
+val weightDSEarlyTerminate = sgdEarlyTerminate.optimize(inputDS, None)
+
+val weightListEarly: Seq[WeightVector] = 
weightDSEarlyTerminate.collect()
+
+weightListEarly.size should equal(1)
+
+val weightVectorEarly: WeightVector = weightListEarly.head
+val weightsEarly = 
weightVectorEarly.weights.asInstanceOf[DenseVector].data
+val weight0Early = weightVectorEarly.intercept
+
+val sgdNoConvergence = GradientDescent()
+  .setStepsize(1.0)
--- End diff --

Here we run into a problem with the return type: we should be returning the 
runtime type. If we call setLossFunction first, we get back a Solver, which 
means we can no longer call the IterativeSolver methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
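
The return-type problem described above is the classic fluent-builder issue: 
a setter declared on the base class returns the base type, so 
subclass-specific setters become unreachable afterwards. One common remedy, 
sketched here in Java with a recursive type parameter (the class names are 
placeholders, not the flink-ml types; in Scala, declaring the setters to 
return this.type achieves the same effect):

{code}
// Placeholder names, not the flink-ml API. Each setter returns the
// concrete subtype, so chaining works regardless of call order.
abstract class Solver<SELF extends Solver<SELF>> {
    Object lossFunction;

    @SuppressWarnings("unchecked")
    SELF setLossFunction(Object lossFunction) {
        this.lossFunction = lossFunction;
        return (SELF) this; // the runtime type, not Solver
    }
}

class IterativeSolver extends Solver<IterativeSolver> {
    int iterations;

    IterativeSolver setIterations(int iterations) {
        this.iterations = iterations;
        return this;
    }
}

class BuilderDemo {
    public static void main(String[] args) {
        // Compiles in this order, which a plain Solver return type
        // would not allow:
        IterativeSolver s = new IterativeSolver()
            .setLossFunction("squared loss")
            .setIterations(800);
        System.out.println(s.iterations); // 800
    }
}
{code}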


[jira] [Commented] (FLINK-2025) Support booleans in CSV reader

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550328#comment-14550328
 ] 

ASF GitHub Bot commented on FLINK-2025:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/685#issuecomment-103461860
  
- Rebased to current master.
- Now comparing on byte arrays instead of creating a String object.
- All checks are performed case-insensitively.


> Support booleans in CSV reader
> --
>
> Key: FLINK-2025
> URL: https://issues.apache.org/jira/browse/FLINK-2025
> Project: Flink
>  Issue Type: New Feature
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>
> It would be great if Flink allowed reading booleans from CSV files, e.g. 1 
> for true and 0 for false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2025] add support for booleans in csv p...

2015-05-19 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/685#issuecomment-103461860
  
- Rebased to current master.
- Now comparing on byte arrays instead of creating a String object.
- All checks are performed case-insensitively.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
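
A minimal sketch of matching a token case-insensitively directly on the byte 
array, in the spirit of the change described above. The accepted tokens and 
the exact comparison used in the pull request are not shown in this excerpt, 
so the "true"/"false" tokens and the ASCII case-folding trick below are 
illustrative assumptions.

{code}
import java.nio.charset.StandardCharsets;

// Illustrative only: compare an ASCII token against a region of a byte
// buffer without allocating a String, ignoring case.
class ByteMatch {
    static boolean matchesIgnoreCase(byte[] bytes, int start, int len, byte[] token) {
        if (len != token.length) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            // OR-ing 0x20 folds ASCII 'A'-'Z' onto 'a'-'z'; safe here
            // because the tokens consist of letters only.
            if ((bytes[start + i] | 0x20) != (token[i] | 0x20)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] field = "TrUe".getBytes(StandardCharsets.US_ASCII);
        byte[] yes = "true".getBytes(StandardCharsets.US_ASCII);
        byte[] no = "false".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matchesIgnoreCase(field, 0, field.length, yes)); // true
        System.out.println(matchesIgnoreCase(field, 0, field.length, no));  // false
    }
}
{code}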


[jira] [Created] (FLINK-2047) Rename CoCoA to SVM

2015-05-19 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2047:
--

 Summary: Rename CoCoA to SVM
 Key: FLINK-2047
 URL: https://issues.apache.org/jira/browse/FLINK-2047
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Trivial
 Fix For: 0.9


The CoCoA algorithm, as implemented, functions as an SVM classifier.

As CoCoA mostly concerns the optimization process and not the actual learning 
algorithm, it makes sense to rename the learner to SVM, which users are more 
familiar with.

In the future we would like to use the CoCoA algorithm to solve more 
large-scale optimization problems for other learning algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2025) Support booleans in CSV reader

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550322#comment-14550322
 ] 

ASF GitHub Bot commented on FLINK-2025:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/685#discussion_r30592160
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/types/parser/BooleanValueParser.java 
---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.types.parser;
+
+import org.apache.flink.types.BooleanValue;
+
+public class BooleanValueParser extends FieldParser<BooleanValue> {
+
+   private BooleanParser parser = new BooleanParser();
+
+   private BooleanValue result;
+
+   @Override
+   public int parseField(byte[] bytes, int startPos, int limit, byte[] 
delim, BooleanValue reuse) {
+   int returnValue = parser.parseField(bytes, startPos, limit, 
delim, reuse.getValue());
+   setErrorState(parser.getErrorState());
+   result = new BooleanValue(parser.result);
--- End diff --

Ok, I changed that.


> Support booleans in CSV reader
> --
>
> Key: FLINK-2025
> URL: https://issues.apache.org/jira/browse/FLINK-2025
> Project: Flink
>  Issue Type: New Feature
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>
> It would be great if Flink allowed reading booleans from CSV files, e.g. 1 
> for true and 0 for false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2025] add support for booleans in csv p...

2015-05-19 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/685#discussion_r30592160
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/types/parser/BooleanValueParser.java 
---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.types.parser;
+
+import org.apache.flink.types.BooleanValue;
+
+public class BooleanValueParser extends FieldParser<BooleanValue> {
+
+   private BooleanParser parser = new BooleanParser();
+
+   private BooleanValue result;
+
+   @Override
+   public int parseField(byte[] bytes, int startPos, int limit, byte[] 
delim, BooleanValue reuse) {
+   int returnValue = parser.parseField(bytes, startPos, limit, 
delim, reuse.getValue());
+   setErrorState(parser.getErrorState());
+   result = new BooleanValue(parser.result);
--- End diff --

Ok, I changed that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2025) Support booleans in CSV reader

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550321#comment-14550321
 ] 

ASF GitHub Bot commented on FLINK-2025:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/685#discussion_r30592154
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/types/parser/BooleanParser.java ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.types.parser;
+
+public class BooleanParser extends FieldParser<Boolean> {
+
+   protected boolean result;
+
+   @Override
+   public int parseField(byte[] bytes, int startPos, int limit, byte[] 
delim, Boolean reuse) {
+   int i = startPos;
+
+   final int delimLimit = limit - delim.length + 1;
+
+   while (i < limit) {
+   if (i < delimLimit && delimiterNext(bytes, i, delim)) {
+   break;
+   }
+   i++;
+   }
+
+   String str = new String(bytes, startPos, i - startPos);
--- End diff --

Not sure how much performance gain we can expect, but I changed the pull 
request to work directly on the byte arrays.


> Support booleans in CSV reader
> --
>
> Key: FLINK-2025
> URL: https://issues.apache.org/jira/browse/FLINK-2025
> Project: Flink
>  Issue Type: New Feature
>  Components: Core
>Reporter: Sebastian Schelter
>Assignee: Maximilian Michels
>
> It would be great if Flink allowed reading booleans from CSV files, e.g. 1 
> for true and 0 for false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-2025] add support for booleans in csv p...

2015-05-19 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/685#discussion_r30592154
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/types/parser/BooleanParser.java ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.types.parser;
+
+public class BooleanParser extends FieldParser<Boolean> {
+
+   protected boolean result;
+
+   @Override
+   public int parseField(byte[] bytes, int startPos, int limit, byte[] 
delim, Boolean reuse) {
+   int i = startPos;
+
+   final int delimLimit = limit - delim.length + 1;
+
+   while (i < limit) {
+   if (i < delimLimit && delimiterNext(bytes, i, delim)) {
+   break;
+   }
+   i++;
+   }
+
+   String str = new String(bytes, startPos, i - startPos);
--- End diff --

Not sure how much performance gain we can expect, but I changed the pull 
request to work directly on the byte arrays.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented the class TwitterFilterSource

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim commented on the pull request:

https://github.com/apache/flink/pull/687#issuecomment-103460862
  
I will open another pull request.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: Implemented the class TwitterFilterSource

2015-05-19 Thread HilmiYildirim
Github user HilmiYildirim closed the pull request at:

https://github.com/apache/flink/pull/687


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550311#comment-14550311
 ] 

ASF GitHub Bot commented on FLINK-1992:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30591734
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/optimization/GradientDescent.scala
 ---
@@ -76,86 +77,163 @@ class GradientDescent(runParameters: ParameterMap) 
extends IterativeSolver {
 }.withBroadcastSet(currentWeights, WEIGHTVECTOR_BROADCAST)
   }
 
+
+
   /** Provides a solution for the given optimization problem
 *
 * @param data A Dataset of LabeledVector (label, features) pairs
-* @param initWeights The initial weights that will be optimized
+* @param initialWeights The initial weights that will be optimized
 * @return The weights, optimized for the provided data.
 */
   override def optimize(
 data: DataSet[LabeledVector],
-initWeights: Option[DataSet[WeightVector]]): DataSet[WeightVector] = {
-// TODO: Faster way to do this?
-val dimensionsDS = data.map(_.vector.size).reduce((a, b) => b)
-
-val numberOfIterations: Int = parameterMap(Iterations)
+initialWeights: Option[DataSet[WeightVector]]): DataSet[WeightVector] 
= {
+val numberOfIterations: Int = parameters(Iterations)
+// TODO(tvas): This looks out of place, why don't we get back an 
Option from
+// parameters(ConvergenceThreshold)?
+val convergenceThresholdOption = parameters.get(ConvergenceThreshold)
 
 // Initialize weights
-val initialWeightsDS: DataSet[WeightVector] = initWeights match {
-  // Ensure provided weight vector is a DenseVector
-  case Some(wvDS) => {
-wvDS.map{wv => {
-  val denseWeights = wv.weights match {
-case dv: DenseVector => dv
-case sv: SparseVector => sv.toDenseVector
+val initialWeightsDS: DataSet[WeightVector] = 
createInitialWeightsDS(initialWeights, data)
+
+// Perform the iterations
+val optimizedWeights = convergenceThresholdOption match {
+  // No convergence criterion
+  case None =>
+initialWeightsDS.iterate(numberOfIterations) {
+  weightVectorDS => {
+SGDStep(data, weightVectorDS)
   }
-  WeightVector(denseWeights, wv.intercept)
 }
-
+  case Some(convergence) =>
+/** Calculates the regularized loss, from the data and given 
weights **/
+def lossCalculation(data: DataSet[LabeledVector], weightDS: 
DataSet[WeightVector]):
+DataSet[Double] = {
+  data.map {
+new LossCalculation
+  }.withBroadcastSet(weightDS, WEIGHTVECTOR_BROADCAST)
+.reduce {
+(left, right) =>
+  val (leftLoss, leftCount) = left
+  val (rightLoss, rightCount) = right
+  (leftLoss + rightLoss, rightCount + leftCount)
+  }
+.map{new RegularizedLossCalculation}
+.withBroadcastSet(weightDS, WEIGHTVECTOR_BROADCAST)
 }
-  }
-  case None => createInitialWeightVector(dimensionsDS)
-}
-
-// Perform the iterations
-// TODO: Enable convergence stopping criterion, as in Multiple Linear 
regression
-initialWeightsDS.iterate(numberOfIterations) {
-  weightVector => {
-SGDStep(data, weightVector)
-  }
+// We have to calculate for each weight vector the sum of squared 
residuals,
+// and then sum them and apply regularization
+val initialLossSumDS = lossCalculation(data, initialWeightsDS)
+
+// Combine weight vector with the current loss
+val initialWeightsWithLossSum = initialWeightsDS.
+  crossWithTiny(initialLossSumDS).setParallelism(1)
+
+val resultWithLoss = initialWeightsWithLossSum.
+  iterateWithTermination(numberOfIterations) {
+  weightsWithLossSum =>
+
+// Extract weight vector and loss
+val previousWeightsDS = weightsWithLossSum.map{_._1}
+val previousLossSumDS = weightsWithLossSum.map{_._2}
+
+val currentWeightsDS = SGDStep(data, previousWeightsDS)
+
+val currentLossSumDS = lossCalculation(data, currentWeightsDS)
+
+// Check if the relative change in the loss is smaller than the
+// convergence threshold. If yes, then terminate i.e. return 
empty termination data set
+val termination = 
previousLossSumDS.crossWithTiny(currentLossSumDS).setParallelism(1).

[GitHub] flink pull request: [FLINK-1992] [ml] Add convergence criterion to...

2015-05-19 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/692#discussion_r30591734
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/optimization/GradientDescent.scala
 ---
@@ -76,86 +77,163 @@ class GradientDescent(runParameters: ParameterMap) 
extends IterativeSolver {
 }.withBroadcastSet(currentWeights, WEIGHTVECTOR_BROADCAST)
   }
 
+
+
   /** Provides a solution for the given optimization problem
 *
 * @param data A Dataset of LabeledVector (label, features) pairs
-* @param initWeights The initial weights that will be optimized
+* @param initialWeights The initial weights that will be optimized
 * @return The weights, optimized for the provided data.
 */
   override def optimize(
 data: DataSet[LabeledVector],
-initWeights: Option[DataSet[WeightVector]]): DataSet[WeightVector] = {
-// TODO: Faster way to do this?
-val dimensionsDS = data.map(_.vector.size).reduce((a, b) => b)
-
-val numberOfIterations: Int = parameterMap(Iterations)
+initialWeights: Option[DataSet[WeightVector]]): DataSet[WeightVector] 
= {
+val numberOfIterations: Int = parameters(Iterations)
+// TODO(tvas): This looks out of place, why don't we get back an 
Option from
+// parameters(ConvergenceThreshold)?
+val convergenceThresholdOption = parameters.get(ConvergenceThreshold)
 
 // Initialize weights
-val initialWeightsDS: DataSet[WeightVector] = initWeights match {
-  // Ensure provided weight vector is a DenseVector
-  case Some(wvDS) => {
-wvDS.map{wv => {
-  val denseWeights = wv.weights match {
-case dv: DenseVector => dv
-case sv: SparseVector => sv.toDenseVector
+val initialWeightsDS: DataSet[WeightVector] = 
createInitialWeightsDS(initialWeights, data)
+
+// Perform the iterations
+val optimizedWeights = convergenceThresholdOption match {
+  // No convergence criterion
+  case None =>
+initialWeightsDS.iterate(numberOfIterations) {
+  weightVectorDS => {
+SGDStep(data, weightVectorDS)
   }
-  WeightVector(denseWeights, wv.intercept)
 }
-
+  case Some(convergence) =>
+/** Calculates the regularized loss, from the data and given 
weights **/
+def lossCalculation(data: DataSet[LabeledVector], weightDS: 
DataSet[WeightVector]):
+DataSet[Double] = {
+  data.map {
+new LossCalculation
+  }.withBroadcastSet(weightDS, WEIGHTVECTOR_BROADCAST)
+.reduce {
+(left, right) =>
+  val (leftLoss, leftCount) = left
+  val (rightLoss, rightCount) = right
+  (leftLoss + rightLoss, rightCount + leftCount)
+  }
+.map{new RegularizedLossCalculation}
+.withBroadcastSet(weightDS, WEIGHTVECTOR_BROADCAST)
 }
-  }
-  case None => createInitialWeightVector(dimensionsDS)
-}
-
-// Perform the iterations
-// TODO: Enable convergence stopping criterion, as in Multiple Linear 
regression
-initialWeightsDS.iterate(numberOfIterations) {
-  weightVector => {
-SGDStep(data, weightVector)
-  }
+// We have to calculate for each weight vector the sum of squared 
residuals,
+// and then sum them and apply regularization
+val initialLossSumDS = lossCalculation(data, initialWeightsDS)
+
+// Combine weight vector with the current loss
+val initialWeightsWithLossSum = initialWeightsDS.
+  crossWithTiny(initialLossSumDS).setParallelism(1)
+
+val resultWithLoss = initialWeightsWithLossSum.
+  iterateWithTermination(numberOfIterations) {
+  weightsWithLossSum =>
+
+// Extract weight vector and loss
+val previousWeightsDS = weightsWithLossSum.map{_._1}
+val previousLossSumDS = weightsWithLossSum.map{_._2}
+
+val currentWeightsDS = SGDStep(data, previousWeightsDS)
+
+val currentLossSumDS = lossCalculation(data, currentWeightsDS)
+
+// Check if the relative change in the loss is smaller than the
+// convergence threshold. If yes, then terminate i.e. return 
empty termination data set
+val termination = 
previousLossSumDS.crossWithTiny(currentLossSumDS).setParallelism(1).
+  filter{
+  pair => {
+val (previousLoss, currentLoss) = pair
+
+if (previousLoss <= 0) {
+  false
+  
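
Both copies of the diff are truncated just before the actual comparison, but 
its comments state the intent: stop when the relative change in the 
regularized loss falls below the threshold. A self-contained sketch of that 
test follows; since the exact expression is cut off above, the formula 
|previous - current| / previous < threshold is an assumption. Note also that 
iterateWithTermination keeps iterating while the termination data set is 
non-empty, so the quoted filter expresses "not yet converged" rather than 
"converged".

{code}
// Sketch of the relative-change convergence test described in the
// diff's comments; the exact expression is an assumption (see above).
class ConvergenceSketch {
    static boolean converged(double previousLoss, double currentLoss, double threshold) {
        if (previousLoss <= 0) {
            // Guard: relative change is undefined for a non-positive
            // previous loss; the quoted filter also branches on this case.
            return false;
        }
        return Math.abs(previousLoss - currentLoss) / previousLoss < threshold;
    }

    public static void main(String[] args) {
        System.out.println(converged(100.0, 99.9, 1e-2)); // true: 0.1% change
        System.out.println(converged(100.0, 50.0, 1e-2)); // false: 50% change
    }
}
{code}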

[jira] [Commented] (FLINK-1992) Add convergence criterion to SGD optimizer

2015-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550294#comment-14550294
 ] 

ASF GitHub Bot commented on FLINK-1992:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/692#issuecomment-103457871
  
I think this is feature complete now. @tillrohrmann, please take a look when 
you have time.


> Add convergence criterion to SGD optimizer
> --
>
> Key: FLINK-1992
> URL: https://issues.apache.org/jira/browse/FLINK-1992
> Project: Flink
>  Issue Type: Improvement
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Theodore Vasiloudis
>  Labels: ML
>
> Currently, Flink's SGD optimizer runs for a fixed number of iterations. It 
> would be good to support a dynamic convergence criterion, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

