date:20150608

[jira] [Updated] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1

2015-06-08 Thread Sachin Goel (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Goel updated FLINK-2187:
---
Affects Version/s: 0.9

 KMeans clustering is not present in release-0.9-rc1
 ---

 Key: FLINK-2187
 URL: https://issues.apache.org/jira/browse/FLINK-2187
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9
Reporter: Sachin Goel

 The Flink ml readme 
 [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] 
 contains kmeans clustering in its description. However, the kmeans is not 
 part of the ML library yet. It is still only present in examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file

2015-06-08 Thread mjsax

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/796#issuecomment-110096269
  
No worries :) I am used to it. And I think it's good to ensure high code 
quality!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577588#comment-14577588
 ] 

ASF GitHub Bot commented on FLINK-2174:
---

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/796#issuecomment-110096269
  
No worries :) I am used to it. And I think it's good to ensure high code 
quality!


 Allow comments in 'slaves' file
 ---

 Key: FLINK-2174
 URL: https://issues.apache.org/jira/browse/FLINK-2174
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Trivial
 Fix For: 0.10


 Currently, each line in slaves in interpreded as a host name. Scripts should 
 skip lines starting with '#'. Also allow for comments at the end of a line 
 and skip empty lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1

2015-06-08 Thread Sachin Goel (JIRA)

Sachin Goel created FLINK-2187:
--

 Summary: KMeans clustering is not present in release-0.9-rc1
 Key: FLINK-2187
 URL: https://issues.apache.org/jira/browse/FLINK-2187
 Project: Flink
  Issue Type: Bug
Reporter: Sachin Goel


The Flink ml readme 
[https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] 
contains kmeans clustering in its description. However, the kmeans is not part 
of the ML library yet. It is still only present in examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577863#comment-14577863
 ] 

ASF GitHub Bot commented on FLINK-2093:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/807#issuecomment-110143059
  
Hi @shghatge, you don't need to close a PR in order to update it.
You can simply update (push or push --force into) the branch from which you 
created the PR and Github will automatically update the PR. This helps to have 
all comments about your implementation in one place.


 Add a difference method to Gelly's Graph class
 --

 Key: FLINK-2093
 URL: https://issues.apache.org/jira/browse/FLINK-2093
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 0.9
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 This method will compute the difference between two graphs, returning a new 
 graph containing the vertices and edges that the current graph and the input 
 graph don't have in common. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method

2015-06-08 Thread fhueske

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/807#issuecomment-110143059
  
Hi @shghatge, you don't need to close a PR in order to update it.
You can simply update (push or push --force into) the branch from which you 
created the PR and Github will automatically update the PR. This helps to have 
all comments about your implementation in one place.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2185) Rework semantics for .setSeed function of SVM

2015-06-08 Thread Dmitriy Lyubimov (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577893#comment-14577893
 ] 

Dmitriy Lyubimov commented on FLINK-2185:
-

is svm work in Flik non-linear or linear only?

 Rework semantics for .setSeed function of SVM
 -

 Key: FLINK-2185
 URL: https://issues.apache.org/jira/browse/FLINK-2185
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.10


 Currently setting the seed for the SVM does not have the expected result of 
 producing the same output for each run of the algorithm, potentially due to 
 the randomness in data partitioning.
 We should then rework the way the algorithm works to either ensure we get the 
 exact same results when the seed is set, or if that is not possible the 
 setSeed function should be removed.
 Also in the current implementation the default value of 0 for the seed means 
 that all runs for which we don't set the seed will get the same seed which is 
 not the intended behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-1601) Sometimes the YARNSessionFIFOITCase fails on Travis

2015-06-08 Thread Sachin Goel (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577841#comment-14577841
 ] 

Sachin Goel commented on FLINK-1601:


This test is failing again on travis build. The issue seems different however.
Here is the log:
https://api.travis-ci.org/jobs/65946506/log.txt?deansi=true


 Sometimes the YARNSessionFIFOITCase fails on Travis
 ---

 Key: FLINK-1601
 URL: https://issues.apache.org/jira/browse/FLINK-1601
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann
Assignee: Robert Metzger

 Sometimes the {{YARNSessionFIFOITCase}} fails on Travis with the following 
 exception.
 {code}
 Tests run: 8, Failures: 8, Errors: 0, Skipped: 0, Time elapsed: 71.375 sec 
  FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase
 perJobYarnCluster(org.apache.flink.yarn.YARNSessionFIFOITCase)  Time elapsed: 
 60.707 sec   FAILURE!
 java.lang.AssertionError: During the timeout period of 60 seconds the 
 expected string did not show up
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:315)
   at 
 org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnCluster(YARNSessionFIFOITCase.java:185)
 testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase)  Time elapsed: 
 0.507 sec   FAILURE!
 java.lang.AssertionError: There is at least one application on the cluster is 
 not finished
   at org.junit.Assert.fail(Assert.java:88)
   at 
 org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:146)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
   at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
   at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
   at 
 org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
   at 
 org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
   at 
 org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
   at 
 org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
 {code}
 The result is
 {code}
 Failed tests: 
   YARNSessionFIFOITCase.perJobYarnCluster:185-YarnTestBase.runWithArgs:315 
 During the timeout period of 60 seconds the expected string did not show up
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not finished
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not finished
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not finished
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not finished
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not finished
   YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least 
 one application on the cluster is not

[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...

2015-06-08 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/806


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method

2015-06-08 Thread shghatge

Github user shghatge commented on the pull request:

https://github.com/apache/flink/pull/807#issuecomment-110078878
  
Then it was just removing vertices! Talk about swatting a Fly with a 
Sledgehammer! I will do all the changes you suggested. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577486#comment-14577486
 ] 

ASF GitHub Bot commented on FLINK-2093:
---

Github user shghatge commented on the pull request:

https://github.com/apache/flink/pull/807#issuecomment-110078878
  
Then it was just removing vertices! Talk about swatting a Fly with a 
Sledgehammer! I will do all the changes you suggested. 


 Add a difference method to Gelly's Graph class
 --

 Key: FLINK-2093
 URL: https://issues.apache.org/jira/browse/FLINK-2093
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 0.9
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 This method will compute the difference between two graphs, returning a new 
 graph containing the vertices and edges that the current graph and the input 
 graph don't have in common. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577487#comment-14577487
 ] 

ASF GitHub Bot commented on FLINK-2093:
---

Github user shghatge closed the pull request at:

https://github.com/apache/flink/pull/807


 Add a difference method to Gelly's Graph class
 --

 Key: FLINK-2093
 URL: https://issues.apache.org/jira/browse/FLINK-2093
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 0.9
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 This method will compute the difference between two graphs, returning a new 
 graph containing the vertices and edges that the current graph and the input 
 graph don't have in common. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method

2015-06-08 Thread shghatge

Github user shghatge closed the pull request at:

https://github.com/apache/flink/pull/807


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Resolved] (FLINK-2174) Allow comments in 'slaves' file

2015-06-08 Thread Maximilian Michels (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels resolved FLINK-2174.
---
   Resolution: Implemented
Fix Version/s: 0.10

 Allow comments in 'slaves' file
 ---

 Key: FLINK-2174
 URL: https://issues.apache.org/jira/browse/FLINK-2174
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Trivial
 Fix For: 0.10


 Currently, each line in slaves in interpreded as a host name. Scripts should 
 skip lines starting with '#'. Also allow for comments at the end of a line 
 and skip empty lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file

2015-06-08 Thread mxm

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/796#issuecomment-110076740
  
Thank you! Sorry for giving you a hard time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577479#comment-14577479
 ] 

ASF GitHub Bot commented on FLINK-2174:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/796#issuecomment-110076740
  
Thank you! Sorry for giving you a hard time.


 Allow comments in 'slaves' file
 ---

 Key: FLINK-2174
 URL: https://issues.apache.org/jira/browse/FLINK-2174
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Trivial

 Currently, each line in slaves in interpreded as a host name. Scripts should 
 skip lines starting with '#'. Also allow for comments at the end of a line 
 and skip empty lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file

2015-06-08 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/796


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577482#comment-14577482
 ] 

ASF GitHub Bot commented on FLINK-2174:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/796


 Allow comments in 'slaves' file
 ---

 Key: FLINK-2174
 URL: https://issues.apache.org/jira/browse/FLINK-2174
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Trivial

 Currently, each line in slaves in interpreded as a host name. Scripts should 
 skip lines starting with '#'. Also allow for comments at the end of a line 
 and skip empty lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (FLINK-1916) EOFException when running delta-iteration job

2015-06-08 Thread Maximilian Michels (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels updated FLINK-1916:
--
Affects Version/s: 0.10

 EOFException when running delta-iteration job
 -

 Key: FLINK-1916
 URL: https://issues.apache.org/jira/browse/FLINK-1916
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9, 0.10
 Environment: 0.9-milestone-1
 Exception on the cluster, local execution works
Reporter: Stefan Bunk

 The delta-iteration program in [1] ends with an
 java.io.EOFException
   at 
 org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
   at 
 org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
   at 
 org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
   at 
 org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291)
   at 
 org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307)
   at 
 org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62)
   at 
 org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26)
   at 
 org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89)
   at 
 org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29)
   at 
 org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219)
   at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
   at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
   at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
   at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
   at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
   at 
 org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
   at java.lang.Thread.run(Thread.java:745)
 For logs and the accompanying mailing list discussion see below.
 When running with slightly different memory configuration, as hinted on the 
 mailing list, I sometimes also get this exception:
 19.Apr. 13:39:29 INFO  Task - IterationHead(WorksetIteration 
 (Resolved-Redirects)) (10/10) switched to FAILED : 
 java.lang.IndexOutOfBoundsException: Index: 161, Size: 161
 at java.util.ArrayList.rangeCheck(ArrayList.java:635)
 at java.util.ArrayList.get(ArrayList.java:411)
 at 
 org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352)
 at 
 org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301)
 at 
 org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226)
 at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
 at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
 at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
 at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
 at 
 org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
 at java.lang.Thread.run(Thread.java:745)
 [1] Flink job: https://gist.github.com/knub/0bfec859a563009c1d57
 [2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4
 [3] One task manager's logs: https://gist.github.com/knub/8f2f953da95c8d7adefc
 [4] 
 http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-when-running-Flink-job-td1092.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576642#comment-14576642
 ] 

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on the pull request:

https://github.com/apache/flink/pull/729#issuecomment-109875609
  
Great news :) Thanks Ufuk!


 Add static code analysis for UDFs
 -

 Key: FLINK-1319
 URL: https://issues.apache.org/jira/browse/FLINK-1319
 Project: Flink
  Issue Type: New Feature
  Components: Java API, Scala API
Reporter: Stephan Ewen
Assignee: Timo Walther
Priority: Minor

 Flink's Optimizer takes information that tells it for UDFs which fields of 
 the input elements are accessed, modified, or frwarded/copied. This 
 information frequently helps to reuse partitionings, sorts, etc. It may speed 
 up programs significantly, as it can frequently eliminate sorts and shuffles, 
 which are costly.
 Right now, users can add lightweight annotations to UDFs to provide this 
 information (such as adding {{@ConstandFields(0-3, 1, 2-1)}}.
 We worked with static code analysis of UDFs before, to determine this 
 information automatically. This is an incredible feature, as it magically 
 makes programs faster.
 For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this 
 works surprisingly well in many cases. We used the Soot toolkit for the 
 static code analysis. Unfortunately, Soot is LGPL licensed and thus we did 
 not include any of the code so far.
 I propose to add this functionality to Flink, in the form of a drop-in 
 addition, to work around the LGPL incompatibility with ALS 2.0. Users could 
 simply download a special flink-code-analysis.jar and drop it into the 
 lib folder to enable this functionality. We may even add a script to 
 tools that downloads that library automatically into the lib folder. This 
 should be legally fine, since we do not redistribute LGPL code and only 
 dynamically link it (the incompatibility with ASL 2.0 is mainly in the 
 patentability, if I remember correctly).
 Prior work on this has been done by [~aljoscha] and [~skunert], which could 
 provide a code base to start with.
 *Appendix*
 Hompage to Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
 Papers on static analysis and for optimization: 
 http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and 
 http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
 Quick introduction to the Optimizer: 
 http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf 
 (Section 6)
 Optimizer for Iterations: 
 http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf 
 (Sections 4.3 and 5.3)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...

2015-06-08 Thread twalthr

Github user twalthr commented on the pull request:

https://github.com/apache/flink/pull/729#issuecomment-109875609
  
Great news :) Thanks Ufuk!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2080) Execute Flink with sbt

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576699#comment-14576699
 ] 

ASF GitHub Bot commented on FLINK-2080:
---

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/787#issuecomment-109893575
  
Merging.


 Execute Flink with sbt
 --

 Key: FLINK-2080
 URL: https://issues.apache.org/jira/browse/FLINK-2080
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.8.1
Reporter: Christian Wuertz
Assignee: Christian Wuertz
Priority: Minor
 Attachments: sbt.patch


 I tried to execute some of the flink example applications on my local machine 
 using sbt. To get this running without class loading issues it was important 
 to make sure that Flink is executed in its own JVM and not in the sbt JVM. 
 This can be done very easily, but it would have been nice to know that in 
 advance. So maybe you guys want to add this to the Flink documentation.
 An example can be found here: https://github.com/Teots/flink-sbt
 (The trick was to add fork in run := true to the build.sbt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (FLINK-2181) SessionWindowing example has a memleak

2015-06-08 Thread Gabor Gevay (JIRA)

Gabor Gevay created FLINK-2181:
--

 Summary: SessionWindowing example has a memleak
 Key: FLINK-2181
 URL: https://issues.apache.org/jira/browse/FLINK-2181
 Project: Flink
  Issue Type: Bug
  Components: Examples, Streaming
Reporter: Gabor Gevay
Assignee: Gabor Gevay


The trigger policy objects belonging to already terminated sessions are kept in 
memory, and also NotifyOnLastGlobalElement gets called on them. This causes the 
program to eat up more and more memory, and also to get slower with time.

Mailing list discussion:
http://mail-archives.apache.org/mod_mbox/flink-dev/201505.mbox/%3CCADXjeyBW9uaA=ayde+9r+qh85pblyavuegsr7h8pwkyntm9...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...

2015-06-08 Thread gaborhermann

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109894121
  
Thanks for merging. I removed the line that failed the test (I mistakenly 
checked a non-deterministic id).
https://github.com/gaborhermann/flink/commits/FLINK-2136


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2136) Test the streaming scala API

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576705#comment-14576705
 ] 

ASF GitHub Bot commented on FLINK-2136:
---

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109894121
  
Thanks for merging. I removed the line that failed the test (I mistakenly 
checked a non-deterministic id).
https://github.com/gaborhermann/flink/commits/FLINK-2136


 Test the streaming scala API
 

 Key: FLINK-2136
 URL: https://issues.apache.org/jira/browse/FLINK-2136
 Project: Flink
  Issue Type: Test
  Components: Scala API, Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 There are no test covering the streaming scala API. I would suggest to test 
 whether the StreamGraph created by a certain operation looks as expected. 
 Deeper layers and runtime should not be tested here, that is done in 
 streaming-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894690
  
--- Diff: docs/libs/ml/minMax_scaler.md ---
@@ -0,0 +1,113 @@
+---
+mathjax: include
+htmlTitle: FlinkML - MinMax Scaler
+title: a href=../mlFlinkML/a - MinMax Scaler
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie 
between a user specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value 
for the scaling range, the MinMax scaler transforms the features of the input 
data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = min({x_1, x_2,..., x_n})$$
+
+ and maximum value:
+
+ $$x_{max} = max({x_1, x_2,..., x_n})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min 
\right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user specified minimum and 
maximum values of the range to scale.
+
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operation.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T : Vector]: DataSet[T] = Unit`
+* `fit: DataSet[LabeledVector] = Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into 
the respective type:
+
+* `transform[T : Vector]: DataSet[T] = DataSet[T]`
+* `transform: DataSet[LabeledVector] = DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two 
parameters:
+
+ table class=table table-bordered
+  thead
+tr
+  th class=text-left style=width: 20%Parameters/th
+  th class=text-centerDescription/th
+/tr
+  /thead
+
+  tbody
+tr
+  tdstrongMin/strong/td
+  td
+p
+  The minimum value of the range for the scaled data set. (Default 
value: strong0.0/strong)
+/p
+  /td
+/tr
+tr
+  tdstrongStd/strong/td
--- End diff --

Shouldn't this be called `Max`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2153] Exclude Hbase annotations

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/800#issuecomment-109893732
  
I prefer 2 warnings to a sometimes error. :) Merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2153) Exclude dependency on hbase annotations module

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576701#comment-14576701
 ] 

ASF GitHub Bot commented on FLINK-2153:
---

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/800#issuecomment-109893732
  
I prefer 2 warnings to a sometimes error. :) Merging.


 Exclude dependency on hbase annotations module
 --

 Key: FLINK-2153
 URL: https://issues.apache.org/jira/browse/FLINK-2153
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Márton Balassi

 [ERROR] Failed to execute goal on project flink-hbase: Could not resolve
 dependencies for project org.apache.flink:flink-hbase:jar:0.9-SNAPSHOT:
 Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
 /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
 There is a Spark issue for this [1] with a solution [2].
 [1] https://issues.apache.org/jira/browse/SPARK-4455
 [2] https://github.com/apache/spark/pull/3286/files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)

2015-06-08 Thread Nikolaas Steenbergen (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolaas Steenbergen reassigned FLINK-2161:
---

Assignee: Nikolaas Steenbergen

 Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
 --

 Key: FLINK-2161
 URL: https://issues.apache.org/jira/browse/FLINK-2161
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Nikolaas Steenbergen

 Currently, there is no easy way to load and ship external libraries/jars with 
 Flink's Scala shell. Assume that you want to run some Gelly graph algorithms 
 from within the Scala shell, then you have to put the Gelly jar manually in 
 the lib directory and make sure that this jar is also available on your 
 cluster, because it is not shipped with the user code. 
 It would be good to have a simple mechanism how to specify additional jars 
 upon startup of the Scala shell. These jars should then also be shipped to 
 the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: Add CONTRIBUTING.md file with pointers to our ...

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/778#issuecomment-109893654
  
Merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2080] Document how to use Flink with sb...

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/787#issuecomment-109893575
  
Merging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2136) Test the streaming scala API

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576755#comment-14576755
 ] 

ASF GitHub Bot commented on FLINK-2136:
---

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109909701
  
Of course. I took up the JIRA issue.


 Test the streaming scala API
 

 Key: FLINK-2136
 URL: https://issues.apache.org/jira/browse/FLINK-2136
 Project: Flink
  Issue Type: Test
  Components: Scala API, Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 There are no test covering the streaming scala API. I would suggest to test 
 whether the StreamGraph created by a certain operation looks as expected. 
 Deeper layers and runtime should not be tested here, that is done in 
 streaming-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...

2015-06-08 Thread gaborhermann

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109909701
  
Of course. I took up the JIRA issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2180) Streaming iterate test fails spuriously

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576758#comment-14576758
 ] 

ASF GitHub Bot commented on FLINK-2180:
---

GitHub user gaborhermann opened a pull request:

https://github.com/apache/flink/pull/802

[wip] [FLINK-2180] [streaming] Iteration test fix

* Fixed the iteration test at DataStreamTest
* Will add more tests for DataStream
* Will fix IterateTest

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gaborhermann/flink FLINK-2180

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/802.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #802


commit bf612f4e4133712fc454ab0417aa563e08a526b4
Author: Gábor Hermann reckone...@gmail.com
Date:   2015-06-08T07:37:40Z

[streaming] Added iteration test for DataStream




 Streaming iterate test fails spuriously
 ---

 Key: FLINK-2180
 URL: https://issues.apache.org/jira/browse/FLINK-2180
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 Following output seen occasionally: 
 Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.667 sec  
 FAILURE! - in org.apache.flink.streaming.api.IterateTest
 test(org.apache.flink.streaming.api.IterateTest)  Time elapsed: 3.662 sec  
  FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at org.apache.flink.streaming.api.IterateTest.test(IterateTest.java:154)
 See: https://travis-ci.org/mbalassi/flink/jobs/65803465



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895785
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of Flink's MinMax Scaler
+
+  import MinMaxScalerData._
+
+  it should scale the vectors' values to be restricted in the (0.0,1.0) 
range in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector - scaledVectors) {
+  val test = vector.asBreeze.forall(fv = {
+fv = 0.0  fv = 1.0
--- End diff --

Maybe we could not only compare whether the data lies between `0` and `1` 
but also whether the vectors have been correctly scaled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-1981] add support for GZIP files

2015-06-08 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/762


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-1981) Add GZip support

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576760#comment-14576760
 ] 

ASF GitHub Bot commented on FLINK-1981:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/762#issuecomment-109912571
  
Thank you for your contribution.


 Add GZip support
 

 Key: FLINK-1981
 URL: https://issues.apache.org/jira/browse/FLINK-1981
 Project: Flink
  Issue Type: New Feature
  Components: Core
Reporter: Sebastian Kruse
Assignee: Sebastian Kruse
Priority: Minor

 GZip, as a commonly used compression format, should be supported in the same 
 way as the already supported deflate files. This allows to use GZip files 
 with any subclass of FileInputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-1981) Add GZip support

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576759#comment-14576759
 ] 

ASF GitHub Bot commented on FLINK-1981:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/762


 Add GZip support
 

 Key: FLINK-1981
 URL: https://issues.apache.org/jira/browse/FLINK-1981
 Project: Flink
  Issue Type: New Feature
  Components: Core
Reporter: Sebastian Kruse
Assignee: Sebastian Kruse
Priority: Minor

 GZip, as a commonly used compression format, should be supported in the same 
 way as the already supported deflate files. This allows to use GZip files 
 with any subclass of FileInputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-1981] add support for GZIP files

2015-06-08 Thread mxm

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/762#issuecomment-109912571
  
Thank you for your contribution.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895797
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of Flink's MinMax Scaler
+
+  import MinMaxScalerData._
+
+  it should scale the vectors' values to be restricted in the (0.0,1.0) 
range in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector - scaledVectors) {
+  val test = vector.asBreeze.forall(fv = {
+fv = 0.0  fv = 1.0
+  })
+  test shouldEqual (true)
+}
+
+  }
+
+  it should scale vectors' values in the (-1.0,1.0) range in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data2)
+val minMaxScaler = MinMaxScaler().setMin(-1.0).setMax(1.0)
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data2.length)
+
+for (labeledVector - scaledVectors) {
+  val test = labeledVector.vector.asBreeze.forall(lv = {
+lv = -1.0  lv = 1.0
--- End diff --

The same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576766#comment-14576766
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896248
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

are the inputs called predictors?


 Add a quickstart guide for FlinkML
 --

 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.9


 We need a quickstart guide that introduces users to the core concepts of 
 FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576773#comment-14576773
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896759
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896733
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896759
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896845
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897514
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
--- End diff --

Will add reference


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576799#comment-14576799
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
--- End diff --

Good catch, will add.


 Add a quickstart guide for FlinkML
 --

 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.9


 We need a quickstart guide that introduces users to the core concepts of 
 FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897546
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
--- End diff --

Yup, will remove


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Closed] (FLINK-2080) Execute Flink with sbt

2015-06-08 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/FLINK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi closed FLINK-2080.
-
   Resolution: Fixed
Fix Version/s: 0.9

Implemented via 3e9af82.

 Execute Flink with sbt
 --

 Key: FLINK-2080
 URL: https://issues.apache.org/jira/browse/FLINK-2080
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 0.8.1
Reporter: Christian Wuertz
Assignee: Christian Wuertz
Priority: Minor
 Fix For: 0.9

 Attachments: sbt.patch


 I tried to execute some of the flink example applications on my local machine 
 using sbt. To get this running without class loading issues it was important 
 to make sure that Flink is executed in its own JVM and not in the sbt JVM. 
 This can be done very easily, but it would have been nice to know that in 
 advance. So maybe you guys want to add this to the Flink documentation.
 An example can be found here: https://github.com/Teots/flink-sbt
 (The trick was to add fork in run := true to the build.sbt)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576798#comment-14576798
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897514
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
--- End diff --

Will add reference


 Add a quickstart guide for FlinkML
 --

 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.9


 We need a quickstart guide that introduces users to the core concepts of 
 FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
--- End diff --

Good catch, will add.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897818
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2116] [ml] Reusing predict operation fo...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/772#issuecomment-109926023
  
I created a [JIRA 
ticket](https://issues.apache.org/jira/browse/FLINK-2157?filter=12331885) for 
the evaluation of predictors. I think we should deal with it separately in 
order to keep the PR small and reviewable. But I agree, this is an important 
feature for any ML library.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31899145
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of Flink's MinMax Scaler
+
+  import MinMaxScalerData._
+
+  it should scale the vectors' values to be restricted in the (0.0,1.0) 
range in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector - scaledVectors) {
+  val test = vector.asBreeze.forall(fv = {
+fv = 0.0  fv = 1.0
--- End diff --

Yes it is. This assures that if someone changes something of the 
`Transformer` logic, then he will see an error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Clean up naming of State/Checkpoint Interfaces

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/671#issuecomment-109933236
  
Sorry, I thought we had this mostly in already. My bad.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576999#comment-14576999
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902243
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply

[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576998#comment-14576998
 ] 

ASF GitHub Bot commented on FLINK-2054:
---

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109944731
  
Ok, I'm working on it.


 StreamOperator rework removed copy calls when passing output to a chained 
 operator
 --

 Key: FLINK-2054
 URL: https://issues.apache.org/jira/browse/FLINK-2054
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Aljoscha Krettek
Priority: Blocker

 Before the recent rework of stream operators to be push based, operators held 
 the semantics that any input (and also output to be specific) will not be 
 mutated afterwards. This was achieved by simply copying records that were 
 passed to other chained operators.
 This feature has been removed thus introducing a major break in the operator 
 mutability guarantees. 
 To make chaining viable in all cases (and to prevent hidden bugs) we need to 
 reintroduce the copying logic. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577040#comment-14577040
 ] 

ASF GitHub Bot commented on FLINK-2054:
---

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109955810
  
I updated it. By default copying is on now. System operators can request to 
override the reuse setting.



 StreamOperator rework removed copy calls when passing output to a chained 
 operator
 --

 Key: FLINK-2054
 URL: https://issues.apache.org/jira/browse/FLINK-2054
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Aljoscha Krettek
Priority: Blocker

 Before the recent rework of stream operators to be push based, operators held 
 the semantics that any input (and also output to be specific) will not be 
 mutated afterwards. This was achieved by simply copying records that were 
 passed to other chained operators.
 This feature has been removed thus introducing a major break in the operator 
 mutability guarantees. 
 To make chaining viable in all cases (and to prevent hidden bugs) we need to 
 reintroduce the copying logic. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...

2015-06-08 Thread aljoscha

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109955810
  
I updated it. By default copying is on now. System operators can request to 
override the reuse setting.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread fobeligi

Github user fobeligi commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894794
  
--- Diff: docs/libs/ml/minMax_scaler.md ---
@@ -0,0 +1,113 @@
+---
+mathjax: include
+htmlTitle: FlinkML - MinMax Scaler
+title: a href=../mlFlinkML/a - MinMax Scaler
+---
+!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+License); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+--
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie 
between a user specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value 
for the scaling range, the MinMax scaler transforms the features of the input 
data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = min({x_1, x_2,..., x_n})$$
+
+ and maximum value:
+
+ $$x_{max} = max({x_1, x_2,..., x_n})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min 
\right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user specified minimum and 
maximum values of the range to scale.
+
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operation.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T : Vector]: DataSet[T] = Unit`
+* `fit: DataSet[LabeledVector] = Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into 
the respective type:
+
+* `transform[T : Vector]: DataSet[T] = DataSet[T]`
+* `transform: DataSet[LabeledVector] = DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two 
parameters:
+
+ table class=table table-bordered
+  thead
+tr
+  th class=text-left style=width: 20%Parameters/th
+  th class=text-centerDescription/th
+/tr
+  /thead
+
+  tbody
+tr
+  tdstrongMin/strong/td
+  td
+p
+  The minimum value of the range for the scaled data set. (Default 
value: strong0.0/strong)
+/p
+  /td
+/tr
+tr
+  tdstrongStd/strong/td
--- End diff --

Yes, you are right!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894783
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T : Vector] = new 
FitOperation[MinMaxScaler,

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/798#issuecomment-109913485
  
Really good work @fobeligi. The code is really well structured and 
documented. I had only some minor comments. When you have them addressed, I 
think it's good to be merged :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896248
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

are the inputs called predictors?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576778#comment-14576778
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896948
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896965
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576779#comment-14576779
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896965
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896996
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576795#comment-14576795
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897426
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

It's more of a statistics terminology, see 
[synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms).
 In ML features is more common so I will change it to that.


 Add a quickstart guide for FlinkML
 --

 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.9


 We need a quickstart guide that introduces users to the core concepts of 
 FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897426
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

It's more of a statistics terminology, see 
[synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms).
 In ML features is more common so I will change it to that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576818#comment-14576818
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31898968
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply

[GitHub] flink pull request: Clean up naming of State/Checkpoint Interfaces

2015-06-08 Thread aljoscha

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/671#issuecomment-109932208
  
Why? There is still a naming difference between methods that to similar 
things at different levels.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...

2015-06-08 Thread gaborhermann

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109955074
  
As a note: I've just run into this issue at chained streaming aggregations 
when writing tests (not copying created wrong results). I will check again 
after this has been solved.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577035#comment-14577035
 ] 

ASF GitHub Bot commented on FLINK-2054:
---

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109955074
  
As a note: I've just run into this issue at chained streaming aggregations 
when writing tests (not copying created wrong results). I will check again 
after this has been solved.


 StreamOperator rework removed copy calls when passing output to a chained 
 operator
 --

 Key: FLINK-2054
 URL: https://issues.apache.org/jira/browse/FLINK-2054
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Aljoscha Krettek
Priority: Blocker

 Before the recent rework of stream operators to be push based, operators held 
 the semantics that any input (and also output to be specific) will not be 
 mutated afterwards. This was achieved by simply copying records that were 
 passed to other chained operators.
 This feature has been removed thus introducing a major break in the operator 
 mutability guarantees. 
 To make chaining viable in all cases (and to prevent hidden bugs) we need to 
 reintroduce the copying logic. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895117
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T : Vector] = new 
FitOperation[MinMaxScaler,

[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109907919
  
Please open a PR with it. Could you give FLINK-2180 a try in the meantime 
please? It looks similar on paper. You can add then to the same pull request of 
course. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2136) Test the streaming scala API

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576747#comment-14576747
 ] 

ASF GitHub Bot commented on FLINK-2136:
---

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/771#issuecomment-109907919
  
Please open a PR with it. Could you give FLINK-2180 a try in the meantime 
please? It looks similar on paper. You can add then to the same pull request of 
course. :)


 Test the streaming scala API
 

 Key: FLINK-2136
 URL: https://issues.apache.org/jira/browse/FLINK-2136
 Project: Flink
  Issue Type: Test
  Components: Scala API, Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 There are no test covering the streaming scala API. I would suggest to test 
 whether the StreamGraph created by a certain operation looks as expected. 
 Deeper layers and runtime should not be tested here, that is done in 
 streaming-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (FLINK-2180) Streaming iterate test fails spuriously

2015-06-08 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/FLINK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gábor Hermann reassigned FLINK-2180:


Assignee: Gábor Hermann

 Streaming iterate test fails spuriously
 ---

 Key: FLINK-2180
 URL: https://issues.apache.org/jira/browse/FLINK-2180
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 Following output seen occasionally: 
 Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.667 sec  
 FAILURE! - in org.apache.flink.streaming.api.IterateTest
 test(org.apache.flink.streaming.api.IterateTest)  Time elapsed: 3.662 sec  
  FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at org.apache.flink.streaming.api.IterateTest.test(IterateTest.java:154)
 See: https://travis-ci.org/mbalassi/flink/jobs/65803465



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576762#comment-14576762
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896076
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
--- End diff --

period missing


 Add a quickstart guide for FlinkML
 --

 Key: FLINK-2072
 URL: https://issues.apache.org/jira/browse/FLINK-2072
 Project: Flink
  Issue Type: New Feature
  Components: Documentation, Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
 Fix For: 0.9


 We need a quickstart guide that introduces users to the core concepts of 
 FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576770#comment-14576770
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576780#comment-14576780
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896996
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897731
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576803#comment-14576803
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897731
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply

[jira] [Resolved] (FLINK-1685) Document how to read gzip/compressed files with Flink

2015-06-08 Thread Maximilian Michels (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels resolved FLINK-1685.
---
   Resolution: Fixed
Fix Version/s: 0.9

 Document how to read gzip/compressed files with Flink
 -

 Key: FLINK-1685
 URL: https://issues.apache.org/jira/browse/FLINK-1685
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Robert Metzger
Assignee: Sebastian Kruse
 Fix For: 0.9


 Too many users asked for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (FLINK-1981) Add GZip support

2015-06-08 Thread Maximilian Michels (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels closed FLINK-1981.
-
   Resolution: Implemented
Fix Version/s: 0.9

 Add GZip support
 

 Key: FLINK-1981
 URL: https://issues.apache.org/jira/browse/FLINK-1981
 Project: Flink
  Issue Type: New Feature
  Components: Core
Reporter: Sebastian Kruse
Assignee: Sebastian Kruse
Priority: Minor
 Fix For: 0.9


 GZip, as a commonly used compression format, should be supported in the same 
 way as the already supported deflate files. This allows to use GZip files 
 with any subclass of FileInputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [contrib] Storm compatibility

2015-06-08 Thread mbalassi

Github user mbalassi commented on the pull request:

https://github.com/apache/flink/pull/764#issuecomment-109927735
  
Hey @mjsax, now the changes are on the master, can you make the changes on 
your side please?
Then it is ready to merge on my side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576817#comment-14576817
 ] 

ASF GitHub Bot commented on FLINK-2054:
---

GitHub user aljoscha opened a pull request:

https://github.com/apache/flink/pull/803

[FLINK-2054] Add object-reuse switch for streaming

The switch was already there: enableObjectReuse() in ExecutionConfig.
This was simply not considered by the streaming runtime. This change now
draws a copy before forwarding an element to a chained operator when
object reuse is disabled.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aljoscha/flink streaming-reuse-switch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/803.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #803


commit c1896ef6355f47f89215acff7eba9f944c461196
Author: Aljoscha Krettek aljoscha.kret...@gmail.com
Date:   2015-06-08T09:32:27Z

[FLINK-2054] Add object-reuse switch for streaming

The switch was already there: enableObjectReuse() in ExecutionConfig.
This was simply not considered by the streaming runtime. This change now
draws a copy before forwarding an element to a chained operator when
object reuse is disabled.




 StreamOperator rework removed copy calls when passing output to a chained 
 operator
 --

 Key: FLINK-2054
 URL: https://issues.apache.org/jira/browse/FLINK-2054
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Aljoscha Krettek
Priority: Blocker

 Before the recent rework of stream operators to be push based, operators held 
 the semantics that any input (and also output to be specific) will not be 
 mutated afterwards. This was achieved by simply copying records that were 
 passed to other chained operators.
 This feature has been removed thus introducing a major break in the operator 
 mutability guarantees. 
 To make chaining viable in all cases (and to prevent hidden bugs) we need to 
 reintroduce the copying logic. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...

2015-06-08 Thread aljoscha

GitHub user aljoscha opened a pull request:

https://github.com/apache/flink/pull/803

[FLINK-2054] Add object-reuse switch for streaming

The switch was already there: enableObjectReuse() in ExecutionConfig.
This was simply not considered by the streaming runtime. This change now
draws a copy before forwarding an element to a chained operator when
object reuse is disabled.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aljoscha/flink streaming-reuse-switch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/803.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #803


commit c1896ef6355f47f89215acff7eba9f944c461196
Author: Aljoscha Krettek aljoscha.kret...@gmail.com
Date:   2015-06-08T09:32:27Z

[FLINK-2054] Add object-reuse switch for streaming

The switch was already there: enableObjectReuse() in ExecutionConfig.
This was simply not considered by the streaming runtime. This change now
draws a copy before forwarding an element to a chained operator when
object reuse is disabled.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)

2015-06-08 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576837#comment-14576837
 ] 

Aljoscha Krettek commented on FLINK-2161:
-

The Scala REPL has support for adding jars at runtime. We would just need to 
add support for sending them along with the compiled classes.

 Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
 --

 Key: FLINK-2161
 URL: https://issues.apache.org/jira/browse/FLINK-2161
 Project: Flink
  Issue Type: Improvement
Reporter: Till Rohrmann
Assignee: Nikolaas Steenbergen

 Currently, there is no easy way to load and ship external libraries/jars with 
 Flink's Scala shell. Assume that you want to run some Gelly graph algorithms 
 from within the Scala shell, then you have to put the Gelly jar manually in 
 the lib directory and make sure that this jar is also available on your 
 cluster, because it is not shipped with the user code. 
 It would be good to have a simple mechanism how to specify additional jars 
 upon startup of the Scala shell. These jars should then also be shipped to 
 the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-1844) Add Normaliser to ML library

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576962#comment-14576962
 ] 

ASF GitHub Bot commented on FLINK-1844:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/798#issuecomment-109937398
  
Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser 
to ML library* so that JIRA picks up on the issue.


 Add Normaliser to ML library
 

 Key: FLINK-1844
 URL: https://issues.apache.org/jira/browse/FLINK-1844
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Faye Beligianni
Assignee: Faye Beligianni
Priority: Minor
  Labels: ML, Starter

 In many algorithms in ML, the features' values would be better to lie between 
 a given range of values, usually in the range (0,1) [1]. Therefore, a 
 {{Transformer}} could be implemented to achieve that normalisation.
 Resources: 
 [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread thvasilo

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/798#issuecomment-109937398
  
Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser 
to ML library* so that JIRA picks up on the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902170
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902243
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
--- End diff --

My thoughts were that we will provide the whole

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576997#comment-14576997
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902179
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902179
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a

[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...

2015-06-08 Thread aljoscha

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109944731
  
Ok, I'm working on it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577045#comment-14577045
 ] 

ASF GitHub Bot commented on FLINK-2054:
---

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/803#issuecomment-109956597
  
Thanks. It's working now.


 StreamOperator rework removed copy calls when passing output to a chained 
 operator
 --

 Key: FLINK-2054
 URL: https://issues.apache.org/jira/browse/FLINK-2054
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Aljoscha Krettek
Priority: Blocker

 Before the recent rework of stream operators to be push based, operators held 
 the semantics that any input (and also output to be specific) will not be 
 mutated afterwards. This was achieved by simply copying records that were 
 passed to other chained operators.
 This feature has been removed thus introducing a major break in the operator 
 mutability guarantees. 
 To make chaining viable in all cases (and to prevent hidden bugs) we need to 
 reintroduce the copying logic. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[GitHub] flink pull request: [contrib] Storm compatibility

2015-06-08 Thread mjsax

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/764#issuecomment-109940058
  
It should be possible to put your last commit on top, after the rebase is 
done. I will do it in a new branch to preserve this one and cherry pick your 
last commit at the end.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[jira] [Created] (FLINK-2182) Add stateful Streaming Sequence Source

2015-06-08 Thread Aljoscha Krettek (JIRA)

Aljoscha Krettek created FLINK-2182:
---

 Summary: Add stateful Streaming Sequence Source
 Key: FLINK-2182
 URL: https://issues.apache.org/jira/browse/FLINK-2182
 Project: Flink
  Issue Type: Improvement
  Components: eaming, Streaming
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

2015-06-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576996#comment-14576996
 ] 

ASF GitHub Bot commented on FLINK-2072:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902170
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply

1 2 3 >

1 - 100 of 291 matches

Mail list logo