[jira] [Updated] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
[ https://issues.apache.org/jira/browse/FLINK-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sachin Goel updated FLINK-2187: --- Affects Version/s: 0.9 KMeans clustering is not present in release-0.9-rc1 --- Key: FLINK-2187 URL: https://issues.apache.org/jira/browse/FLINK-2187 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Sachin Goel The Flink ml readme [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] contains kmeans clustering in its description. However, the kmeans is not part of the ML library yet. It is still only present in examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577588#comment-14577588 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110096269 No worries :) I am used to it. And I think it's good to ensure high code quality! Allow comments in 'slaves' file --- Key: FLINK-2174 URL: https://issues.apache.org/jira/browse/FLINK-2174 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Trivial Fix For: 0.10 Currently, each line in slaves in interpreded as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2187) KMeans clustering is not present in release-0.9-rc1
Sachin Goel created FLINK-2187: -- Summary: KMeans clustering is not present in release-0.9-rc1 Key: FLINK-2187 URL: https://issues.apache.org/jira/browse/FLINK-2187 Project: Flink Issue Type: Bug Reporter: Sachin Goel The Flink ml readme [https://github.com/apache/flink/tree/release-0.9-rc1/flink-staging/flink-ml] contains kmeans clustering in its description. However, the kmeans is not part of the ML library yet. It is still only present in examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577863#comment-14577863 ] ASF GitHub Bot commented on FLINK-2093: --- Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. Add a difference method to Gelly's Graph class -- Key: FLINK-2093 URL: https://issues.apache.org/jira/browse/FLINK-2093 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.9 Reporter: Andra Lungu Assignee: Shivani Ghatge Priority: Minor This method will compute the difference between two graphs, returning a new graph containing the vertices and edges that the current graph and the input graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user fhueske commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110143059 Hi @shghatge, you don't need to close a PR in order to update it. You can simply update (push or push --force into) the branch from which you created the PR and Github will automatically update the PR. This helps to have all comments about your implementation in one place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2185) Rework semantics for .setSeed function of SVM
[ https://issues.apache.org/jira/browse/FLINK-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577893#comment-14577893 ] Dmitriy Lyubimov commented on FLINK-2185: - is svm work in Flik non-linear or linear only? Rework semantics for .setSeed function of SVM - Key: FLINK-2185 URL: https://issues.apache.org/jira/browse/FLINK-2185 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Labels: ML Fix For: 0.10 Currently setting the seed for the SVM does not have the expected result of producing the same output for each run of the algorithm, potentially due to the randomness in data partitioning. We should then rework the way the algorithm works to either ensure we get the exact same results when the seed is set, or if that is not possible the setSeed function should be removed. Also in the current implementation the default value of 0 for the seed means that all runs for which we don't set the seed will get the same seed which is not the intended behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1601) Sometimes the YARNSessionFIFOITCase fails on Travis
[ https://issues.apache.org/jira/browse/FLINK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577841#comment-14577841 ] Sachin Goel commented on FLINK-1601: This test is failing again on travis build. The issue seems different however. Here is the log: https://api.travis-ci.org/jobs/65946506/log.txt?deansi=true Sometimes the YARNSessionFIFOITCase fails on Travis --- Key: FLINK-1601 URL: https://issues.apache.org/jira/browse/FLINK-1601 Project: Flink Issue Type: Bug Reporter: Till Rohrmann Assignee: Robert Metzger Sometimes the {{YARNSessionFIFOITCase}} fails on Travis with the following exception. {code} Tests run: 8, Failures: 8, Errors: 0, Skipped: 0, Time elapsed: 71.375 sec FAILURE! - in org.apache.flink.yarn.YARNSessionFIFOITCase perJobYarnCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: 60.707 sec FAILURE! java.lang.AssertionError: During the timeout period of 60 seconds the expected string did not show up at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.yarn.YarnTestBase.runWithArgs(YarnTestBase.java:315) at org.apache.flink.yarn.YARNSessionFIFOITCase.perJobYarnCluster(YARNSessionFIFOITCase.java:185) testQueryCluster(org.apache.flink.yarn.YARNSessionFIFOITCase) Time elapsed: 0.507 sec FAILURE! java.lang.AssertionError: There is at least one application on the cluster is not finished at org.junit.Assert.fail(Assert.java:88) at org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:146) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) {code} The result is {code} Failed tests: YARNSessionFIFOITCase.perJobYarnCluster:185-YarnTestBase.runWithArgs:315 During the timeout period of 60 seconds the expected string did not show up YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not finished YARNSessionFIFOITCaseYarnTestBase.checkClusterEmpty:146 There is at least one application on the cluster is not
[GitHub] flink pull request: [docs] Fix some typos and grammar in the Strea...
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/806 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577486#comment-14577486 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge commented on the pull request: https://github.com/apache/flink/pull/807#issuecomment-110078878 Then it was just removing vertices! Talk about swatting a Fly with a Sledgehammer! I will do all the changes you suggested. Add a difference method to Gelly's Graph class -- Key: FLINK-2093 URL: https://issues.apache.org/jira/browse/FLINK-2093 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.9 Reporter: Andra Lungu Assignee: Shivani Ghatge Priority: Minor This method will compute the difference between two graphs, returning a new graph containing the vertices and edges that the current graph and the input graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2093) Add a difference method to Gelly's Graph class
[ https://issues.apache.org/jira/browse/FLINK-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577487#comment-14577487 ] ASF GitHub Bot commented on FLINK-2093: --- Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807 Add a difference method to Gelly's Graph class -- Key: FLINK-2093 URL: https://issues.apache.org/jira/browse/FLINK-2093 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.9 Reporter: Andra Lungu Assignee: Shivani Ghatge Priority: Minor This method will compute the difference between two graphs, returning a new graph containing the vertices and edges that the current graph and the input graph don't have in common. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2093][gelly] Added difference method
Github user shghatge closed the pull request at: https://github.com/apache/flink/pull/807 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Resolved] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels resolved FLINK-2174. --- Resolution: Implemented Fix Version/s: 0.10 Allow comments in 'slaves' file --- Key: FLINK-2174 URL: https://issues.apache.org/jira/browse/FLINK-2174 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Trivial Fix For: 0.10 Currently, each line in slaves in interpreded as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577479#comment-14577479 ] ASF GitHub Bot commented on FLINK-2174: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/796#issuecomment-110076740 Thank you! Sorry for giving you a hard time. Allow comments in 'slaves' file --- Key: FLINK-2174 URL: https://issues.apache.org/jira/browse/FLINK-2174 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Trivial Currently, each line in slaves in interpreded as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2174] Allow comments in 'slaves' file
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2174) Allow comments in 'slaves' file
[ https://issues.apache.org/jira/browse/FLINK-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577482#comment-14577482 ] ASF GitHub Bot commented on FLINK-2174: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/796 Allow comments in 'slaves' file --- Key: FLINK-2174 URL: https://issues.apache.org/jira/browse/FLINK-2174 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Trivial Currently, each line in slaves in interpreded as a host name. Scripts should skip lines starting with '#'. Also allow for comments at the end of a line and skip empty lines. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1916) EOFException when running delta-iteration job
[ https://issues.apache.org/jira/browse/FLINK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-1916: -- Affects Version/s: 0.10 EOFException when running delta-iteration job - Key: FLINK-1916 URL: https://issues.apache.org/jira/browse/FLINK-1916 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9, 0.10 Environment: 0.9-milestone-1 Exception on the cluster, local execution works Reporter: Stefan Bunk The delta-iteration program in [1] ends with an java.io.EOFException at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333) at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140) at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223) at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291) at org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307) at org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62) at org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26) at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89) at org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29) at org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219) at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) at java.lang.Thread.run(Thread.java:745) For logs and the accompanying mailing list discussion see below. When running with slightly different memory configuration, as hinted on the mailing list, I sometimes also get this exception: 19.Apr. 13:39:29 INFO Task - IterationHead(WorksetIteration (Resolved-Redirects)) (10/10) switched to FAILED : java.lang.IndexOutOfBoundsException: Index: 161, Size: 161 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352) at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301) at org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226) at org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536) at org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347) at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209) at org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217) at java.lang.Thread.run(Thread.java:745) [1] Flink job: https://gist.github.com/knub/0bfec859a563009c1d57 [2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4 [3] One task manager's logs: https://gist.github.com/knub/8f2f953da95c8d7adefc [4] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/EOFException-when-running-Flink-job-td1092.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576642#comment-14576642 ] ASF GitHub Bot commented on FLINK-1319: --- Github user twalthr commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-109875609 Great news :) Thanks Ufuk! Add static code analysis for UDFs - Key: FLINK-1319 URL: https://issues.apache.org/jira/browse/FLINK-1319 Project: Flink Issue Type: New Feature Components: Java API, Scala API Reporter: Stephan Ewen Assignee: Timo Walther Priority: Minor Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or frwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly. Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstandFields(0-3, 1, 2-1)}}. We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far. I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ALS 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder. This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly). Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with. *Appendix* Hompage to Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/ Papers on static analysis and for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6) Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1319][core] Add static code analysis fo...
Github user twalthr commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-109875609 Great news :) Thanks Ufuk! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2080) Execute Flink with sbt
[ https://issues.apache.org/jira/browse/FLINK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576699#comment-14576699 ] ASF GitHub Bot commented on FLINK-2080: --- Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/787#issuecomment-109893575 Merging. Execute Flink with sbt -- Key: FLINK-2080 URL: https://issues.apache.org/jira/browse/FLINK-2080 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 0.8.1 Reporter: Christian Wuertz Assignee: Christian Wuertz Priority: Minor Attachments: sbt.patch I tried to execute some of the flink example applications on my local machine using sbt. To get this running without class loading issues it was important to make sure that Flink is executed in its own JVM and not in the sbt JVM. This can be done very easily, but it would have been nice to know that in advance. So maybe you guys want to add this to the Flink documentation. An example can be found here: https://github.com/Teots/flink-sbt (The trick was to add fork in run := true to the build.sbt) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2181) SessionWindowing example has a memleak
Gabor Gevay created FLINK-2181: -- Summary: SessionWindowing example has a memleak Key: FLINK-2181 URL: https://issues.apache.org/jira/browse/FLINK-2181 Project: Flink Issue Type: Bug Components: Examples, Streaming Reporter: Gabor Gevay Assignee: Gabor Gevay The trigger policy objects belonging to already terminated sessions are kept in memory, and also NotifyOnLastGlobalElement gets called on them. This causes the program to eat up more and more memory, and also to get slower with time. Mailing list discussion: http://mail-archives.apache.org/mod_mbox/flink-dev/201505.mbox/%3CCADXjeyBW9uaA=ayde+9r+qh85pblyavuegsr7h8pwkyntm9...@mail.gmail.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...
Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109894121 Thanks for merging. I removed the line that failed the test (I mistakenly checked a non-deterministic id). https://github.com/gaborhermann/flink/commits/FLINK-2136 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2136) Test the streaming scala API
[ https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576705#comment-14576705 ] ASF GitHub Bot commented on FLINK-2136: --- Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109894121 Thanks for merging. I removed the line that failed the test (I mistakenly checked a non-deterministic id). https://github.com/gaborhermann/flink/commits/FLINK-2136 Test the streaming scala API Key: FLINK-2136 URL: https://issues.apache.org/jira/browse/FLINK-2136 Project: Flink Issue Type: Test Components: Scala API, Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann There are no test covering the streaming scala API. I would suggest to test whether the StreamGraph created by a certain operation looks as expected. Deeper layers and runtime should not be tested here, that is done in streaming-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894690 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,113 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: a href=../mlFlinkML/a - MinMax Scaler +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T : Vector]: DataSet[T] = Unit` +* `fit: DataSet[LabeledVector] = Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T : Vector]: DataSet[T] = DataSet[T]` +* `transform: DataSet[LabeledVector] = DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + table class=table table-bordered + thead +tr + th class=text-left style=width: 20%Parameters/th + th class=text-centerDescription/th +/tr + /thead + + tbody +tr + tdstrongMin/strong/td + td +p + The minimum value of the range for the scaled data set. (Default value: strong0.0/strong) +/p + /td +/tr +tr + tdstrongStd/strong/td --- End diff -- Shouldn't this be called `Max`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2153] Exclude Hbase annotations
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/800#issuecomment-109893732 I prefer 2 warnings to a sometimes error. :) Merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2153) Exclude dependency on hbase annotations module
[ https://issues.apache.org/jira/browse/FLINK-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576701#comment-14576701 ] ASF GitHub Bot commented on FLINK-2153: --- Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/800#issuecomment-109893732 I prefer 2 warnings to a sometimes error. :) Merging. Exclude dependency on hbase annotations module -- Key: FLINK-2153 URL: https://issues.apache.org/jira/browse/FLINK-2153 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Márton Balassi [ERROR] Failed to execute goal on project flink-hbase: Could not resolve dependencies for project org.apache.flink:flink-hbase:jar:0.9-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar There is a Spark issue for this [1] with a solution [2]. [1] https://issues.apache.org/jira/browse/SPARK-4455 [2] https://github.com/apache/spark/pull/3286/files -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolaas Steenbergen reassigned FLINK-2161: --- Assignee: Nikolaas Steenbergen Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) -- Key: FLINK-2161 URL: https://issues.apache.org/jira/browse/FLINK-2161 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Nikolaas Steenbergen Currently, there is no easy way to load and ship external libraries/jars with Flink's Scala shell. Assume that you want to run some Gelly graph algorithms from within the Scala shell, then you have to put the Gelly jar manually in the lib directory and make sure that this jar is also available on your cluster, because it is not shipped with the user code. It would be good to have a simple mechanism how to specify additional jars upon startup of the Scala shell. These jars should then also be shipped to the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: Add CONTRIBUTING.md file with pointers to our ...
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/778#issuecomment-109893654 Merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2080] Document how to use Flink with sb...
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/787#issuecomment-109893575 Merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2136) Test the streaming scala API
[ https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576755#comment-14576755 ] ASF GitHub Bot commented on FLINK-2136: --- Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109909701 Of course. I took up the JIRA issue. Test the streaming scala API Key: FLINK-2136 URL: https://issues.apache.org/jira/browse/FLINK-2136 Project: Flink Issue Type: Test Components: Scala API, Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann There are no test covering the streaming scala API. I would suggest to test whether the StreamGraph created by a certain operation looks as expected. Deeper layers and runtime should not be tested here, that is done in streaming-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...
Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109909701 Of course. I took up the JIRA issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2180) Streaming iterate test fails spuriously
[ https://issues.apache.org/jira/browse/FLINK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576758#comment-14576758 ] ASF GitHub Bot commented on FLINK-2180: --- GitHub user gaborhermann opened a pull request: https://github.com/apache/flink/pull/802 [wip] [FLINK-2180] [streaming] Iteration test fix * Fixed the iteration test at DataStreamTest * Will add more tests for DataStream * Will fix IterateTest You can merge this pull request into a Git repository by running: $ git pull https://github.com/gaborhermann/flink FLINK-2180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/802.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #802 commit bf612f4e4133712fc454ab0417aa563e08a526b4 Author: Gábor Hermann reckone...@gmail.com Date: 2015-06-08T07:37:40Z [streaming] Added iteration test for DataStream Streaming iterate test fails spuriously --- Key: FLINK-2180 URL: https://issues.apache.org/jira/browse/FLINK-2180 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann Following output seen occasionally: Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.667 sec FAILURE! - in org.apache.flink.streaming.api.IterateTest test(org.apache.flink.streaming.api.IterateTest) Time elapsed: 3.662 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.streaming.api.IterateTest.test(IterateTest.java:154) See: https://travis-ci.org/mbalassi/flink/jobs/65803465 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895785 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of Flink's MinMax Scaler + + import MinMaxScalerData._ + + it should scale the vectors' values to be restricted in the (0.0,1.0) range in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector - scaledVectors) { + val test = vector.asBreeze.forall(fv = { +fv = 0.0 fv = 1.0 --- End diff -- Maybe we could not only compare whether the data lies between `0` and `1` but also whether the vectors have been correctly scaled. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-1981] add support for GZIP files
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/762 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576760#comment-14576760 ] ASF GitHub Bot commented on FLINK-1981: --- Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-109912571 Thank you for your contribution. Add GZip support Key: FLINK-1981 URL: https://issues.apache.org/jira/browse/FLINK-1981 Project: Flink Issue Type: New Feature Components: Core Reporter: Sebastian Kruse Assignee: Sebastian Kruse Priority: Minor GZip, as a commonly used compression format, should be supported in the same way as the already supported deflate files. This allows to use GZip files with any subclass of FileInputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576759#comment-14576759 ] ASF GitHub Bot commented on FLINK-1981: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/762 Add GZip support Key: FLINK-1981 URL: https://issues.apache.org/jira/browse/FLINK-1981 Project: Flink Issue Type: New Feature Components: Core Reporter: Sebastian Kruse Assignee: Sebastian Kruse Priority: Minor GZip, as a commonly used compression format, should be supported in the same way as the already supported deflate files. This allows to use GZip files with any subclass of FileInputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1981] add support for GZIP files
Github user mxm commented on the pull request: https://github.com/apache/flink/pull/762#issuecomment-109912571 Thank you for your contribution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895797 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of Flink's MinMax Scaler + + import MinMaxScalerData._ + + it should scale the vectors' values to be restricted in the (0.0,1.0) range in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector - scaledVectors) { + val test = vector.asBreeze.forall(fv = { +fv = 0.0 fv = 1.0 + }) + test shouldEqual (true) +} + + } + + it should scale vectors' values in the (-1.0,1.0) range in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data2) +val minMaxScaler = MinMaxScaler().setMin(-1.0).setMax(1.0) +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data2.length) + +for (labeledVector - scaledVectors) { + val test = labeledVector.vector.asBreeze.forall(lv = { +lv = -1.0 lv = 1.0 --- End diff -- The same here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576766#comment-14576766 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896248 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, --- End diff -- are the inputs called predictors? Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576773#comment-14576773 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896759 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896733 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896759 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896845 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897514 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those --- End diff -- Will add reference --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576799#comment-14576799 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897535 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) --- End diff -- Good catch, will add. Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897546 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets --- End diff -- Yup, will remove --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Closed] (FLINK-2080) Execute Flink with sbt
[ https://issues.apache.org/jira/browse/FLINK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Márton Balassi closed FLINK-2080. - Resolution: Fixed Fix Version/s: 0.9 Implemented via 3e9af82. Execute Flink with sbt -- Key: FLINK-2080 URL: https://issues.apache.org/jira/browse/FLINK-2080 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 0.8.1 Reporter: Christian Wuertz Assignee: Christian Wuertz Priority: Minor Fix For: 0.9 Attachments: sbt.patch I tried to execute some of the flink example applications on my local machine using sbt. To get this running without class loading issues it was important to make sure that Flink is executed in its own JVM and not in the sbt JVM. This can be done very easily, but it would have been nice to know that in advance. So maybe you guys want to add this to the Flink documentation. An example can be found here: https://github.com/Teots/flink-sbt (The trick was to add fork in run := true to the build.sbt) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576798#comment-14576798 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897514 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those --- End diff -- Will add reference Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897535 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) --- End diff -- Good catch, will add. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897818 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2116] [ml] Reusing predict operation fo...
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/772#issuecomment-109926023 I created a [JIRA ticket](https://issues.apache.org/jira/browse/FLINK-2157?filter=12331885) for the evaluation of predictors. I think we should deal with it separately in order to keep the PR small and reviewable. But I agree, this is an important feature for any ML library. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31899145 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of Flink's MinMax Scaler + + import MinMaxScalerData._ + + it should scale the vectors' values to be restricted in the (0.0,1.0) range in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector - scaledVectors) { + val test = vector.asBreeze.forall(fv = { +fv = 0.0 fv = 1.0 --- End diff -- Yes it is. This assures that if someone changes something of the `Transformer` logic, then he will see an error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: Clean up naming of State/Checkpoint Interfaces
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/671#issuecomment-109933236 Sorry, I thought we had this mostly in already. My bad. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576999#comment-14576999 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902243 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576998#comment-14576998 ] ASF GitHub Bot commented on FLINK-2054: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109944731 Ok, I'm working on it. StreamOperator rework removed copy calls when passing output to a chained operator -- Key: FLINK-2054 URL: https://issues.apache.org/jira/browse/FLINK-2054 Project: Flink Issue Type: Bug Components: Streaming Reporter: Gyula Fora Assignee: Aljoscha Krettek Priority: Blocker Before the recent rework of stream operators to be push based, operators held the semantics that any input (and also output to be specific) will not be mutated afterwards. This was achieved by simply copying records that were passed to other chained operators. This feature has been removed thus introducing a major break in the operator mutability guarantees. To make chaining viable in all cases (and to prevent hidden bugs) we need to reintroduce the copying logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577040#comment-14577040 ] ASF GitHub Bot commented on FLINK-2054: --- Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109955810 I updated it. By default copying is on now. System operators can request to override the reuse setting. StreamOperator rework removed copy calls when passing output to a chained operator -- Key: FLINK-2054 URL: https://issues.apache.org/jira/browse/FLINK-2054 Project: Flink Issue Type: Bug Components: Streaming Reporter: Gyula Fora Assignee: Aljoscha Krettek Priority: Blocker Before the recent rework of stream operators to be push based, operators held the semantics that any input (and also output to be specific) will not be mutated afterwards. This was achieved by simply copying records that were passed to other chained operators. This feature has been removed thus introducing a major break in the operator mutability guarantees. To make chaining viable in all cases (and to prevent hidden bugs) we need to reintroduce the copying logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109955810 I updated it. By default copying is on now. System operators can request to override the reuse setting. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894794 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,113 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: a href=../mlFlinkML/a - MinMax Scaler +--- +!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +License); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +-- + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T : Vector]: DataSet[T] = Unit` +* `fit: DataSet[LabeledVector] = Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T : Vector]: DataSet[T] = DataSet[T]` +* `transform: DataSet[LabeledVector] = DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + table class=table table-bordered + thead +tr + th class=text-left style=width: 20%Parameters/th + th class=text-centerDescription/th +/tr + /thead + + tbody +tr + tdstrongMin/strong/td + td +p + The minimum value of the range for the scaled data set. (Default value: strong0.0/strong) +/p + /td +/tr +tr + tdstrongStd/strong/td --- End diff -- Yes, you are right! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894783 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. +* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T : Vector] = new FitOperation[MinMaxScaler,
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109913485 Really good work @fobeligi. The code is really well structured and documented. I had only some minor comments. When you have them addressed, I think it's good to be merged :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896248 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, --- End diff -- are the inputs called predictors? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576778#comment-14576778 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896948 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896965 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576779#comment-14576779 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896965 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896996 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576795#comment-14576795 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897426 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, --- End diff -- It's more of a statistics terminology, see [synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms). In ML features is more common so I will change it to that. Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897426 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, --- End diff -- It's more of a statistics terminology, see [synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms). In ML features is more common so I will change it to that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576818#comment-14576818 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31898968 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply
[GitHub] flink pull request: Clean up naming of State/Checkpoint Interfaces
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/671#issuecomment-109932208 Why? There is still a naming difference between methods that to similar things at different levels. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109955074 As a note: I've just run into this issue at chained streaming aggregations when writing tests (not copying created wrong results). I will check again after this has been solved. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577035#comment-14577035 ] ASF GitHub Bot commented on FLINK-2054: --- Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109955074 As a note: I've just run into this issue at chained streaming aggregations when writing tests (not copying created wrong results). I will check again after this has been solved. StreamOperator rework removed copy calls when passing output to a chained operator -- Key: FLINK-2054 URL: https://issues.apache.org/jira/browse/FLINK-2054 Project: Flink Issue Type: Bug Components: Streaming Reporter: Gyula Fora Assignee: Aljoscha Krettek Priority: Blocker Before the recent rework of stream operators to be push based, operators held the semantics that any input (and also output to be specific) will not be mutated afterwards. This was achieved by simply copying records that were passed to other chained operators. This feature has been removed thus introducing a major break in the operator mutability guarantees. To make chaining viable in all cases (and to prevent hidden bugs) we need to reintroduce the copying logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895117 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. +* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T : Vector] = new FitOperation[MinMaxScaler,
[GitHub] flink pull request: [FLINK-2136] Adding DataStream tests for Scala...
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109907919 Please open a PR with it. Could you give FLINK-2180 a try in the meantime please? It looks similar on paper. You can add then to the same pull request of course. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2136) Test the streaming scala API
[ https://issues.apache.org/jira/browse/FLINK-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576747#comment-14576747 ] ASF GitHub Bot commented on FLINK-2136: --- Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/771#issuecomment-109907919 Please open a PR with it. Could you give FLINK-2180 a try in the meantime please? It looks similar on paper. You can add then to the same pull request of course. :) Test the streaming scala API Key: FLINK-2136 URL: https://issues.apache.org/jira/browse/FLINK-2136 Project: Flink Issue Type: Test Components: Scala API, Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann There are no test covering the streaming scala API. I would suggest to test whether the StreamGraph created by a certain operation looks as expected. Deeper layers and runtime should not be tested here, that is done in streaming-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (FLINK-2180) Streaming iterate test fails spuriously
[ https://issues.apache.org/jira/browse/FLINK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gábor Hermann reassigned FLINK-2180: Assignee: Gábor Hermann Streaming iterate test fails spuriously --- Key: FLINK-2180 URL: https://issues.apache.org/jira/browse/FLINK-2180 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann Following output seen occasionally: Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.667 sec FAILURE! - in org.apache.flink.streaming.api.IterateTest test(org.apache.flink.streaming.api.IterateTest) Time elapsed: 3.662 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.flink.streaming.api.IterateTest.test(IterateTest.java:154) See: https://travis-ci.org/mbalassi/flink/jobs/65803465 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576762#comment-14576762 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896076 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) --- End diff -- period missing Add a quickstart guide for FlinkML -- Key: FLINK-2072 URL: https://issues.apache.org/jira/browse/FLINK-2072 Project: Flink Issue Type: New Feature Components: Documentation, Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Fix For: 0.9 We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576770#comment-14576770 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896535 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896535 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576780#comment-14576780 ] ASF GitHub Bot commented on FLINK-2072: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31896996 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897731 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576803#comment-14576803 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31897731 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply
[jira] [Resolved] (FLINK-1685) Document how to read gzip/compressed files with Flink
[ https://issues.apache.org/jira/browse/FLINK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels resolved FLINK-1685. --- Resolution: Fixed Fix Version/s: 0.9 Document how to read gzip/compressed files with Flink - Key: FLINK-1685 URL: https://issues.apache.org/jira/browse/FLINK-1685 Project: Flink Issue Type: Bug Components: Documentation Reporter: Robert Metzger Assignee: Sebastian Kruse Fix For: 0.9 Too many users asked for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (FLINK-1981) Add GZip support
[ https://issues.apache.org/jira/browse/FLINK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels closed FLINK-1981. - Resolution: Implemented Fix Version/s: 0.9 Add GZip support Key: FLINK-1981 URL: https://issues.apache.org/jira/browse/FLINK-1981 Project: Flink Issue Type: New Feature Components: Core Reporter: Sebastian Kruse Assignee: Sebastian Kruse Priority: Minor Fix For: 0.9 GZip, as a commonly used compression format, should be supported in the same way as the already supported deflate files. This allows to use GZip files with any subclass of FileInputFormat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [contrib] Storm compatibility
Github user mbalassi commented on the pull request: https://github.com/apache/flink/pull/764#issuecomment-109927735 Hey @mjsax, now the changes are on the master, can you make the changes on your side please? Then it is ready to merge on my side. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576817#comment-14576817 ] ASF GitHub Bot commented on FLINK-2054: --- GitHub user aljoscha opened a pull request: https://github.com/apache/flink/pull/803 [FLINK-2054] Add object-reuse switch for streaming The switch was already there: enableObjectReuse() in ExecutionConfig. This was simply not considered by the streaming runtime. This change now draws a copy before forwarding an element to a chained operator when object reuse is disabled. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aljoscha/flink streaming-reuse-switch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/803.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #803 commit c1896ef6355f47f89215acff7eba9f944c461196 Author: Aljoscha Krettek aljoscha.kret...@gmail.com Date: 2015-06-08T09:32:27Z [FLINK-2054] Add object-reuse switch for streaming The switch was already there: enableObjectReuse() in ExecutionConfig. This was simply not considered by the streaming runtime. This change now draws a copy before forwarding an element to a chained operator when object reuse is disabled. StreamOperator rework removed copy calls when passing output to a chained operator -- Key: FLINK-2054 URL: https://issues.apache.org/jira/browse/FLINK-2054 Project: Flink Issue Type: Bug Components: Streaming Reporter: Gyula Fora Assignee: Aljoscha Krettek Priority: Blocker Before the recent rework of stream operators to be push based, operators held the semantics that any input (and also output to be specific) will not be mutated afterwards. This was achieved by simply copying records that were passed to other chained operators. This feature has been removed thus introducing a major break in the operator mutability guarantees. To make chaining viable in all cases (and to prevent hidden bugs) we need to reintroduce the copying logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
GitHub user aljoscha opened a pull request: https://github.com/apache/flink/pull/803 [FLINK-2054] Add object-reuse switch for streaming The switch was already there: enableObjectReuse() in ExecutionConfig. This was simply not considered by the streaming runtime. This change now draws a copy before forwarding an element to a chained operator when object reuse is disabled. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aljoscha/flink streaming-reuse-switch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/803.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #803 commit c1896ef6355f47f89215acff7eba9f944c461196 Author: Aljoscha Krettek aljoscha.kret...@gmail.com Date: 2015-06-08T09:32:27Z [FLINK-2054] Add object-reuse switch for streaming The switch was already there: enableObjectReuse() in ExecutionConfig. This was simply not considered by the streaming runtime. This change now draws a copy before forwarding an element to a chained operator when object reuse is disabled. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2161) Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML)
[ https://issues.apache.org/jira/browse/FLINK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576837#comment-14576837 ] Aljoscha Krettek commented on FLINK-2161: - The Scala REPL has support for adding jars at runtime. We would just need to add support for sending them along with the compiled classes. Flink Scala Shell does not support external jars (e.g. Gelly, FlinkML) -- Key: FLINK-2161 URL: https://issues.apache.org/jira/browse/FLINK-2161 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Nikolaas Steenbergen Currently, there is no easy way to load and ship external libraries/jars with Flink's Scala shell. Assume that you want to run some Gelly graph algorithms from within the Scala shell, then you have to put the Gelly jar manually in the lib directory and make sure that this jar is also available on your cluster, because it is not shipped with the user code. It would be good to have a simple mechanism how to specify additional jars upon startup of the Scala shell. These jars should then also be shipped to the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576962#comment-14576962 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109937398 Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser to ML library* so that JIRA picks up on the issue. Add Normaliser to ML library Key: FLINK-1844 URL: https://issues.apache.org/jira/browse/FLINK-1844 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Faye Beligianni Assignee: Faye Beligianni Priority: Minor Labels: ML, Starter In many algorithms in ML, the features' values would be better to lie between a given range of values, usually in the range (0,1) [1]. Therefore, a {{Transformer}} could be implemented to achieve that normalisation. Resources: [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109937398 Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser to ML library* so that JIRA picks up on the issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902170 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902243 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) --- End diff -- My thoughts were that we will provide the whole
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576997#comment-14576997 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902179 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply
[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902179 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply import the dataset then using: + +{% highlight scala %} + +val adultTrain = MLUtils.readLibSVM(path/to/a8a) +val adultTest = MLUtils.readLibSVM(path/to/a8a.t) + +{% endhighlight %} + +This gives us a
[GitHub] flink pull request: [FLINK-2054] Add object-reuse switch for strea...
Github user aljoscha commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109944731 Ok, I'm working on it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2054) StreamOperator rework removed copy calls when passing output to a chained operator
[ https://issues.apache.org/jira/browse/FLINK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14577045#comment-14577045 ] ASF GitHub Bot commented on FLINK-2054: --- Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/803#issuecomment-109956597 Thanks. It's working now. StreamOperator rework removed copy calls when passing output to a chained operator -- Key: FLINK-2054 URL: https://issues.apache.org/jira/browse/FLINK-2054 Project: Flink Issue Type: Bug Components: Streaming Reporter: Gyula Fora Assignee: Aljoscha Krettek Priority: Blocker Before the recent rework of stream operators to be push based, operators held the semantics that any input (and also output to be specific) will not be mutated afterwards. This was achieved by simply copying records that were passed to other chained operators. This feature has been removed thus introducing a major break in the operator mutability guarantees. To make chaining viable in all cases (and to prevent hidden bugs) we need to reintroduce the copying logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [contrib] Storm compatibility
Github user mjsax commented on the pull request: https://github.com/apache/flink/pull/764#issuecomment-109940058 It should be possible to put your last commit on top, after the rebase is done. I will do it in a new branch to preserve this one and cherry pick your last commit at the end. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (FLINK-2182) Add stateful Streaming Sequence Source
Aljoscha Krettek created FLINK-2182: --- Summary: Add stateful Streaming Sequence Source Key: FLINK-2182 URL: https://issues.apache.org/jira/browse/FLINK-2182 Project: Flink Issue Type: Improvement Components: eaming, Streaming Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
[ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14576996#comment-14576996 ] ASF GitHub Bot commented on FLINK-2072: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/792#discussion_r31902170 --- Diff: docs/libs/ml/quickstart.md --- @@ -24,4 +24,198 @@ under the License. * This will be replaced by the TOC {:toc} -Coming soon. +## Introduction + +FlinkML is designed to make learning from your data a straight-forward process, abstracting away +the complexities that usually come with having to deal with big data learning tasks. In this +quick-start guide we will show just how easy it is to solve a simple supervised learning problem +using FlinkML. But first some basics, feel free to skip the next few lines if you're already +familiar with Machine Learning (ML) + +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those +learned patterns to make predictions about the future. We can categorize most ML algorithms into +two major categories: Supervised and Unsupervised Learning. + +* Supervised Learning deals with learning a function (mapping) from a set of inputs +(predictors) to a set of outputs. The learning is done using a __training set__ of (input, +output) pairs that we use to approximate the mapping function. Supervised learning problems are +further divided into classification and regression problems. In classification problems we try to +predict the __class__ that an example belongs to, for example whether a user is going to click on +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent +variable, for example what the temperature will be tomorrow. + +* Unsupervised learning deals with discovering patterns and regularities in the data. An example +of this would be __clustering__, where we try to discover groupings of the data from the +descriptive features. Unsupervised learning can also be used for feature selection, for example +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis). + +## Loading data + +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized +functions for formatted data, such as the LibSVM format. For supervised learning problems it is +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector` +object will have a FlinkML `Vector` member representing the features of the example and a `Double` +member which represents the label, which could be the class in a classification problem, or the dependent +variable for a regression problem. + +# TODO: Get dataset that has separate train and test sets +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data). + +We can load the data as a `DataSet[String]` first: + +{% highlight scala %} + +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)](/path/to/breast-cancer-wisconsin.data) + +{% endhighlight %} + +The dataset has some missing values indicated by `?`. We can filter those rows out and +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the +dataset with the FlinkML classification algorithms. + +{% highlight scala %} + +val cancerLV = cancer + .map(_.productIterator.toList) + .filter(!_.contains(?)) + .map{list = +val numList = list.map(_.asInstanceOf[String].toDouble) +LabeledVector(numList(11), DenseVector(numList.take(10).toArray)) +} + +{% endhighlight %} + +We can then use this data to train a learner. + +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object. +You can also save datasets in the LibSVM format using the `writeLibSVM` function. +Let's import the Adult (a9a) dataset. You can download the +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a) +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t). + +We can simply