[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/792#issuecomment-111068643
  
Perfect, thanks. Will merge it now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/792#issuecomment-111043422
  
Addressed the last PR comments.




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/792




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32196821
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
--- End diff --

Maybe shorten this to just "come with big data learning tasks"?




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197018
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
--- End diff --

Nicely done with the site version :+1: 




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197272
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
--- End diff --

Why create a local environment? Why not use 
`ExecutionEnvironment.getExecutionEnvironment`?
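
For context on the question above, here is a minimal sketch of the two ways to obtain an environment, assuming the Flink Scala API is on the classpath. `EnvironmentSketch` and `chooseEnvironment` are hypothetical names used only for illustration and are not part of the PR. `getExecutionEnvironment` returns an environment that depends on where the program runs (a local one when started standalone, the cluster's when submitted through the client), whereas `createLocalEnvironment(2)` always runs locally with parallelism 2.

```scala
import org.apache.flink.api.scala._

// Hypothetical helper object, for illustration only.
object EnvironmentSketch {

  // Returns a fixed local environment when forceLocal is true,
  // otherwise the context-dependent environment.
  def chooseEnvironment(forceLocal: Boolean): ExecutionEnvironment =
    if (forceLocal) {
      // Always local, with parallelism 2 (as in the quoted diff).
      ExecutionEnvironment.createLocalEnvironment(2)
    } else {
      // Local when run standalone, the cluster environment when submitted.
      ExecutionEnvironment.getExecutionEnvironment
    }

  def main(args: Array[String]): Unit = {
    val env = chooseEnvironment(forceLocal = false)
    // Only builds a small data set; no job is executed here.
    val sample = env.fromElements("30,64,1,1", "30,62,3,1")
  }
}
```

With the context-dependent variant the same quickstart code could run unchanged from an IDE and on a cluster, which is presumably the point of the question.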




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197381
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+val list = tuple.productIterator.toList
+val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197502
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+val list = tuple.productIterator.toList
+val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197542
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
--- End diff --

I am indeed very good at copy-pasting your code :P




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197549
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+val list = tuple.productIterator.toList
+val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197636
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
--- End diff --

This was copied verbatim from the dataset description; I will change it 
to "a study".




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197679
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
--- End diff --

Good idea, I will use that instead.
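
For reference, the column layout described in the quoted hunk (first three columns are features, fourth column is the class label) can be sketched in plain Scala without any Flink dependency. This is only an illustrative sketch: `HabermanParsing` and `parseLine` are hypothetical names that do not come from the PR, and the sample record is just an example of the four-column format.

```scala
// Plain-Scala sketch; it does not use Flink or FlinkML.
object HabermanParsing {
  // Hypothetical helper: splits one comma-separated record into
  // (features, label). The first three columns are treated as features,
  // the fourth column as the class label (1 or 2).
  def parseLine(line: String): (Array[Double], Double) = {
    val fields = line.split(",").map(_.trim.toDouble)
    require(fields.length == 4, s"expected 4 columns, got ${fields.length}")
    (fields.take(3), fields(3))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative record in the four-column layout described above.
    val (features, label) = parseLine("30,64,1,1")
    println(s"features=${features.mkString("[", ", ", "]")} label=$label")
  }
}
```

The same split into features and label is what the quickstart later performs on the tuples returned by `readCsvFile`, there with `LabeledVector` as the target type.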




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197563
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
--- End diff --

Good catch, will change.
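
For reference, the structure the quoted paragraph describes (one `Double` label paired with a vector of features) can be sketched in plain Scala. This is a minimal sketch under stated assumptions, not FlinkML's actual class: `LabeledExample` is a hypothetical stand-in for `LabeledVector`, and the feature values are invented for illustration.

```scala
object LabeledExampleSketch {
  // Hypothetical stand-in for FlinkML's LabeledVector: one Double label
  // plus a vector of feature values.
  final case class LabeledExample(label: Double, features: Vector[Double])

  def main(args: Array[String]): Unit = {
    // Classification: the label encodes a class (here 1.0 for "clicked").
    val clicked = LabeledExample(1.0, Vector(0.3, 12.0, 1.0))
    // Regression: the label is the dependent variable (a temperature).
    val temperature = LabeledExample(21.5, Vector(1013.0, 0.4))
    println(clicked)
    println(temperature)
  }
}
```

The same container serves both problem types; only the interpretation of the label changes.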




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197690
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+    val list = tuple.productIterator.toList
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/792#issuecomment-111035938
  
Great work @thvasilo. I really like the quickstart guide. I had only some 
minor comments. Once they are addressed, it's good to be merged :-)




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32198054
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+    val list = tuple.productIterator.toList
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32198150
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
--- End diff --

Hmm I was not aware of this ;-)




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32198235
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
--- End diff --

Good catch




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-11 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32198348
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and 
using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
+(features) to a set of outputs. The learning is done using a *training 
set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the *class* that an example belongs to, for example whether a user 
is going to click on
+an ad or not. Regression problems one the other hand, are about predicting 
(real) numerical
+values, often called the dependent variable, for example what the 
temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
+
+{% highlight xml %}
+<dependency>
+  <groupId>org.apache.flink</groupId>
+  <artifactId>flink-ml</artifactId>
+  <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *contains cases from study conducted on the survival of 
patients who had undergone
+surgery for breast cancer*. The data comes in a comma-separated file, 
where the first 3 columns
+are the features and last column is the class, and the 4th column 
indicates whether the patient
+survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
+page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+import org.apache.flink.api.scala.ExecutionEnvironment
+
+val env = ExecutionEnvironment.createLocalEnvironment(2)
+
+val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
+
+{% endhighlight %}
+
+We can now transform the data into a `DataSet[LabeledVector]`. This will 
allow us to use the
+dataset with the FlinkML classification algorithms. We know that the 4th 
element of the dataset
+is the class label, and the rest are features, so we can build 
`LabeledVector` elements like this:
+
+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val survivalLV = survival
+  .map{tuple =>
+    val list = tuple.productIterator.toList
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-09 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32018559
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-09 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32008548
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896733
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896759
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896845
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897514
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
--- End diff --

Will add reference




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897546
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
--- End diff --

Yup, will remove




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
--- End diff --

Good catch, will add.




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897818
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896248
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

are the inputs called predictors?




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896965
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896996
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897426
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
--- End diff --

It's more of a statistics term; see 
[synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms).
 In ML, "features" is more common, so I will change it to that.




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896535
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897731
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902170
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902243
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
--- End diff --

My thoughts were that we will provide the whole 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31902179
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains("?"))
+  .map{list =>
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 
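
A minimal sketch of the missing-value filtering and label/feature split that the quoted diff describes, written in plain Scala so it runs without Flink. `LabeledPoint` and `CancerRows` are hypothetical names; `LabeledPoint` only stands in for FlinkML's `LabeledVector`. The column layout (sample ID, nine features, class label) is an assumption about `breast-cancer-wisconsin.data`, and the two sample rows are made up for illustration.

```scala
// Hypothetical stand-in for FlinkML's LabeledVector, so the sketch runs without Flink.
final case class LabeledPoint(label: Double, features: Vector[Double])

object CancerRows {
  // Parses one comma-separated row.
  // Returns None when the row does not have 11 fields or contains the
  // missing-value marker "?".
  def parse(line: String): Option[LabeledPoint] = {
    val fields = line.split(",").map(_.trim).toVector
    if (fields.length != 11 || fields.contains("?")) None
    else {
      val nums = fields.map(_.toDouble)
      // Assumed layout: column 0 is the sample ID, columns 1 to 9 are the
      // features, column 10 is the class label.
      Some(LabeledPoint(nums(10), nums.slice(1, 10)))
    }
  }

  def main(args: Array[String]): Unit = {
    // Illustrative rows only, not taken from the dataset.
    val rows = Seq(
      "1,5,1,1,1,2,1,3,1,1,2",
      "2,8,4,5,1,2,?,7,3,1,4"
    )
    // The row containing "?" is dropped; the other is converted.
    rows.flatMap(r => parse(r)).foreach(println)
  }
}
```

Under these assumptions the ID column is kept out of the feature vector and the label is read from the last column. In a Flink job, the same per-row logic would sit inside the `map`/`filter` chain shown in the diff.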

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/792#issuecomment-109920114
  
Great work @thvasilo. I like the quickstart guide a lot. There are only 
some minor comments I had. Then it's good to be merged :-)




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896076
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
--- End diff --

period missing




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896095
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
--- End diff --

missing link?




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896684
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a9a) dataset. You can download the 
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM(path/to/a8a)
+val adultTest = MLUtils.readLibSVM(path/to/a8a.t)
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897350
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897308
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31899035
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896308
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
--- End diff --

Isn't the TODO fixed?




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896876
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, 
String, String, String, String, String, 
String)](/path/to/breast-cancer-wisconsin.data)
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  .filter(!_.contains(?))
+  .map{list =
+val numList = list.map(_.asInstanceOf[String].toDouble)
+LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+}
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a8a) dataset. You can download the
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can then import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896948
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [cite ML-APP], ML deals with detecting patterns in data and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data 
Set, which you can
+[download from the UCI ML 
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can first load the data as a `DataSet` of `String` tuples, one field per column:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String, String,
+  String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those 
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow 
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
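+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  // Drop rows that contain a missing value, marked by "?"
+  .filter(!_.contains("?"))
+  .map { list =>
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+    // 11 fields, indices 0 to 10: the last is the class label, the first ten are used as features
+    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
+  }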
+
+{% endhighlight %}
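
The filter-then-convert step quoted above can be tried without a Flink cluster. The following is only a minimal sketch on plain Scala collections: `LabeledRow` and `toLabeledRows` are hypothetical names introduced here for illustration (they are not part of FlinkML), and the sketch assumes the label is the last field of each row and every other field is a feature.

```scala
// Hypothetical stand-in for FlinkML's LabeledVector, used only for illustration.
final case class LabeledRow(label: Double, features: Vector[Double])

object MissingValueSketch {
  // Convert raw CSV rows into labeled rows, dropping any row with a missing field.
  // Assumption: the label is the last field and all other fields are features.
  def toLabeledRows(rows: Seq[Seq[String]]): Seq[LabeledRow] =
    rows
      .filter(row => row.nonEmpty && !row.contains("?"))
      .map { row =>
        val nums = row.map(_.trim.toDouble)
        LabeledRow(nums.last, nums.init.toVector)
      }

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Seq("5", "1", "1", "2"),
      Seq("8", "?", "3", "4"), // dropped: contains a missing value
      Seq("3", "2", "1", "2")
    )
    toLabeledRows(raw).foreach(println)
  }
}
```

Filtering before the numeric conversion matters: calling `toDouble` on `"?"` would throw a `NumberFormatException`, so rows with missing values have to be removed first.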
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of 
datasets using that format can be
+found [in the LibSVM datasets 
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML 
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function 
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` 
function.
+Let's import the Adult (a8a) dataset. You can download the
+[training set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set 
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can then import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
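
For readers unfamiliar with the LibSVM format mentioned above, each line holds a label followed by sparse `index:value` pairs with 1-based indices. The sketch below parses a single such line in plain Scala. It is not how `readLibSVM` is implemented; `LibSVMLineSketch` and `parseLine` are hypothetical names, and the sketch assumes a non-empty, well-formed line.

```scala
object LibSVMLineSketch {
  // Parse one LibSVM line of the form "<label> <index>:<value> <index>:<value> ...".
  // Returns the label and the sparse features, keyed by the 1-based index from the file.
  // Assumption: the line is non-empty and well-formed; no error handling is attempted.
  def parseLine(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")
      index.toInt -> value.toDouble
    }.toMap
    (label, features)
  }

  def main(args: Array[String]): Unit = {
    val (label, features) = parseLine("-1 3:1 11:1 14:1")
    println(s"label=$label, features=$features")
  }
}
```

Features that do not appear on a line are implicitly zero, which is why a sparse map (or a sparse vector) is a natural in-memory representation.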
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897278
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
+We can then import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
--- End diff --

Maybe we should also give the imports the 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897648
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31898968
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31900406
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31909835
  

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31909808
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
+the complexities that usually come with having to deal with big data 
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines 
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in 
data, and using those
+learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set 
of inputs
+(predictors) to a set of outputs. The learning is done using a __training 
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
+further divided into classification and regression problems. In 
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a 
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical 
values,  often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities 
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the 
data from the
+descriptive features. Unsupervised learning can also be used for feature 
selection, for example
+through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities 
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
+common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
+member which represents the label, which could be the class in a 
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
+[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet` of `String` tuples first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String,
+  String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
+dataset with the FlinkML classification algorithms.
+
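+{% highlight scala %}
+
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.DenseVector
+
+val cancerLV = cancer
+  .map(_.productIterator.toList)
+  // Drop rows that contain a missing value
+  .filter(!_.contains("?"))
+  .map { list =>
+    val numList = list.map(_.asInstanceOf[String].toDouble)
+    // The tuple has 11 fields, so the label in the last column is at index 10
+    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
+  }
+
+{% endhighlight %}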
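
A minimal sketch of the same filter-and-convert step in plain Scala, without Flink, for checking the column handling on a few rows. `LabeledRow`, `CancerRows` and `parseRow` are hypothetical names, not FlinkML API, and the 11-column layout (sample id, nine features, class label) is an assumption about the data file. Unlike the diff above, this sketch drops the id column instead of treating it as a feature.

```scala
import scala.util.Try

// Hypothetical stand-in for FlinkML's LabeledVector so the sketch runs without Flink.
final case class LabeledRow(label: Double, features: Vector[Double])

object CancerRows {
  // Assumed layout: sample id, nine feature columns, class label.
  private val ExpectedColumns = 11

  /** Returns None for rows with a missing value ("?"), a wrong column count,
    * or a field that is not numeric. */
  def parseRow(line: String): Option[LabeledRow] = {
    val fields = line.split(",").map(_.trim).toVector
    if (fields.length != ExpectedColumns || fields.contains("?")) None
    else
      Try(fields.map(_.toDouble)).toOption.map { nums =>
        // Skip the id in column 0; the class label is the last column.
        LabeledRow(nums.last, nums.slice(1, ExpectedColumns - 1))
      }
  }
}
```

Rows that fail to parse come back as `None`, so a collection of lines can be cleaned with `lines.flatMap(CancerRows.parseRow)`.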
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format, and a number of datasets using that format can be
+found [on the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function available through the `MLUtils` object.
+You can also save datasets in the LibSVM format using the `writeLibSVM` function.
+Let's import the Adult (a8a) dataset. You can download the
+[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
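
For orientation, each line of a LibSVM file has the form `label index:value index:value ...`, listing only the non-zero features with 1-based indices. The sketch below parses one such line in plain Scala. `LibSvmLine` is a hypothetical helper, not part of `MLUtils`, and shifting the indices to start at zero is a choice made here, not a statement about what `readLibSVM` does.

```scala
object LibSvmLine {
  /** Parses "label index:value index:value ..." into (label, sparse features).
    * Throws on blank or malformed lines; a real loader would need to handle those. */
  def parse(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { token =>
      val parts = token.split(":")
      // LibSVM indices start at 1; store them 0-based in this sketch.
      (parts(0).toInt - 1) -> parts(1).toDouble
    }.toMap
    (label, features)
  }
}
```

Keeping the features as a sparse map mirrors the format itself, which never writes out zero entries.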
+We can then simply import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31909954
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straightforward process, abstracting away
+the complexities that usually come with having to deal with big data learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple supervised learning problem
+using FlinkML. But first some basics; feel free to skip the next few lines if you're already
+familiar with Machine Learning (ML).
+
+We can then simply import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
--- End diff --

Would that be a generic example for SVMs? 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31911936
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
+We can then simply import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
--- End diff --

That's a good question. By making it generic we 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31925630
  
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
+We can then simply import the dataset using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
--- End diff --

Hmm for FlinkML it's probably ok to have 

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

2015-06-05 Thread thvasilo
GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/792

[FLINK-2072] [ml] [docs] Add a quickstart guide for FlinkML

This is an initial version of the quickstart guide. Some issues still need to be
addressed, such as the validity of standardizing the data and whether the complete
code example should be included in an examples package for FlinkML.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink quickstart-ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #792


commit 27487ec6089adbea77266f194582ae476e50e928
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-05T09:09:11Z

Initial version of quickstart guide



