[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-22 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-104550265
  
LGTM. Will merge it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/688




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread rmetzger
Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30783024
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
--- End diff --

maybe we should add a headline that makes clear that this is a roadmap




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-104185370
  
Thank you for writing this. Finally I get a better understanding of the 
overall status of the FlinkML stuff ;)

+1 to merge.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784369
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
--- End diff --

Do you think we need this paragraph here?




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-104193448
  
Really nice text @thvasilo. I think it's a great introduction.

But I also think that we should change the outline of our starting page a 
little. IMO, it should be governed by what a user expects from such a 
page. If I were a new user, I would expect something along the lines 
of:

1. Short introduction (your text is good if we condense the last paragraphs)
2. List of supported algorithms
3. Getting started (what dependencies to include etc.)
4. Tutorial/Example use case
5. Roadmap

I'm not so sure whether the list of algorithms should be number 2 but 
that's something to discuss. What do you think?




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30786608
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
--- End diff --

A separate example section is a good idea. Still, I would like to keep this 
very small example here, to make it clear that getting up and running with the 
library is just a few lines of code.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784160
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabelbedVector is a feature vector with a label (class or real value)
+val data: DataSet[LabelVector] = ...
--- End diff --

`LabeledVector`




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784150
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabelbedVector is a feature vector with a label (class or real value)
--- End diff --

typo: `LabeledVector`




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784183
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabelbedVector is a feature vector with a label (class or real value)
+val data: DataSet[LabelVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+  .add(MultipleLinearRegression.Stepsize, 1.0)
+  .add(MultipleLinearRegression.Iterations, 10)
+  .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
--- End diff --

With the new pipelining, this has to be updated.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784588
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
--- End diff --

Good documentation should be self-evident IMHO. I guess we can merge this 
paragraph with the previous one, can't we?




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784829
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
--- End diff --

Good idea to show how to use FlinkML. I would extend the example a little 
bit and put it into a separate section, maybe making a small tutorial out of 
it, including the things you have to add to your pom, etc.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30784959
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabelbedVector is a feature vector with a label (class or real value)
+val data: DataSet[LabelVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+  .add(MultipleLinearRegression.Stepsize, 1.0)
+  .add(MultipleLinearRegression.Iterations, 10)
+  .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
+{% endhighlight %}
+
+The roadmap below can provide an indication of the algorithms we aim to 
implement in the coming
+months. Items in **bold** have already been implemented:
+
+
+* Pipelines of transformers and learners
+* Data pre-processing
+  * **Feature scaling**
+  * **Polynomial feature base mapper**
+  * Feature hashing
+  * Feature extraction for text
+  * Dimensionality reduction
+* Model selection and performance evaluation
+  * Cross-validation for model selection and evaluation
+* Supervised learning
+  * Optimization framework
+    * **Stochastic Gradient Descent**
+    * L-BFGS
+  * Generalized Linear Models
+    * **Multiple linear regression**
+    * LASSO, Ridge regression
+    * Multi-class Logistic regression
+  * Random forests
+  * **Support Vector Machines**
+* Unsupervised learning
+  * Clustering
+    * K-means clustering
+  * PCA
+* Recommendation
+  * **ALS**
+* Text analytics
+  * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
--- End diff --

I like the idea of a roadmap so that people see where we're steering. But I 
would probably put it on a separate page where we can also elaborate a little 
on what distributed linear algebra means and how it integrates.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30785023
  
--- Diff: docs/libs/ml/index.md ---
@@ -30,7 +122,7 @@ under the License.
 </dependency>
 {% endhighlight %}
 
-## Algorithms
+## Algorithm Documentation
--- End diff --

I think we should move the algorithms more prominently toward the top of 
the page.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30786490
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 -->
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
--- End diff --

I think the reason this is here is to communicate to people who would also 
like to contribute to the library that we consider documentation an integral 
part of the library and not an afterthought.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30786639
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 --
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will 
continue to extend the
+library with more algorithms. An example of how simple it is to create a 
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+  .add(MultipleLinearRegression.Stepsize, 1.0)
+  .add(MultipleLinearRegression.Iterations, 10)
+  .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
+{% endhighlight %}
+
+The roadmap below can provide an indication of the algorithms we aim to 
implement in the coming
+months. Items in **bold** have already been implemented:
+
+
+* Pipelines of transformers and learners
+* Data pre-processing
+  * **Feature scaling**
+  * **Polynomial feature base mapper**
+  * Feature hashing
+  * Feature extraction for text
+  * Dimensionality reduction
+* Model selection and performance evaluation
+  * Cross-validation for model selection and evaluation
+* Supervised learning
+  * Optimization framework
+    * **Stochastic Gradient Descent**
+    * L-BFGS
+  * Generalized Linear Models
+    * **Multiple linear regression**
+    * LASSO, Ridge regression
+    * Multi-class Logistic regression
+  * Random forests
+  * **Support Vector Machines**
+* Unsupervised learning
+  * Clustering
+    * K-means clustering
+  * PCA
+* Recommendation
+  * **ALS**
+* Text analytics
+  * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
--- End diff --

That's a good idea, I'll add a link to the roadmap instead.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30786723
  
--- Diff: docs/libs/ml/index.md ---
@@ -30,7 +122,7 @@ under the License.
 /dependency
 {% endhighlight %}
 
-## Algorithms
+## Algorithm Documentation
--- End diff --

Yeah this becomes a bit problematic now. This text is not really what you 
expect from the index of a library, so maybe it should be moved to its own 
section (Vision and roadmap?) with only a very brief introduction remaining on 
the index.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30787594
  
--- Diff: docs/libs/ml/index.md ---
@@ -30,7 +122,7 @@ under the License.
 /dependency
 {% endhighlight %}
 
-## Algorithms
+## Algorithm Documentation
--- End diff --

I agree. The text would also make a good announcement blog post.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-104209821
  
5. How to contribute




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30787920
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 --
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
--- End diff --

I meant only the paragraph with the inspiration.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30787862
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 --
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data 
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan 
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data 
streams.
+
+FlinkML will allow data scientists to test their models locally and using 
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a 
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in 
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML 
pipelines, and Spark’s
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that 
scale with problem and
+cluster sizes.
--- End diff --

Hmm, I wonder whether stating it only in the code would be enough. Did 
@ktzoumas say that he wants that in the introductory text?




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/688#discussion_r30787534
  
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
 under the License.
 --
 
+The Machine Learning (ML) library for Flink is a new effort to bring 
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is 
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes 
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue 
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML 
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem 
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily 
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without 
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end 
we will be providing
+detailed documentation along with examples for every part of the system. 
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using 
familiar programming
+concepts and terminology.
--- End diff --

Then we should add another section to the outline, "How to contribute", 
where we state this. That is probably also a good place to describe how to 
implement a new pipeline operator with the implicit classes.
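
For readers unfamiliar with the pattern, here is a minimal sketch of how an operator can be attached to a type through implicits. It uses plain Scala collections rather than Flink's DataSet, and every name in it (`TransformOperation`, `Scaler`, `TransformerOps`) is made up for illustration; it is not taken from the FlinkML code base and may differ from what the library actually does.

```scala
// Hypothetical sketch: these names are illustrative, not FlinkML's API.
object PipelineSketch {
  // Type class describing how a transformer of type T maps In to Out.
  trait TransformOperation[T, In, Out] {
    def transform(instance: T, input: Seq[In]): Seq[Out]
  }

  // A toy transformer that multiplies every value by a constant factor.
  final case class Scaler(factor: Double)

  object Scaler {
    // Defining the instance in the companion puts it in implicit scope.
    implicit val scaleDoubles: TransformOperation[Scaler, Double, Double] =
      new TransformOperation[Scaler, Double, Double] {
        def transform(instance: Scaler, input: Seq[Double]): Seq[Double] =
          input.map(_ * instance.factor)
      }
  }

  // Implicit class that adds `transform` to any T that has an operation.
  implicit class TransformerOps[T](val instance: T) extends AnyVal {
    def transform[In, Out](input: Seq[In])(
        implicit op: TransformOperation[T, In, Out]): Seq[Out] =
      op.transform(instance, input)
  }

  def main(args: Array[String]): Unit = {
    val scaled = Scaler(2.0).transform(Seq(1.0, 2.5))
    println(scaled) // prints List(2.0, 5.0)
  }
}
```

Under this scheme a contributor would add a new operator by writing a case class plus an implicit operation instance in its companion object; calling `transform` on an input type that has no instance fails at compile time rather than at runtime.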




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-21 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-104282245
  
Added all the changes we discussed; this should be good to merge now.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-19 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/688#issuecomment-103413605
  
Pinging @tillrohrmann to take a look.




[GitHub] flink pull request: [FLINK-2034] [docs] Add vision and roadmap for...

2015-05-18 Thread thvasilo
GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/688

[FLINK-2034] [docs]  Add vision and roadmap for ML library to docs

We should have a document describing the vision of the Machine Learning 
library in Flink and an up-to-date roadmap.
This PR provides that, and the text can also be used on the website.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink ml-docs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/688.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #688


commit b5808bcc80f30bbc4e70dd91b0cfda47947a731d
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-05-18T13:52:56Z

Added vision and roadmap to ML docs.

Also added attribution for some of the LaTeX in the optimization framework.



