[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393293#comment-15393293
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user thvasilo commented on the issue:

https://github.com/apache/flink/pull/1220
  
Hello Daniel, sorry to bring this up months later, but I see that while the 
documentation exists, nothing links to it from the FlinkML index page. 
Would you care to create a new PR linking to the docs from the FlinkML 
docs landing page? Feel free to create an unsupervised learning category for 
this.


> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML, Starter
> Fix For: 1.1.0
>
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
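
For readers unfamiliar with the exact kNN search the issue refers to, here is a minimal single-machine sketch in plain Scala. It is not the H-BNLJ/H-BRJ algorithm from [2] and not the FlinkML implementation; `ExactKnnSketch`, `squaredDistance`, and `kNearest` are hypothetical names chosen for illustration. It keeps the k best candidates in a bounded max-heap while scanning the training points once.

```scala
import scala.collection.mutable

// Illustrative only: names and structure are assumptions, not FlinkML code.
object ExactKnnSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance; ranking by it matches ranking by Euclidean distance.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Returns the k training points closest to `query`, nearest first.
  def kNearest(training: Seq[Point], query: Point, k: Int): Seq[Point] = {
    require(k > 0, "k must be positive")
    // Max-heap on distance: the head is the worst of the current candidates.
    val heap = mutable.PriorityQueue.empty[(Double, Point)](
      Ordering.by[(Double, Point), Double](_._1))
    for (p <- training) {
      val d = squaredDistance(p, query)
      if (heap.size < k) {
        heap.enqueue((d, p))
      } else if (d < heap.head._1) {
        // The new point beats the current worst candidate, so swap them.
        heap.dequeue()
        heap.enqueue((d, p))
      }
    }
    heap.toList.sortBy(_._1).map(_._2)
  }
}
```

Each query costs one pass over the training set with O(log k) work per point, which is the brute-force baseline that block-based or tree-based variants try to improve on.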



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306720#comment-15306720
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-222507122
  
@chiwanpark the formatting did not work, see 
[screenshot](https://www.dropbox.com/s/psrercxcikozjgd/Screenshot%202016-05-30%2010.38.48.png?dl=0)
 in my community edition of IntelliJ.  According to the PR, there should be an 
"Import from IntelliJ IDEA code style XML" option.  I'm going to paste this 
same comment in the PR.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306631#comment-15306631
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

GitHub user danielblazevski opened a pull request:

https://github.com/apache/flink/pull/2050

[Flink-1934] Add approximative k-nearest-neighbours (kNN) algorithm to 
machine learning library

I added approximate kNN algorithms. In another PR, there are two exact 
methods, one basic algorithm using a priority queue and another using a 
quadtree (see: #1220 ).

For this PR, I added z-value based kNN and LSH (Locality Sensitive Hashing) 
based kNN. Z-values are good for low-to-moderate dimensions. For details, see 
the paper [2] linked on the JIRA issue for the exact version: 
https://issues.apache.org/jira/browse/FLINK-1745
https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
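
As a rough illustration of the z-value idea mentioned above: a z-value (Morton code) maps a multi-dimensional point to a single number by interleaving the bits of its coordinates, so points that are close in space tend to be close in the resulting one-dimensional order. The sketch below handles only 2D points with non-negative integer coordinates; `ZValueSketch` and `zValue` are hypothetical names, and this is not the code from the PR.

```scala
// Illustrative only: a 2D Morton code, assuming coordinates already scaled to 16-bit integers.
object ZValueSketch {
  // Bit i of x goes to bit 2i of the result, bit i of y to bit 2i + 1.
  def zValue(x: Int, y: Int): Long = {
    require(x >= 0 && x < (1 << 16) && y >= 0 && y < (1 << 16),
      "coordinates must fit in 16 bits")
    var z = 0L
    var i = 0
    while (i < 16) {
      z |= ((x >> i) & 1).toLong << (2 * i)
      z |= ((y >> i) & 1).toLong << (2 * i + 1)
      i += 1
    }
    z
  }
}
```

Sorting points by `zValue` and looking at a query's neighbours in that sorted order gives candidate neighbours cheaply. The result is approximate because some spatially close points end up far apart on the curve.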

The z-value approach isn't applicable for larger dimensions, so I used -- 
as the paper suggests -- a more standard LSH approach.
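
The PR text does not say which LSH scheme is used, so the following is only a sketch of one common family, random-hyperplane hashing, which groups points by the side of several random hyperplanes they fall on. Note that this family approximates angular rather than Euclidean closeness; `LshSketch`, `randomHyperplanes`, `signature`, and `buckets` are hypothetical names, not FlinkML API.

```scala
import scala.util.Random

// Illustrative only: random-hyperplane LSH, assumed here as a stand-in for "a standard LSH approach".
object LshSketch {
  type Point = Vector[Double]

  // Draws `numBits` random hyperplanes through the origin (Gaussian normal vectors).
  def randomHyperplanes(numBits: Int, dim: Int, seed: Long): Vector[Point] = {
    val rng = new Random(seed)
    Vector.fill(numBits)(Vector.fill(dim)(rng.nextGaussian()))
  }

  // One bit per hyperplane: true if the point lies on its non-negative side.
  def signature(p: Point, planes: Vector[Point]): Vector[Boolean] =
    planes.map(h => h.zip(p).map { case (a, b) => a * b }.sum >= 0.0)

  // Groups points by signature; only points sharing a bucket with a query are candidates.
  def buckets(points: Seq[Point], planes: Vector[Point]): Map[Vector[Boolean], Seq[Point]] =
    points.groupBy(signature(_, planes))
}
```

A kNN query then computes exact distances only within the query's bucket, trading some accuracy for far fewer distance computations in high dimensions.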

The paper describes a fairly sophisticated MapReduce (MR) design, which I 
did not use. Using the same MR design pattern as the exact method, I found 
a really good performance improvement! In JIRA, I ran this by @tillrohrmann, and 
he was OK with a less optimized version for now. Here is a link to a talk I 
recently gave on this, which includes links to the video and slides:
http://www.meetup.com/ny-scala/events/231163636/

Because both the LSH and z-value versions use the same MR design pattern as the 
exact versions, I reformatted the codebase from the PR for the exact version a bit 
to make it more modular.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/danielblazevski/flink FLINK-1934

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/2050.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2050


commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park 
Date:   2015-06-30T08:41:25Z

[FLINK-1745] [ml] Add exact k-nearest-neighbor join

commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski 
Date:   2015-09-15T21:49:05Z

modified kNN test to familiarize with Flink and KNN.scala

commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski 
Date:   2015-09-15T21:49:17Z

modified kNN test to familiarize with Flink and KNN.scala

commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski 
Date:   2015-09-16T01:26:57Z

nightly commit, minor changes:  got the filter to work, working on mapping 
the training set to include box labels

commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski 
Date:   2015-09-22T02:02:51Z

commit before incorporating quadtree

commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski 
Date:   2015-09-22T15:49:25Z

did a basic import of QuadTree and Test; to-do:  modify QuadTree to allow 
KNN.scala to make use of

commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski 
Date:   2015-09-22T21:03:04Z

transferred ListBuffers for objects in leaf nodes to Vectors

commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski 
Date:   2015-09-23T00:41:29Z

basic test on 2D unit box seems to work -- need to generalize, e.g. to 
include automated bounding box

commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski 
Date:   2015-09-23T15:03:06Z

had to debug quadtree -- back to testing 2D

commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski 
Date:   2015-09-23T15:22:46Z

Testing 2D looks good, strong improvement in run time compared to 
brute-force method

commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski 
Date:   2015-09-23T19:04:52Z

added automated detection of bounding box based on min/max values of both 
training and test sets

commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski 
Date:   2015-09-24T00:28:56Z

added automated radius about test point to define localized neighborhood, 
result runs.  TO-DO:  Lots of tests

commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski 
Date:   2015-09-24T00:54:06Z

Note for future:  previous commit passed the test Chiwan Park had in the initial 
knn implementation

commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306630#comment-15306630
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski closed the pull request at:

https://github.com/apache/flink/pull/2048




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306619#comment-15306619
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

GitHub user danielblazevski reopened a pull request:

https://github.com/apache/flink/pull/2048


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306592#comment-15306592
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-222482793
  
@danielblazevski I'm using a custom configuration of IntelliJ code 
formatter. There is a pending [pull 
request](https://github.com/apache/flink/pull/1963/files) about code formatting 
in IntelliJ. This might be helpful.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306586#comment-15306586
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-222481628
  
Thanks @chiwanpark !  I saw that you changed the formatting of the code.  
Did you automatically do this in IntelliJ?  I've been using `cmd + alt + shift 
+ L` in IntelliJ, and the formatting is a bit different (and not as nice).




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306533#comment-15306533
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/1220




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306163#comment-15306163
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski closed the pull request at:

https://github.com/apache/flink/pull/2048




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306122#comment-15306122
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

GitHub user danielblazevski opened a pull request:

https://github.com/apache/flink/pull/2048


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291134#comment-15291134
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63882747
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  * val trainingDS: DataSet[Vector] = ...
+  * val testingDS: DataSet[Vector] = ...
+  *
+  * val knn = KNN()
+  *   .setK(10)
+  *   .setBlocks(5)
+  *   .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  * knn.fit(trainingDS)
+  *
+  * val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that specifies whether or not to use a Quadtree to partition
+  * the training set to potentially simplify the KNN search.  If no value is specified, the
+  * code will automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales
+  * well with the number of training and testing points, though poorly with the dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small
+  * this should be `CrossHint.FIRST_IS_SMALL` and set to `CrossHint.SECOND_IS_SMALL`
+  * if the test set is small.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290890#comment-15290890
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63856740
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that determines whether or not to use a Quadtree to partition
+  * the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or the test set is small, to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small,
+  * this should be set to `CrossHint.FIRST_IS_SMALL`; if the test set is small,
+  * it should be set to `CrossHint.SECOND_IS_SMALL`.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15290321#comment-15290321
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63815357
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  * val trainingDS: DataSet[Vector] = ...
+  * val testingDS: DataSet[Vector] = ...
+  *
+  * val knn = KNN()
+  *   .setK(10)
+  *   .setBlocks(5)
+  *   .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  * knn.fit(trainingDS)
+  *
+  * val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that determines whether or not to use a Quadtree to partition
+  * the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or the test set is small, to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small,
+  * this should be set to `CrossHint.FIRST_IS_SMALL`; if the test set is small,
+  * it should be set to `CrossHint.SECOND_IS_SMALL`.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289259#comment-15289259
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-220083900
  
The PR looks good to me. The only thing that would be good to get rid of 
is the requirement that you have to select a Euclidean distance for the 
quadtree. Maybe there is some other characteristic of a distance measure that 
says whether it's applicable to quadtrees or not. Then we could introduce a 
new distance metric type to make sure that only appropriate distance measures 
are used. But this should not be a blocker for merging this PR. 

Thanks for your contribution @danielblazevski. Really good work :-)
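
The idea of a dedicated distance metric type can be illustrated with a small, self-contained sketch. Everything below is hypothetical and is not part of FlinkML: the names `Metric`, `BoxPrunableMetric`, `Euclidean`, `Cosine` and `PrunableTree` are invented for illustration, and plain `Seq[Double]` stands in for FlinkML's `Vector`. The assumption is that a metric is suitable for the quadtree if it can give a lower bound on the distance from a query point to an axis-aligned box.

```scala
// Hypothetical sketch: none of these names exist in FlinkML.
// Plain Seq[Double] stands in for FlinkML's Vector to keep the example self-contained.

/** Generic distance measure. */
trait Metric {
  def distance(a: Seq[Double], b: Seq[Double]): Double
}

/** Metrics that can lower-bound the distance from a point to an axis-aligned box. */
trait BoxPrunableMetric extends Metric {
  def minDistToBox(query: Seq[Double], min: Seq[Double], max: Seq[Double]): Double
}

object Euclidean extends BoxPrunableMetric {
  def distance(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Per-dimension gap between the query and the box; zero where the query lies inside the slab.
  def minDistToBox(query: Seq[Double], min: Seq[Double], max: Seq[Double]): Double =
    math.sqrt(query.indices.map { i =>
      val gap =
        if (query(i) < min(i)) min(i) - query(i)
        else if (query(i) > max(i)) query(i) - max(i)
        else 0.0
      gap * gap
    }.sum)
}

/** A metric that only implements Metric, so it carries no box bound in this sketch. */
object Cosine extends Metric {
  def distance(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    1.0 - dot / norm
  }
}

/** The tree only accepts metrics of the narrower type. */
final class PrunableTree(min: Seq[Double], max: Seq[Double], metric: BoxPrunableMetric) {
  def lowerBound(query: Seq[Double]): Double = metric.minDistToBox(query, min, max)
}
```

With this shape, `new PrunableTree(Seq(0.0, 0.0), Seq(1.0, 1.0), Euclidean)` type-checks, while passing `Cosine` would be rejected by the compiler instead of failing at runtime.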




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289208#comment-15289208
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63732878
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289209#comment-15289209
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63733018
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289051#comment-15289051
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63712405
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289072#comment-15289072
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-220047418
  
Thanks @tillrohrmann, made changes as per your suggestions.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289065#comment-15289065
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63714951
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
--- End diff --

Indeed, nice!
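
For reference, the box test in the quoted diff counts the dimensions that pass and compares the count with the vector size. A more direct way to express the same check is `forall`, sketched below as a stand-alone function. This is only an illustration under stated assumptions: the object name `BoxOverlap` is invented, plain `Seq[Double]` replaces FlinkML's `Vector`, and it is not claimed to be the change that was actually made in the PR.

```scala
// Hypothetical stand-alone version of the overlap test from the diff above.
// Seq[Double] replaces FlinkML's Vector; BoxOverlap is an invented name.
object BoxOverlap {

  /** True if queryPoint, widened by radius, intersects the box in every dimension. */
  def overlap(
      center: Seq[Double],
      width: Seq[Double],
      queryPoint: Seq[Double],
      radius: Double): Boolean = {
    // forall stops at the first dimension that fails, so no intermediate collection is built
    queryPoint.indices.forall { i =>
      (queryPoint(i) - radius < center(i) + width(i) / 2) &&
        (queryPoint(i) + radius > center(i) - width(i) / 2)
    }
  }
}
```

With a box centered at the origin and width 2 in both dimensions, the point (0.5, 0.5) overlaps with radius 0, while (3.0, 0.0) does not overlap with radius 0.5.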




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289053#comment-15289053
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63712701
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that determines whether or not to use a Quadtree to partition
+  * the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or the test set is small, to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small,
+  * this should be set to `CrossHint.FIRST_IS_SMALL`; if the test set is small,
+  * it should be set to `CrossHint.SECOND_IS_SMALL`.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289060#comment-15289060
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63713235
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the code will
+  * automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales well
+  * with the number of training and testing points, though poorly with the dimension.
+  * (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  Set this to `CrossHint.FIRST_IS_SMALL`
+  * if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
+  * (Default value: '''None''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
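
For readers skimming the diff, the join described in the Scaladoc above can be illustrated with a brute-force sketch on local collections. This is not the pull request's implementation: `KnnJoinSketch`, `Point`, and `knnJoin` are hypothetical names, plain Scala vectors stand in for FlinkML vectors and `DataSet`s, and squared Euclidean distance is assumed as the metric.

```scala
// Illustrative sketch only; names are hypothetical and nothing here uses Flink.
object KnnJoinSketch {
  type Point = Vector[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredEuclidean(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  /** For every test point, returns the k closest training points, nearest first. */
  def knnJoin(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] = {
    require(k > 0, "K must be positive.")
    testing.map { q =>
      // Sort all training points by distance to q and keep the first k.
      q -> training.sortBy(p => squaredEuclidean(p, q)).take(k)
    }
  }
}
```

Calling `KnnJoinSketch.knnJoin(training, testing, 2)` pairs each test point with its two closest training points. Every training point is compared with every test point, which is the cross product that the `SizeHint` parameter refers to.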

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289057#comment-15289057
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63712972
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the code will
+  * automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales well
+  * with the number of training and testing points, though poorly with the dimension.
+  * (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  Set this to `CrossHint.FIRST_IS_SMALL`
+  * if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
+  * (Default value: '''None''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
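
The `Blocks` parameter documented above splits the input into blocks. One way to see why that still yields an exact result is the sketch below: each block contributes its own k best candidates, and the global k nearest neighbors are then selected from the merged candidates. This is an assumption-laden illustration on local collections, not the code in the pull request; `BlockedKnnSketch`, `split`, and `knn` are hypothetical names and squared Euclidean distance is assumed.

```scala
// Illustrative sketch only; names are hypothetical and nothing here uses Flink.
object BlockedKnnSketch {
  type Point = Vector[Double]

  def squaredEuclidean(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  /** Splits `points` into at most `blocks` chunks of roughly equal size. */
  def split(points: Seq[Point], blocks: Int): List[Seq[Point]] = {
    require(blocks > 0, "Number of blocks must be positive.")
    if (points.isEmpty) Nil
    else {
      val size = math.ceil(points.size.toDouble / blocks).toInt
      points.grouped(size).toList
    }
  }

  /** Exact k nearest neighbors of `query`, computed block by block. */
  def knn(training: Seq[Point], query: Point, k: Int, blocks: Int): Seq[Point] = {
    require(k > 0, "K must be positive.")
    // Local step: every block keeps only its k closest points.
    val candidates = split(training, blocks).flatMap { block =>
      block.map(p => (squaredEuclidean(p, query), p)).sortBy(_._1).take(k)
    }
    // Merge step: the global top k is contained in the union of local top k lists.
    candidates.sortBy(_._1).take(k).map(_._2)
  }
}
```

The merge step is correct because any point among the global k nearest is also among the k nearest within its own block, so no true neighbor is discarded locally.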

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289049#comment-15289049
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63712157
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query).
+ * The skeleton of the data structure was initially
+ * based on the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adapted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric 
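
The `minDist` computation quoted above is cut off in this archive, so the following standalone sketch only restates the per-dimension logic that is visible in the diff. Plain arrays stand in for FlinkML vectors, `MinDistSketch` and its method names are hypothetical, and taking the square root for the Euclidean case is an assumption about how the truncated method finishes.

```scala
// Illustrative sketch only; names are hypothetical and nothing here uses Flink.
object MinDistSketch {

  /** Squared distance from `query` to the closest point of an axis-aligned box
    * given by its per-dimension `center` and `width`.
    */
  def minDistSquared(query: Array[Double], center: Array[Double], width: Array[Double]): Double = {
    require(query.length == center.length && center.length == width.length,
      "all inputs must have the same dimension")
    query.indices.map { i =>
      val lower = center(i) - width(i) / 2
      val upper = center(i) + width(i) / 2
      if (query(i) < lower) math.pow(lower - query(i), 2)
      else if (query(i) > upper) math.pow(query(i) - upper, 2)
      else 0.0 // the query lies within the box along this axis
    }.sum
  }

  /** Euclidean variant (assumed): the square root of the squared value. */
  def minDist(query: Array[Double], center: Array[Double], width: Array[Double]): Double =
    math.sqrt(minDistSquared(query, center, width))
}
```

For a box centered at the origin with width 2 in each dimension, the query point (4, 5) is 3 and 4 units outside the box along the two axes, so the squared value is 25 and the Euclidean value is 5. A query point inside the box gives 0.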

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289048#comment-15289048
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63712033
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query).
+ * The skeleton of the data structure was initially
+ * based on the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adapted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric 
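
The `overlap` and `contains` checks in the diff above count the dimensions that pass the test and compare the count with the dimension. The sketch below expresses the same predicate with `forall`, which stops at the first failing dimension. It is a standalone illustration, not a proposed change to the pull request: `OverlapSketch` is a hypothetical name and plain arrays stand in for FlinkML vectors.

```scala
// Illustrative sketch only; names are hypothetical and nothing here uses Flink.
object OverlapSketch {

  /** True if the cube of half-width `radius` around `query` intersects the box
    * described by `center` and `width` in every dimension (strict inequalities,
    * as in the diff).
    */
  def overlap(
      query: Array[Double],
      radius: Double,
      center: Array[Double],
      width: Array[Double]): Boolean = {
    require(query.length == center.length && center.length == width.length,
      "all inputs must have the same dimension")
    query.indices.forall { i =>
      (query(i) - radius < center(i) + width(i) / 2) &&
        (query(i) + radius > center(i) - width(i) / 2)
    }
  }

  /** True if `query` lies strictly inside the box. */
  def contains(query: Array[Double], center: Array[Double], width: Array[Double]): Boolean =
    overlap(query, 0.0, center, width)
}
```

Because the comparisons are strict, a point lying exactly on a face of the box is reported as not contained. With a box centered at the origin and width 2, the point (2, 0) is outside, but it overlaps the box once the radius is 1.5.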

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289045#comment-15289045
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63711829
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query).
+ * The skeleton of the data structure was initially
+ * based on the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squared Euclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adapted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289035#comment-15289035
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63710693
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the code will
+  * automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales well
+  * with the number of training and testing points, though poorly with the dimension.
+  * (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  Set this to `CrossHint.FIRST_IS_SMALL`
+  * if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
+  * (Default value: '''None''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289028#comment-15289028
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63709829
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the code will
+  * automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales well
+  * with the number of training and testing points, though poorly with the dimension.
+  * (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  Set this to `CrossHint.FIRST_IS_SMALL`
+  * if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
+  * (Default value: '''None''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289022#comment-15289022
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63709449
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small
+  * this should be `CrossHint.FIRST_IS_SMALL`; if the test set is small it should be
+  * `CrossHint.SECOND_IS_SMALL`.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289014#comment-15289014
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63708494
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,145 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to } b \text{ in } A \}
--- End diff --

done
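
For reference, the definition in the docs hunk above can be sketched by brute force in plain Scala. This is only an illustration of the formula, not the blocked/cross implementation in the PR; `KnnJoinSketch`, `knnJoin` and `sqDist` are hypothetical names, and squared Euclidean distance is assumed as the metric.

```scala
object KnnJoinSketch {
  // Plain Scala vectors stand in for the FlinkML Vector type (assumption).
  type Point = Vector[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** For every b in `testing`, pairs b with its k nearest points in `training`. */
  def knnJoin(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] = {
    require(k > 0, "K must be positive.")
    testing.map { b =>
      // Sort all training points by distance to b and keep the first k.
      (b, training.sortBy(a => sqDist(a, b)).take(k))
    }
  }
}
```

This costs a full sort of the training set per test point, which is the cost the blocking and Quadtree options in the PR are meant to reduce.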




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289002#comment-15289002
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63707341
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288993#comment-15288993
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63706266
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288978#comment-15288978
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63704597
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288977#comment-15288977
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63704440
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than minDist
+ * (minDist adopted from "Nearest Neighbors Queries" by N. 
Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288959#comment-15288959
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63702930
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that specifies whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small
+  * this should be `CrossHint.FIRST_IS_SMALL`; if the test set is small it should be
+  * `CrossHint.SECOND_IS_SMALL`.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288970#comment-15288970
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63703823
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,352 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+/** Tests whether the queryPoint is in the node, or a child of that 
node
+  *
+  * @param queryPoint
+  * @return
+  */
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
--- End diff --

This condition could be written more succinctly via `(0 until queryPoint.size).forall{...}`
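
A minimal standalone sketch of that suggestion, assuming plain `Array[Double]` in place of the FlinkML `Vector` and passing `center` and `width` as arguments (in the PR they are fields of `Node`); `OverlapSketch` is a hypothetical name used only here:

```scala
object OverlapSketch {
  /** True if, in every dimension, the interval of half-width `radius` around
    * `queryPoint` intersects the box described by `center` and `width`.
    */
  def overlap(
      queryPoint: Array[Double],
      center: Array[Double],
      width: Array[Double],
      radius: Double): Boolean = {
    // forall replaces the filter/size/compare pattern and stops at the
    // first dimension that fails the check.
    (0 until queryPoint.length).forall { i =>
      (queryPoint(i) - radius < center(i) + width(i) / 2) &&
        (queryPoint(i) + radius > center(i) - width(i) / 2)
    }
  }
}
```

Besides being shorter, `forall` short-circuits, whereas `filter { ... }.size` always visits every dimension and allocates an intermediate collection.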




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288960#comment-15288960
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63703023
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k`-nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that whether or not to use a Quadtree to partition the training set
+  * to potentially simplify the KNN search.  If no value is specified, the code will
+  * automatically decide whether or not to use a Quadtree.  Use of a Quadtree scales well
+  * with the number of training and testing points, though poorly with the dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize the cross
+  * product operation needed for the KNN search.  If the training set is small
+  * this should be `CrossHint.FIRST_IS_SMALL` and set to `CrossHint.SECOND_IS_SMALL`
+  * if the test set is small.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+    * @param k the number of selected points as neighbors
+    */
+  def setK(k: Int): KNN = {
+    require(k > 0, "K must be positive.")
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288955#comment-15288955
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63702415
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288949#comment-15288949
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63701881
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288930#comment-15288930
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63698942
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
   

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288925#comment-15288925
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63698367
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,145 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) are the k-nearest points to b in A \}
+$$
+
+The brute-force approach is to compute the distance between every training and testing point.  To ease the brute-force computation of computing the distance between every traning point a quadtree is used.  The quadtree scales well in the number of training points, though poorly in the spatial dimension.  The algorithm will automatically choose whether or not to use the quadtree, though the user can override that decision by setting a parameter to force use or not use a quadtree. 
+
+##Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operation.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's `Vector` the corresponding 
class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Paremeters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest-neighbors to search for.  That 
is, for each test point, the algorithm finds the K-nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+DistanceMetric
+
+  
+Sets the distance metric we use to calculate the distance 
between two points. If no metric is specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+(Default value: EuclideanDistanceMetric)
+  
+
+  
+  
+Blocks
+
+  
+Sets the number of blocks into which the input data will be 
split. This number should be set
+at least to the degree of parallelism. If no value is 
specified, then the parallelism of the
+input [[DataSet]] is used as the number of blocks.
+(Default value: None)
+  
+
+  
+  
+UseQuadTreeParam
+
+  
+ A boolean variable that whether or not to use a Quadtree to 
partition the training set to potentially simplify the KNN search.  If no value 
is specified, the code will automatically decide whether or not to use a 
Quadtree.  Use of a Quadtree scales well with the number of training and 
testing points, though poorly with the dimension.
+(Default value: None)
--- End diff --

Sorry, my bad. I didn't read your description properly.
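
The brute-force approach mentioned in the quoted description can be sketched in plain Scala as follows. This is only an illustration on local collections, not the FlinkML `DataSet` implementation from the PR; `BruteForceKnn`, `Point`, `squaredDistance` and `knn` are names made up for this sketch.

```scala
// Illustrative sketch only: local collections instead of Flink DataSets.
object BruteForceKnn {
  type Point = Vector[Double]

  // Squared Euclidean distance; sorting by it gives the same order as
  // sorting by the Euclidean distance itself.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "Points must have the same dimension.")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // For each test point b, pair b with the k training points closest to it.
  // Every (training, testing) pair is compared, hence "brute force".
  def knn(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] = {
    require(k > 0, "K must be positive.")
    testing.map { b =>
      (b, training.sortBy(a => squaredDistance(a, b)).take(k))
    }
  }
}
```

The cost grows with the product of the two set sizes, which is the work a spatial index such as a quadtree tries to reduce.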





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288923#comment-15288923
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63698264
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,145 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) are the k-nearest points to b in A \}
+$$
+
+The brute-force approach is to compute the distance between every training and testing point.  To ease the brute-force computation of computing the distance between every traning point a quadtree is used.  The quadtree scales well in the number of training points, though poorly in the spatial dimension.  The algorithm will automatically choose whether or not to use the quadtree, though the user can override that decision by setting a parameter to force use or not use a quadtree. 
+
+##Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operation.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's `Vector` the corresponding 
class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Paremeters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest-neighbors to search for.  That 
is, for each test point, the algorithm finds the K-nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+DistanceMetric
+
+  
+Sets the distance metric we use to calculate the distance 
between two points. If no metric is specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+(Default value: EuclideanDistanceMetric)
+  
+
+  
+  
+Blocks
+
+  
+Sets the number of blocks into which the input data will be 
split. This number should be set
+at least to the degree of parallelism. If no value is 
specified, then the parallelism of the
+input [[DataSet]] is used as the number of blocks.
+(Default value: None)
+  
+
+  
+  
+UseQuadTreeParam
+
+  
+ A boolean variable that whether or not to use a Quadtree to 
partition the training set to potentially simplify the KNN search.  If no value 
is specified, the code will automatically decide whether or not to use a 
Quadtree.  Use of a Quadtree scales well with the number of training and 
testing points, though poorly with the dimension.
+(Default value: None)
--- End diff --

A boolean value should be either true or false
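
For context, the parameter text quoted above describes three states rather than two: unset (the code decides), forced on, and forced off. A minimal sketch of that behaviour with an `Option[Boolean]` is shown below. `QuadTreeChoice`, `useQuadTree` and the dimension threshold are assumptions made for illustration, not the decision rule used in the PR.

```scala
// Hypothetical stand-in for the parameter handling; not FlinkML code.
object QuadTreeChoice {
  // None        -> let the algorithm decide
  // Some(true)  -> user forces the quadtree on
  // Some(false) -> user forces the quadtree off
  // The dimension threshold is an assumption made purely for illustration.
  def useQuadTree(userChoice: Option[Boolean], dimension: Int, maxDimension: Int = 4): Boolean =
    userChoice.getOrElse(dimension <= maxDimension)
}
```

With this shape, a default of `None` means "no user preference", while the value that is finally used is still a plain `true` or `false`.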




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288914#comment-15288914
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r63697916
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,145 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) are the k-nearest points to b in A \}
--- End diff --

`k` is missing in the first `KNN(b, A)`; it should read `KNN(b, A, k)`.
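
With `k` carried through both occurrences, the definition from the quoted docs could be typeset along these lines. This is one possible rendering, not necessarily the wording that ended up in the merged documentation:

```latex
$$
KNN(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \mid b \in B \},
\quad \text{where } KNN(b, A, k) \text{ are the } k \text{ nearest points to } b \text{ in } A
$$
```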




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-05-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286499#comment-15286499
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-219702506
  
Great to hear that z-knn is almost done! If you think the implementation 
is in good shape, do not hesitate to open a pull request.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261945#comment-15261945
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-215384494
  
@chiwanpark Thanks for the comments.  I made all the changes except making 
`makeChildren` private since that is in the `Node` class and is called in the 
`Quadtree` class outside of the `Node` class.  Since it's still public, I added 
Scaladocs as per your comment.
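
The visibility constraint described here follows from Scala's access rules: a plain `private` member of an inner class is not accessible from the enclosing class. One middle ground is a qualified modifier such as `private[QuadTree]`. The sketch below uses a heavily simplified, hypothetical `QuadTree` and `Node` (always four children, no bounding boxes) and is not the code from the PR.

```scala
// Simplified, hypothetical shapes; the real QuadTree in the PR is more involved.
class QuadTree {
  class Node(val depth: Int) {
    var children: Seq[Node] = Nil

    // Qualified private: callable from anywhere inside QuadTree,
    // but hidden from code outside the enclosing class.
    private[QuadTree] def makeChildren(): Unit = {
      children = Seq.fill(4)(new Node(depth + 1))
    }
  }

  val root = new Node(0)

  // Called in QuadTree, outside of Node, which a plain `private` would forbid.
  def split(): Int = {
    root.makeChildren()
    root.children.size
  }
}
```

Keeping the method public with Scaladocs, as done in the PR, is equally valid; the qualified modifier only narrows who can call it.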




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261882#comment-15261882
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-215369160
  
@danielblazevski Sorry for the late check. I reviewed your PR and have a few 
comments. Once they are addressed, I would like to merge this.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261852#comment-15261852
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61398228
  
--- Diff: 
flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/classification/Classification.scala
 ---
@@ -131,3 +131,6 @@ object Classification {
 
   val expectedWeightVector = DenseVector(-1.95, -3.45)
 }
+
+
+
--- End diff --

Are these new lines necessary?




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261850#comment-15261850
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61398096
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  val count = (0 until queryPoint.size).filter { i =>
+(queryPoint(i) - radius < center(i) + width(i) / 2) &&
+  (queryPoint(i) + radius > center(i) - width(i) / 2)
+  }.size
+
+  count == queryPoint.size
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(
+  queryPoint: Vector,
+  radius: Double): Boolean = {
+  minDist(queryPoint) < radius
+}
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adapted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+def minDist(queryPoint: Vector): Double = {
+  val minDist = (0 until queryPoint.size).map { i =>
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) + width(i) / 2, 2)
+} else if (queryPoint(i) > center(i) + width(i) / 2) {
+  math.pow(queryPoint(i) - center(i) - width(i) / 2, 2)
+} else {
+  0
+}
+  }.sum
+
+  distMetric match {
+case _: SquaredEuclideanDistanceMetric => minDist
+case _: EuclideanDistanceMetric => math.sqrt(minDist)
+case _ => throw 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261849#comment-15261849
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61398006
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala 
---
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest 
coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(
+  minVec: Vector,
+  maxVec: Vector,
+  distMetric: DistanceMetric,
+  maxPerBox: Int) {
+
+  class Node(
+center: Vector,
+width: Vector,
+var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
--- End diff --

Could you add a scaladoc for this method? All public methods should have a 
scaladoc.
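
For illustration only, here is a minimal, self-contained sketch of what 
documented versions of these `Node` methods might look like. `Box` and 
`minDistSquared` are hypothetical names that do not appear in the PR, and plain 
`Seq[Double]` stands in for FlinkML's `Vector` so that the snippet compiles 
without Flink on the classpath. The logic mirrors `overlap`, `contains` and the 
squared-Euclidean branch of `minDist` from the diff.

```scala
/** Axis-aligned box described by its center and its full width per dimension.
  *
  * Hypothetical stand-in for QuadTree#Node; uses Seq[Double] instead of FlinkML's Vector.
  *
  * @param center center of the box
  * @param width full side length of the box in each dimension
  */
final case class Box(center: Seq[Double], width: Seq[Double]) {
  require(center.length == width.length, "center and width must have the same dimension")

  /** Tests whether the cube of half-side `radius` around `queryPoint` overlaps this box.
    *
    * @param queryPoint point to test; must have the same dimension as the box
    * @param radius half-side length of the cube around `queryPoint`
    * @return true if the two ranges overlap strictly in every dimension
    */
  def overlap(queryPoint: Seq[Double], radius: Double): Boolean =
    queryPoint.indices.forall { i =>
      (queryPoint(i) - radius < center(i) + width(i) / 2) &&
        (queryPoint(i) + radius > center(i) - width(i) / 2)
    }

  /** Tests whether `queryPoint` lies strictly inside this box.
    *
    * @param queryPoint point to test; must have the same dimension as the box
    * @return true if the point is inside the box, false if it is outside or on the boundary
    */
  def contains(queryPoint: Seq[Double]): Boolean = overlap(queryPoint, 0.0)

  /** Squared Euclidean distance from `queryPoint` to the closest point of this box.
    *
    * @param queryPoint point to measure from
    * @return 0.0 if the point is inside the box, otherwise the sum of squared per-dimension gaps
    */
  def minDistSquared(queryPoint: Seq[Double]): Double =
    queryPoint.indices.map { i =>
      val lower = center(i) - width(i) / 2
      val upper = center(i) + width(i) / 2
      val gap =
        if (queryPoint(i) < lower) lower - queryPoint(i)
        else if (queryPoint(i) > upper) queryPoint(i) - upper
        else 0.0
      gap * gap
    }.sum
}
```

Because the comparisons are strict, a point lying exactly on a face of the box 
is not reported as contained; that behaviour is probably worth stating in the 
scaladoc as well.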




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261845#comment-15261845
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397870
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that specifies whether or not to use a Quadtree to partition 
the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize 
the cross
+  * product operation needed for the KNN search.  If the training set is 
small
+  * this should be `CrossHint.FIRST_IS_SMALL` and set to 
`CrossHint.SECOND_IS_SMALL`
+  * if the test set is small.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261842#comment-15261842
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397754
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  * .setK(10)
+  * .setBlocks(5)
+  * .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.UseQuadTreeParam]]
+  * A boolean variable that specifies whether or not to use a Quadtree to partition 
the training set
+  * to potentially simplify the KNN search.  If no value is specified, the 
code will
+  * automatically decide whether or not to use a Quadtree.  Use of a 
Quadtree scales well
+  * with the number of training and testing points, though poorly with the 
dimension.
+  * (Default value:  ```None```)
+  *
+  * - [[org.apache.flink.ml.nn.KNN.SizeHint]]
+  * Specifies whether the training set or test set is small to optimize 
the cross
+  * product operation needed for the KNN search.  If the training set is 
small
+  * this should be `CrossHint.FIRST_IS_SMALL` and set to 
`CrossHint.SECOND_IS_SMALL`
+  * if the test set is small.
+  * (Default value:  ```None```)
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261839#comment-15261839
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397659
  
--- Diff: 
flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric, 
DistanceMetric,
+EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for 
each point in the test set.
--- End diff --

Calculates the _`k`-nearest_ neighbor points ...
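
To make the sentence under discussion concrete, here is a brute-force sketch of 
the join it describes, written against plain Scala collections rather than 
Flink's `DataSet`. `BruteForceKnn`, `squaredEuclidean` and `knnJoin` are 
hypothetical names, and this is not the blocked or quadtree-based 
implementation in the PR; it only shows the expected result shape of 
(test point, k-nearest training points).

```scala
object BruteForceKnn {
  type Point = Vector[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def squaredEuclidean(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  /** For each test point, returns the k training points closest to it.
    *
    * Sorts the whole training set per test point, so the cost is
    * O(|testing| * |training| * log |training|); fine for a sketch only.
    */
  def knnJoin(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] = {
    require(k > 0, "K must be positive.")
    testing.map { b =>
      b -> training.sortBy(a => squaredEuclidean(a, b)).take(k)
    }
  }
}
```

Squared distances are enough here because ordering by squared Euclidean 
distance gives the same neighbors as ordering by Euclidean distance.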




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261836#comment-15261836
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397574
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to } b \text{ in } A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To reduce the cost of computing all of these distances, a 
quadtree is used.  The quadtree scales well in the number of training points, 
though poorly in the spatial dimension.  The algorithm will automatically 
choose whether or not to use the quadtree, though the user can override that 
decision by setting a parameter to force or disable use of the quadtree. 
+
+## Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+For all subtypes of FlinkML's `Vector`, KNN predicts the corresponding 
k-nearest training points:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Parameters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest neighbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+ DistanceMetric
+
+  
+Sets the distance metric we use to calculate the distance 
between two points. If no metric is specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+(Default value:  EuclideanDistanceMetric )
+  
+
+  
+  
+Blocks
+
+  
+Sets the number of blocks into which the input data will be 
split. This number should be set
+at least to the degree of parallelism. If no value is 
specified, then the parallelism of the
+input [[DataSet]] is used as the number of blocks.
+(Default value: None)
+  
+
+  
+  
+UseQuadTreeParam
+
+  
+ A boolean variable that specifies whether or not to use a Quadtree to 
partition the training set to potentially simplify the KNN search.  If no value 
is specified, the code will automatically decide whether or not to use a 
Quadtree.  Use of a Quadtree scales well with the number of training and 
testing points, though poorly with the dimension.
+(Default value: None)
+  
+
+  
+  
+SizeHint
+
+  Specifies whether the training set or test set is small to 
optimize the cross product operation needed for the KNN search.  If the 
training set is small this should be `CrossHint.FIRST_IS_SMALL` and set to 
`CrossHint.SECOND_IS_SMALL` if the test set is small.
+ (Default value: None)
+  
+
+  
+
+  
+
+## Examples
+
+{% highlight scala %}
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.classification.Classification
+import org.apache.flink.ml.math.DenseVector
+import org.apache.flink.ml.metrics.distances.
+SquaredEuclideanDistanceMetric
+
+  val env = ExecutionEnvironment.getExecutionEnvironment
+
+  // prepare data
+  val trainingSet = 
env.fromCollection(Classification.trainingData).map(_.vector)
+  val testingSet = env.fromElements(DenseVector(0.0, 0.0))
+
+ val knn = KNN()
+.setK(3)
+.setBlocks(10)
+.setDistanceMetric(SquaredEuclideanDistanceMetric())
+

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261834#comment-15261834
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397472
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to } b \text{ in } A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To reduce the cost of computing all of these distances, a 
quadtree is used.  The quadtree scales well in the number of training points, 
though poorly in the spatial dimension.  The algorithm will automatically 
choose whether or not to use the quadtree, though the user can override that 
decision by setting a parameter to force or disable use of the quadtree. 
+
+## Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+For all subtypes of FlinkML's `Vector`, KNN predicts the corresponding 
k-nearest training points:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Parameters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest neighbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+ DistanceMetric
+
+  
+Sets the distance metric we use to calculate the distance 
between two points. If no metric is specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+(Default value:  EuclideanDistanceMetric )
+  
+
+  
+  
+Blocks
+
+  
+Sets the number of blocks into which the input data will be 
split. This number should be set
+at least to the degree of parallelism. If no value is 
specified, then the parallelism of the
+input [[DataSet]] is used as the number of blocks.
+(Default value: None)
+  
+
+  
+  
+UseQuadTreeParam
+
+  
+ A boolean variable that specifies whether or not to use a Quadtree to 
partition the training set to potentially simplify the KNN search.  If no value 
is specified, the code will automatically decide whether or not to use a 
Quadtree.  Use of a Quadtree scales well with the number of training and 
testing points, though poorly with the dimension.
+(Default value: None)
+  
+
+  
+  
+SizeHint
+
+  Specifies whether the training set or test set is small to 
optimize the cross product operation needed for the KNN search.  If the 
training set is small this should be `CrossHint.FIRST_IS_SMALL` and set to 
`CrossHint.SECOND_IS_SMALL` if the test set is small.
+ (Default value: None)
+  
+
+  
+
+  
+
+## Examples
+
+{% highlight scala %}
+import 
org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.classification.Classification
+import org.apache.flink.ml.math.DenseVector
+import org.apache.flink.ml.metrics.distances.
+SquaredEuclideanDistanceMetric
--- End diff --

Could you move this line to end of previous line?



[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261833#comment-15261833
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397325
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to } b \text{ in } A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To reduce the cost of computing all of these distances, a 
quadtree is used.  The quadtree scales well in the number of training points, 
though poorly in the spatial dimension.  The algorithm will automatically 
choose whether or not to use the quadtree, though the user can override that 
decision by setting a parameter to force or disable use of the quadtree. 
+
+## Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operations.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+For all subtypes of FlinkML's `Vector`, KNN predicts the corresponding 
k-nearest training points:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Parameters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest neighbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+ DistanceMetric
+
+  
+Sets the distance metric we use to calculate the distance 
between two points. If no metric is specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+(Default value:  EuclideanDistanceMetric )
--- End diff --

Please remove space before EuclideanDistanceMetric and after it.
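
As a reference point for the Description section quoted in the diff above, the brute-force definition of KNN(A, B, k) can be sketched in plain Scala. This is only an illustration under assumptions: `BruteForceKnn`, `squaredDist` and `knn` are hypothetical names, points are plain `Array[Double]` values rather than FlinkML vectors, and nothing here uses Flink's `DataSet` API or the quadtree.

```scala
// Hypothetical, non-distributed sketch of the brute-force definition;
// it is not the code from the pull request.
object BruteForceKnn {
  type Point = Array[Double]

  // Squared Euclidean distance; it orders neighbors the same way as the
  // Euclidean distance, so the square root can be skipped.
  def squaredDist(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // For every test point b, compute the distance to every training point
  // and keep the k closest ones: KNN(A, B, k) = { (b, KNN(b, A, k)) | b in B }.
  def knn(training: Seq[Point], testing: Seq[Point], k: Int): Seq[(Point, Seq[Point])] =
    testing.map { b =>
      (b, training.sortBy(a => squaredDist(a, b)).take(k))
    }
}
```

Sorting the whole training set for each test point keeps the sketch short; the cost grows with the product of the two set sizes, which is the cost the quadtree is meant to reduce.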




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261832#comment-15261832
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397260
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) 
are the k-nearest points to b in A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To ease the brute-force computation of computing the 
distance between every traning point a quadtree is used.  The quadtree scales 
well in the number of training points, though poorly in the spatial dimension.  
The algorithm will automatically choose whether or not to use the quadtree, 
though the user can override that decision by setting a parameter to force use 
or not use a quadtree. 
+
+##Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operation.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's `Vector` the corresponding 
class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Paremeters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest-neoghbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
+(Default value: 5)
+  
+
+  
+  
+ DistanceMetric
--- End diff --

Please remove space before DistanceMetric




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261829#comment-15261829
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397153
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) 
are the k-nearest points to b in A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To ease the brute-force computation of computing the 
distance between every traning point a quadtree is used.  The quadtree scales 
well in the number of training points, though poorly in the spatial dimension.  
The algorithm will automatically choose whether or not to use the quadtree, 
though the user can override that decision by setting a parameter to force use 
or not use a quadtree. 
+
+##Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operation.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's `Vector` the corresponding 
class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Paremeters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest-neoghbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
--- End diff --

Defines the number of nearest-_neighbors_ ...




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261831#comment-15261831
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r61397191
  
--- Diff: docs/libs/ml/knn.md ---
@@ -0,0 +1,146 @@
+---
+mathjax: include
+htmlTitle: FlinkML - k-nearest neighbors
+title: FlinkML - knn
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+Implements an exact k-nearest neighbors algorithm.  Given a training set 
$A$ and a testing set $B$, the algorithm returns
+
+$$
+KNN(A,B, k) = \{ \left( b, KNN(b,A) \right) where b \in B and KNN(b, A, k) 
are the k-nearest points to b in A \}
+$$
+
+The brute-force approach is to compute the distance between every training 
and testing point.  To ease the brute-force computation of computing the 
distance between every traning point a quadtree is used.  The quadtree scales 
well in the number of training points, though poorly in the spatial dimension.  
The algorithm will automatically choose whether or not to use the quadtree, 
though the user can override that decision by setting a parameter to force use 
or not use a quadtree. 
+
+##Operations
+
+`KNN` is a `Predictor`. 
+As such, it supports the `fit` and `predict` operation.
+
+### Fit
+
+KNN is trained given a set of `LabeledVector`:
+
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Predict
+
+KNN predicts for all subtypes of FlinkML's `Vector` the corresponding 
class label:
+
+* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where 
the `(T, Array[Vector])` tuple
+  corresponds to (testPoint, K-nearest training points)
+
+## Paremeters
+The KNN implementation can be controlled by the following parameters:
+
+   
+
+  
+Parameters
+Description
+  
+
+
+
+  
+K
+
+  
+Defines the number of nearest-neoghbors to search for.  That 
is, for each test point, the algorithm finds the K nearest neighbors in the 
training set
--- End diff --

the algorithm finds the _K-nearest_ neighbors ...




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243164#comment-15243164
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-210527093
  
@tillrohrmann @chiwanpark does the rebasing look OK now?  Not all of the CI 
builds went through: 2 passed, 2 failed, and 1 timed out (it seems there is 
a 2-hour limit for the CI process).




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241275#comment-15241275
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-209974474
  
@chiwanpark I added those files; I had forgotten to run `git add`.  A 
couple of other files were added to the `flink-staging` directory, perhaps as a 
result of rebasing.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240878#comment-15240878
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-209854438
  
Hi @danielblazevski, sorry for the late reply. I checked your updated PR, but 
your last commit (d6f90ce) seems wrong: it removes KNN.scala, 
QuadTree.scala, KNNITSuite.scala, and QuadTreeSuite.scala. Could you check 
again?




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233389#comment-15233389
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user hsaputra commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-207720551
  
Well, it seems like Travis likes it =)




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233304#comment-15233304
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-207685012
  
@hsaputra I added apache/flink as upstream, namely:
`git remote add upstream https://github.com/apache/flink.git`
Then I ran what Chiwan above suggested, namely:
```
# fetch updated master branch
git fetch upstream master
# checkout local master branch
git checkout master 
# merge local master branch and upstream master branch (this should be a fast-forward merge)
git merge upstream/master
# checkout local FLINK-1745 branch
git checkout FLINK-1745
# rebase FLINK-1745 on local master branch
git rebase master
# force push local FLINK-1745 branch to github's FLINK-1745 branch
git push origin +FLINK-1745
```
I then moved the 4 knn files originally in flink-staging/ to 
flink-libraries/ and pushed again. 

The unfortunate thing now is that when I run `mvn clean package 
-DskipTests` I get errors (I can show you if you'd like, but I assume the 
Travis CI build won't go through and the error will pop up there too).  Did I 
do something wrong?  The good news is that I made a copy of the directory that 
I was working in, since I've had rebasing problems before, so I can always 
go back to that and do a force push.

I wonder, since I'm only adding new files, whether it's even easier to just 
clone `apache/master`, run `mvn clean package -DskipTests`, put the new files in 
there, and submit a new PR?





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-04-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232661#comment-15232661
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user hsaputra commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-207547199
  
@danielblazevski: Sorry, but could you rebase this PR to resolve the 
conflicts? Thanks




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210282#comment-15210282
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-200854252
  
@tillrohrmann @chiwanpark done, polished up KNN.scala and some minor 
changes -- e.g. expanding the description of the parameters in the beginning of 
KNN.scala.  

Looking forward to doing the approximate version.  I ran some tests last 
week of the pure Scala z-value KNN and it looks promising 
(https://github.com/danielblazevski/zknn-scala)




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208921#comment-15208921
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-200480609
  
@tillrohrmann thanks! I'll polish up knn.md soon




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208861#comment-15208861
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-200465211
  
Hi Daniel, the docs are handwritten as far as I know.

On Wed, Mar 23, 2016 at 5:49 PM, Daniel Blazevski 
wrote:

> @chiwanpark about the docs, when I look
> at docs/libs/ml/svm.md for instance (or als.md, etc.), the parameters
> section seems auto-generated, is that correct? If so, do you know how this
> was auto-generated? If not, I'll use the existing docs as a template.
>





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208753#comment-15208753
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-200436234
  
@chiwanpark about the docs, when I look at docs/libs/ml/svm.md for instance 
(or als.md, etc.), the `parameters` section seems auto-generated, is that 
correct?  If so, do you know how this was auto-generated? If not, I'll use the 
existing docs as a template.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205848#comment-15205848
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-199648335
  
Hi @danielblazevski, thanks for the update! The implementation looks good 
to me. (I will address some minor issues and the rebasing myself.)

About the docs, I meant that we need to add a description, examples, and the 
meaning of the parameters to the documentation on our homepage 
(`docs/libs/ml/knn.md`).





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195486#comment-15195486
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-196879970
  
@chiwanpark for the docs, it looks like you use a markdown file 
auto-generated from the source code?  That is, something analogous to `scaladoc 
myScalaFile.scala` but for markdown.  The `Parameters` section in the ML docs 
seems especially auto-generated.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185789#comment-15185789
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski closed the pull request at:

https://github.com/apache/flink/pull/1220




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185790#comment-15185790
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

GitHub user danielblazevski reopened a pull request:

https://github.com/apache/flink/pull/1220

[FLINK-1745] Add exact k-nearest-neighbours algorithm to machine learning 
library

I added a quadtree data structure for the kNN algorithm.  @chiwanpark 
originally made a pull request for a kNN algorithm, and we coordinated so that 
I would incorporate a tree structure. The quadtree scales very well with the 
number of training + test points, but scales poorly with the dimension (even the 
R-tree scales poorly with the dimension). I added a flag that automatically 
determines whether or not to use the quadtree. My implementation needed to use 
the Euclidean or SquaredEuclidean distance, since I needed a specific notion of 
the distance between a test point and a box in the quadtree. I added another 
test, KNNQuadTreeSuite, in addition to Chiwan Park's KNNITSuite, since C. Park's 
parameters will automatically choose the brute-force non-quadtree method.
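
To illustrate the notion of distance between a test point and a box mentioned above, here is a minimal sketch in plain Scala. It is an assumption about how such a distance can be computed for axis-aligned boxes under the (squared) Euclidean metric, not the code from this pull request; `BoxDistance`, `minSquaredDist` and `minDist` are hypothetical names, and the box is described by its lower and upper corners.

```scala
// Hypothetical helper, not taken from the PR: distance from a point to the
// closest point of an axis-aligned box given by its lower and upper corners.
object BoxDistance {
  def minSquaredDist(point: Array[Double],
                     lower: Array[Double],
                     upper: Array[Double]): Double = {
    require(point.length == lower.length && point.length == upper.length,
      "point and box corners must have the same dimension")
    var sum = 0.0
    var i = 0
    while (i < point.length) {
      // Per-coordinate gap: zero when the coordinate lies inside [lower, upper],
      // otherwise the distance to the nearer face of the box.
      val gap = math.max(0.0, math.max(lower(i) - point(i), point(i) - upper(i)))
      sum += gap * gap
      i += 1
    }
    sum
  }

  // Euclidean version; zero when the point is inside the box.
  def minDist(point: Array[Double],
              lower: Array[Double],
              upper: Array[Double]): Double =
    math.sqrt(minSquaredDist(point, lower, upper))
}
```

A tree search can use such a lower bound to skip a box whenever its minimum distance to the test point already exceeds the distance to the current k-th nearest candidate. The per-coordinate decomposition is also why this kind of bound is tied to the Euclidean family of metrics.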

For more details on the quadtree + how I used it for the KNN query, please 
see another branch I created that has a README.md:

https://github.com/danielblazevski/flink/tree/FLINK-1745-devel/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/danielblazevski/flink FLINK-1745

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/1220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1220


commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park 
Date:   2015-06-30T08:41:25Z

[FLINK-1745] [ml] Add exact k-nearest-neighbor join

commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski 
Date:   2015-09-15T21:49:05Z

modified kNN test to familiarize with Flink and KNN.scala

commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski 
Date:   2015-09-15T21:49:17Z

modified kNN test to familiarize with Flink and KNN.scala

commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski 
Date:   2015-09-16T01:26:57Z

nightly commit, minor changes:  got the filter to work, working on mapping 
the training set to include box labels

commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski 
Date:   2015-09-22T02:02:51Z

commit before incorporating quadtree

commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski 
Date:   2015-09-22T15:49:25Z

did a basic import of QuadTree and Test; to-do:  modify QuadTree to allow 
KNN.scala to make use of

commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski 
Date:   2015-09-22T21:03:04Z

transferred ListBuffers for objects in leaf nodes to Vectors

commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski 
Date:   2015-09-23T00:41:29Z

basic test on 2D unit box seems to work -- need to generalize, e.g. to 
include automated bounding box

commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski 
Date:   2015-09-23T15:03:06Z

had to debug quadtree -- back to testing 2D

commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski 
Date:   2015-09-23T15:22:46Z

Testing 2D looks good, strong improvement in run time compared to 
brute-force method

commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski 
Date:   2015-09-23T19:04:52Z

added automated detection of bounding box based on min/max values of both 
training and test sets

commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski 
Date:   2015-09-24T00:28:56Z

added automated radius about test point to define localized neighborhood, 
result runs.  TO-DO:  Lots of tests

commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski 
Date:   2015-09-24T00:54:06Z

Note for future:  previous commit passed the test that Chiwan Park had in the 
initial knn implementation

commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski 
Date:   2015-09-24T01:56:38Z

Note for future:  previous commit passed 3D version of the test that Chiwan 
Park had in the initial knn implementation

commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c
Author: 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185785#comment-15185785
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-193964375
  
Hi @chiwanpark, I modified the tests and corrected the package + import 
statements, please have a look.  

I will add more details soon





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184308#comment-15184308
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-193580086
  
Hi @danielblazevski, thanks for the update and sorry for the late reply. I 
tried to test your implementation and found a few things to do before merging 
this.

First, about the test cases. I think we should add a test case for KNN 
with quad-tree rather than modifying the test case without quad-tree. We also 
need some test cases with non-executable configurations, such as KNN with 
quad-tree and an incompatible distance metric. A way to write a test case 
that expects an exception is described in the scalatest documentation 
(**Intercepted exceptions** section in 
http://www.scalatest.org/user_guide/using_assertions).

Second, the package definitions of `QuadTree` and `QuadTreeSuite` do not 
match the directory structure.

Finally, I think we need to add a more detailed description, with some 
mathematical background on KNN and quad-trees (including links to your slides 
and the papers you referred to), to the documentation. We also need examples 
and a description of the parameters with their default values.

About rebasing: if you set `apache/flink` as the remote `apache`, you can 
apply the commands I suggested, replacing `upstream` with `apache`. You don't 
need to worry while rebasing; I also have a copy of your `FLINK-1745` branch 
on my local machine. If you have problems with rebasing, I'll rebase it on 
`apache/master`.




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149198#comment-15149198
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-184853357
  
P.S. about rebasing, need to be careful, something went wrong the first 
time around.  I actually just started working on a new laptop, and started the 
git repo "from scratch" as follows:
```
clone the master and FLINK-1745 branches of my fork of Flink
checkout FLINK-1745, commit and push to origin (origin = my fork)
```

I set the upstream to `origin`; is that a mistake?  Specifically, when I 
pushed my local branch to GitHub, I ran:
```
git push --set-upstream origin FLINK-1745
```
`origin` is my fork.  Should I re-do this by adding a new `remote` called 
`apache` and run
```
git push --set-upstream apache FLINK-1745
```
and then run the git commands you mentioned to rebase?  Want to be careful, 
making a re-basing mistake can be a nightmare to fix :-)  







[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149173#comment-15149173
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-184847163
  
Hi @chiwanpark sorry for the delay!  I will now have more time to wrap this 
PR up.  I added a check just before calling `knn`:
```scala
if (useQuadTree) {
  // The quadtree query only supports Euclidean-type metrics.
  if (metric.isInstanceOf[EuclideanDistanceMetric] ||
    metric.isInstanceOf[SquaredEuclideanDistanceMetric]) {
    knnQueryWithQuadTree(training.values, testing.values, k, metric, queue, out)
  } else {
    throw new IllegalArgumentException(
      "Error: metric must be Euclidean or SquaredEuclidean!")
  }
} else {
  knnQueryBasic(training.values, testing.values, k, metric, queue, out)
}
// (closing braces of the enclosing scopes omitted)
```
Does that work?  The commit includes the hint for the cross operation as 
well. 




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114638#comment-15114638
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-174369547
  
Hi @danielblazevski, you don't need to open a new PR or merge the master 
branch. Instead, update your `master` branch and rebase your local 
`FLINK-1745` branch on `master`. After rebasing, you have to force push to 
your GitHub `FLINK-1745` branch.

```bash
# fetch updated master branch
git fetch upstream master
# checkout local master branch
git checkout master 
# merge upstream master into local master (this should be a fast-forward merge)
git merge upstream/master
# checkout local FLINK-1745 branch
git checkout FLINK-1745
# rebase FLINK-1745 on local master branch
git rebase master
# force push local FLINK-1745 branch to github's FLINK-1745 branch
git push origin +FLINK-1745
```
Note that there is a `+` before `FLINK-1745` to force the push.

About raising the error: I think the user typically specifies all parameters 
before calling the `fit` method. Currently, the error is raised during the 
cross operation, because the metric check is in the `minDist` method of the 
`QuadTree` class. I would like to check for this metric conflict before the 
operation runs. It would be best to add a method like `checkQuadTreeConflict` 
to the `KNN` class and call it in the `setUseQuadTree` and `setDistanceMetric` 
methods, or call it somewhere else before the operation runs.
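
The early check suggested above could be sketched as follows. This is a minimal sketch under stated assumptions: `Metric`, its case objects, and `KnnConfig` are hypothetical stand-ins for FlinkML's metric classes and the `KNN` predictor, used only so that the example runs without Flink on the classpath.

```scala
// Illustrative only: Metric and KnnConfig are stand-ins, not FlinkML classes.
object QuadTreeCheck {

  sealed trait Metric
  case object Euclidean extends Metric
  case object SquaredEuclidean extends Metric
  case object Manhattan extends Metric

  /** `useQuadTree = None` means "let the algorithm decide automatically". */
  final case class KnnConfig(
      metric: Metric = Euclidean,
      useQuadTree: Option[Boolean] = None) {

    // Both setters validate the resulting configuration immediately,
    // so a conflict surfaces before any distributed operation runs.
    def setDistanceMetric(m: Metric): KnnConfig = checked(copy(metric = m))

    def setUseQuadTree(flag: Boolean): KnnConfig =
      checked(copy(useQuadTree = Some(flag)))
  }

  /** Throws if the quadtree was explicitly requested with an unsupported metric. */
  def checkQuadTreeConflict(config: KnnConfig): Unit = {
    val supported = config.metric == Euclidean || config.metric == SquaredEuclidean
    if (config.useQuadTree.contains(true) && !supported) {
      throw new IllegalArgumentException(
        s"Quadtree requires a Euclidean or SquaredEuclidean metric, got ${config.metric}")
    }
  }

  private def checked(config: KnnConfig): KnnConfig = {
    checkQuadTreeConflict(config)
    config
  }
}
```

With this shape, the order of the setter calls does not matter: whichever call completes the conflicting combination is the one that throws.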




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114626#comment-15114626
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r50647700
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.DataSetUtils._
+//import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+  DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
+
+import org.apache.flink.ml.nn.util.QuadTree
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = ...
+  *   val testingDS: DataSet[Vector] = ...
+  *
+  *   val knn = KNN()
+  *     .setK(10)
+  *     .setBlocks(5)
+  *     .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  *   knn.fit(trainingDS)
+  *
+  *   val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+    * @param k the number of selected points as neighbors
+    */
+  def setK(k: Int): KNN = {
+    require(k > 0, "K must be positive.")
+    parameters.add(K, k)
+    this
+  }
+
+  /** Sets the distance metric
+    * @param metric the distance metric to calculate distance between two points
+    */
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+    parameters.add(DistanceMetric, metric)
+    this
+  }
+
+  /** Sets the number of data blocks/partitions
+    * @param n the number of data blocks
+    */
+  def setBlocks(n: Int): KNN = {
+    require(n > 0, "Number of blocks must be positive.")
+    parameters.add(Blocks, n)
+    this
+  }
+
+  /**
+   * Sets the Boolean variable that decides whether to use the QuadTree or not
+   */
+  def setUseQuadTree(UseQuadTree: Boolean): KNN = {

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114627#comment-15114627
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r50647792
  

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114629#comment-15114629
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r50647908
  

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114458#comment-15114458
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-174329818
  
@chiwanpark I see, I thought maybe there was a way to not even use a cross 
at all.  I changed the code according to your suggestion and got an error.  

First, I assumed I should add the line 
```scala
val sizeHint = resultParameters.get(SizeHint).get
```
before the 
```scala 
val crossTuned = sizeHint match {...
``` 
clause.  Attached is a screenshot from IntelliJ:  
https://cloud.githubusercontent.com/assets/10012612/12538089/3d9801a4-c29e-11e5-9c8d-419c06fa7553.png

Another logistical question for @chiwanpark and @tillrohrmann: I see that 
the directory structure of Flink has changed since my initial PR.  I'm not sure 
what the best practice is here.  I see a couple of less-than-ideal options:  
(1) create a new PR with the updated directory structure (not ideal), or 
(2) pull the master branch and merge it with this branch, but then many 
commits not relevant to this PR will be added when I merge (less ideal...).  

On a smaller note, I see your point @chiwanpark about raising the flag 
earlier with the choice of metric and using a quadtree.  Do we want to do this 
in `fit` though?  In `fit`, I can get the metric and the parameter 
`useQuadTree`, but if the user does not specify `setUseQuadTree`, then I still 
have a conservative test that requires one to know how many training and test 
points there are.  That will determine whether or not to use the quadtree (i.e. 
will only use a quadtree if it will improve performance based on a conservative 
test).  Is it OK to put in `predictValues` instead where all the variables 
needed -- metric, training  and test sets -- have been passed?  Otherwise I 
will have to re-factor the code more.  

I changed the format based on @chiwanpark 's suggestion to make it look 
like what @tillrohrmann suggested.  

I committed and pushed the code if you'd like to take a look (I added a 
knn.md file in docs, but that is still very much a work in progress :-) 




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101101#comment-15101101
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-171852976
  
@danielblazevski, I think we can use the `crossWithTiny` and `crossWithHuge` 
methods to reduce the shuffle cost. The best approach would be to count the 
elements in both datasets and then decide which cross method to use, but for 
now we can simply add a parameter to decide this, like the following:

```scala
import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint

class KNN {
  // ...

  def setSizeHint(sizeHint: CrossHint): KNN = {
    parameters.add(SizeHint, sizeHint)
    this
  }

  // ...
}

object KNN {
  // ...

  case object SizeHint extends Parameter[CrossHint] {
    val defaultValue: Option[CrossHint] = None
  }

  // ...
}
```

And we can use the parameter in `predictValues` method:

```scala
val crossTuned = sizeHint match {
  case Some(hint) if hint == CrossHint.FIRST_IS_SMALL =>
    trainingSet.crossWithHuge(inputSplit)
  case Some(hint) if hint == CrossHint.SECOND_IS_SMALL =>
    trainingSet.crossWithTiny(inputSplit)
  case _ => trainingSet.cross(inputSplit)
}

val crossed = crossTuned.mapPartition {
  // ...
}

// ...
```

We have to decide on the name of the added parameter (`SizeHint`) and 
document which dataset is first (training) and which dataset is second 
(testing).

By the way, there is no documentation for k-NN. Could you add the 
documentation to the `docs/ml` directory? 




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096113#comment-15096113
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-171277446
  
Hi @chiwanpark, I agree that `cross` is an expensive computation. That part of 
the code was adopted from your earlier version. Before I try to change it, do 
you have ideas on the best strategy to fix it using the most efficient 
Flink-esque features?




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2016-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093731#comment-15093731
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/1220#issuecomment-170875879
  
Hi @danielblazevski, I'm sorry for the late reply. If you turn off the IntelliJ 
IDEA alignment option (Preferences -> Editor -> Code Style -> Scala -> 
Wrapping and Braces -> Method declaration parameters -> Align when multiline), 
you get the style suggested by @tillrohrmann.

Could you apply this option?

I think that your PR is almost ready to merge, but I have to check a few 
problems that still exist.

First, about the meaning of the `UseQuadTree` parameter: you said that it 
forces the use of the quadtree. I think this would be a problem because the 
`DistanceMeasure` parameter can conflict with the quadtree. I would like to 
raise an error early if the parameter setting has a problem. Could you add this 
check at the top of the fit operation?
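
A fail-fast check of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `MetricSketch` and its case objects are hypothetical stand-ins for FlinkML's metric classes, `resolveUseQuadTree` is an invented helper name, and the rule for the unset case (use the tree whenever the metric allows it) is a simplification rather than the PR's actual heuristic.

```scala
// Hypothetical stand-ins for FlinkML's distance metrics; only the names echo the PR.
sealed trait MetricSketch
case object Euclidean extends MetricSketch
case object SquaredEuclidean extends MetricSketch
case object Manhattan extends MetricSketch

object QuadTreeCheckSketch {

  /** Decides whether to use the quadtree, failing fast on a conflicting setting.
    *
    * @param useQuadTree Some(true) forces the tree, Some(false) disables it,
    *                    None leaves the choice to this method
    * @param metric      the configured distance metric
    * @return true if the quadtree should be used
    */
  def resolveUseQuadTree(useQuadTree: Option[Boolean], metric: MetricSketch): Boolean = {
    // The quadtree's minDist computation is only defined for these two metrics.
    val supported = metric == Euclidean || metric == SquaredEuclidean
    useQuadTree match {
      case Some(true) =>
        // Throws IllegalArgumentException before any data is processed.
        require(supported,
          s"The quadtree only supports (squared) Euclidean distance, but got $metric.")
        true
      case Some(false) => false
      case None        => supported // assumption: default to the tree when it is usable
    }
  }
}
```

Calling such a helper at the very top of the fit operation would surface the conflict when the pipeline is built, instead of deep inside a distributed computation.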

Second, how about avoiding the `cross` operation? As @tillrohrmann said, 
`cross` is a very heavy operation. Is there a nicer solution?

Other problems, such as style differences and unnecessary spaces, can be 
addressed by me before merging this. :)




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049556#comment-15049556
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166220
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+  EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or squareEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+    val nodeElements = new ListBuffer[Vector]
+
+    /** for testing purposes only; used in QuadTreeSuite.scala
+      *
+      * @return center and width of the box
+      */
+    def getCenterWidth(): (Vector, Vector) = {
+      (center, width)
+    }
+
+    def contains(queryPoint: Vector): Boolean = {
+      overlap(queryPoint, 0.0)
+    }
+
+    /** Tests if queryPoint is within a radius of the node
+      *
+      * @param queryPoint
+      * @param radius
+      * @return
+      */
+    def overlap(queryPoint: Vector, radius: Double): Boolean = {
+      var count = 0
+      for (i <- 0 to queryPoint.size - 1) {
+        if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+          queryPoint(i) + radius > center(i) - width(i) / 2) {
+          count += 1
+        }
+      }
+
+      if (count == queryPoint.size) {
+        true
+      } else {
+        false
+      }
+    }
+
+    /** Tests if queryPoint is near a node
+      *
+      * @param queryPoint
+      * @param radius
+      * @return
+      */
+    def isNear(queryPoint: Vector, radius: Double): Boolean = {
+      if (minDist(queryPoint) < radius) {
+        true
+      } else {
+        false
+      }
+    }
+
+    /**
+     * used in error handling when computing minDist to make sure
+     * distMetric is Euclidean or SquaredEuclidean
+     * @param message
+     */
+    case class metricException(message: String) extends Exception(message)
+
+    /**
+     * minDist is defined so that every point in the box
+     * has distance to queryPoint greater than minDist
+     * (minDist adopted from "Nearest Neighbors Queries" by N. Roussopoulos et al.)
+     *
+     * @param queryPoint
+     * @return
+     */
+
+    def minDist(queryPoint: Vector): Double = {
+      var minDist = 0.0
+      for (i <- 0 to queryPoint.size - 1) {
+        if (queryPoint(i) < center(i) - width(i) / 2) {
+          minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2,
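
The quoted diff is cut off inside `minDist`, and the sketch below does not attempt to reconstruct the PR's code. It is a standalone illustration of the general idea the comments describe: the squared minimum distance from a query point to an axis-aligned box given by its center and full side widths, plus a pruning test in the spirit of `isNear`. The object name `MinDistSketch`, the use of plain `Array[Double]` instead of FlinkML vectors, and the choice to return the squared value are all assumptions made for this example.

```scala
object MinDistSketch {

  /** Squared minimum distance from `query` to the axis-aligned box described by
    * `center` and `width` (full side lengths). Returns 0.0 if the point is inside the box.
    */
  def minDistSquared(
      query: Array[Double],
      center: Array[Double],
      width: Array[Double]): Double = {
    require(query.length == center.length && center.length == width.length,
      "All vectors must have the same dimension.")
    var sum = 0.0
    for (i <- query.indices) {
      val lower = center(i) - width(i) / 2
      val upper = center(i) + width(i) / 2
      // Only coordinates outside [lower, upper] contribute to the distance.
      val gap =
        if (query(i) < lower) lower - query(i)
        else if (query(i) > upper) query(i) - upper
        else 0.0
      sum += gap * gap
    }
    sum
  }

  /** Pruning test: can the box contain a point closer to `query` than `radius`? */
  def isNear(
      query: Array[Double],
      center: Array[Double],
      width: Array[Double],
      radius: Double): Boolean =
    math.sqrt(minDistSquared(query, center, width)) < radius
}
```

With a box centered at the origin and side width 2 in each dimension, a query at (4, 5) is 3 and 4 units outside the box along the two axes, so the squared minimum distance is 25. A k-NN search can skip any box for which this bound already exceeds the distance to the current k-th nearest candidate.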

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049554#comment-15049554
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166211
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049571#comment-15049571
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166315
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049558#comment-15049558
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166230
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049566#comment-15049566
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166281
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,316 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+  DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+
+import org.apache.flink.ml.nn.util.QuadTree
+import scala.collection.mutable.ListBuffer
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for each point in the test set.
+  *
+  * @example
+  * {{{
+  * val trainingDS: DataSet[Vector] = ...
+  * val testingDS: DataSet[Vector] = ...
+  *
+  * val knn = KNN()
+  *   .setK(10)
+  *   .setBlocks(5)
+  *   .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  * knn.fit(trainingDS)
+  *
+  * val predictionDS: DataSet[(Vector, Array[Vector])] = knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. (Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. This number should be set
+  * at least to the degree of parallelism. If no value is specified, then the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: '''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two points. If no metric is
+  * specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+    * @param k the number of selected points as neighbors
+    */
+  def setK(k: Int): KNN = {
+    require(k > 0, "K must be positive.")
+    parameters.add(K, k)
+    this
+  }
+
+  /** Sets the distance metric
+    * @param metric the distance metric to calculate distance between two points
+    */
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+    parameters.add(DistanceMetric, metric)
+    this
+  }
+
+  /** Sets the number of data blocks/partitions
+    * @param n the number of data blocks
+    */
+  def setBlocks(n: Int): KNN = {
+    require(n > 0, "Number of blocks must be positive.")
+    parameters.add(Blocks, n)
+    this
+  }
+
+  /**
+   * Sets the Boolean variable that decides whether to use the QuadTree or not
+   */
+  def setUseQuadTree(UseQuadTree: Boolean): KNN = {
+    parameters.add(UseQuadTreeParam, UseQuadTree)
+    this
+  }
+
+}
+
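
The setter pattern in the diff above (validate with `require`, store the value under a typed parameter key, return `this` for chaining) can be tried out in isolation with the sketch below. It is a simplified, hypothetical model and not FlinkML's actual `Parameter` or `ParameterMap` classes: `ParamSketch`, `ParamMapSketch`, `KnnParamKeys`, and `KnnParamsSketch` are names invented for this example. The defaults (5 for `K`, none for `Blocks`) follow the Scaladoc in the diff.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of a typed parameter key with an optional default.
trait ParamSketch[T] {
  def defaultValue: Option[T]
}

// Hypothetical parameter map: explicit values win over defaults.
class ParamMapSketch {
  private val values = mutable.Map.empty[ParamSketch[_], Any]

  def add[T](param: ParamSketch[T], value: T): this.type = {
    values(param) = value
    this
  }

  /** Returns the explicitly set value if present, otherwise the parameter's default. */
  def get[T](param: ParamSketch[T]): Option[T] =
    values.get(param).map(_.asInstanceOf[T]).orElse(param.defaultValue)
}

object KnnParamKeys {
  case object K extends ParamSketch[Int] {
    val defaultValue: Option[Int] = Some(5)
  }
  case object Blocks extends ParamSketch[Int] {
    val defaultValue: Option[Int] = None
  }
}

class KnnParamsSketch {
  import KnnParamKeys._

  val parameters = new ParamMapSketch

  /** Rejects invalid input immediately, then stores it and returns `this` for chaining. */
  def setK(k: Int): KnnParamsSketch = {
    require(k > 0, "K must be positive.")
    parameters.add(K, k)
    this
  }

  def setBlocks(n: Int): KnnParamsSketch = {
    require(n > 0, "Number of blocks must be positive.")
    parameters.add(Blocks, n)
    this
  }
}
```

With this model, `new KnnParamsSketch().setK(10).setBlocks(5)` chains as in the Scaladoc example, and a parameter that was never set falls back to its default when read.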

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049555#comment-15049555
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166215
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049565#comment-15049565
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166275
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,316 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+
+import org.apache.flink.ml.nn.util.QuadTree
+import scala.collection.mutable.ListBuffer
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  * val trainingDS: DataSet[Vector] = ...
+  * val testingDS: DataSet[Vector] = ...
+  *
+  * val knn = KNN()
+  *   .setK(10)
+  *   .setBlocks(5)
+  *   .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  * knn.fit(trainingDS)
+  *
+  * val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
+parameters.add(K, k)
+this
+  }
+
+  /** Sets the distance metric
+* @param metric the distance metric to calculate distance between two 
points
+*/
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+parameters.add(DistanceMetric, metric)
+this
+  }
+
+  /** Sets the number of data blocks/partitions
+* @param n the number of data blocks
+*/
+  def setBlocks(n: Int): KNN = {
+require(n > 0, "Number of blocks must be positive.")
+parameters.add(Blocks, n)
+this
+  }
+
+  /** Sets whether the QuadTree is used
+* @param useQuadTree true to use the QuadTree, false otherwise
+*/
+  def setUseQuadTree(useQuadTree: Boolean): KNN = {
+parameters.add(UseQuadTreeParam, useQuadTree)
+this
+  }
+
+}
+

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049563#comment-15049563
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166268
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/KNN.scala ---
@@ -0,0 +1,316 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn
+
+import org.apache.flink.api.common.operators.Order
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala.utils._
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common._
+import org.apache.flink.ml.math.{Vector => FlinkVector, DenseVector}
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+DistanceMetric, EuclideanDistanceMetric}
+import org.apache.flink.ml.pipeline.{FitOperation, 
PredictDataSetOperation, Predictor}
+import org.apache.flink.util.Collector
+
+import org.apache.flink.ml.nn.util.QuadTree
+import scala.collection.mutable.ListBuffer
+
+import scala.collection.immutable.Vector
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/** Implements a k-nearest neighbor join.
+  *
+  * Calculates the `k` nearest neighbor points in the training set for 
each point in the test set.
+  *
+  * @example
+  * {{{
+  * val trainingDS: DataSet[Vector] = ...
+  * val testingDS: DataSet[Vector] = ...
+  *
+  * val knn = KNN()
+  *   .setK(10)
+  *   .setBlocks(5)
+  *   .setDistanceMetric(EuclideanDistanceMetric())
+  *
+  * knn.fit(trainingDS)
+  *
+  * val predictionDS: DataSet[(Vector, Array[Vector])] = 
knn.predict(testingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.nn.KNN.K]]
+  * Sets the K which is the number of selected points as neighbors. 
(Default value: '''5''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.Blocks]]
+  * Sets the number of blocks into which the input data will be split. 
This number should be set
+  * at least to the degree of parallelism. If no value is specified, then 
the parallelism of the
+  * input [[DataSet]] is used as the number of blocks. (Default value: 
'''None''')
+  *
+  * - [[org.apache.flink.ml.nn.KNN.DistanceMetric]]
+  * Sets the distance metric we use to calculate the distance between two 
points. If no metric is
+  * specified, then 
[[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
+  * (Default value: '''EuclideanDistanceMetric()''')
+  *
+  */
+
+class KNN extends Predictor[KNN] {
+
+  import KNN._
+
+  var trainingSet: Option[DataSet[Block[FlinkVector]]] = None
+
+  /** Sets K
+* @param k the number of selected points as neighbors
+*/
+  def setK(k: Int): KNN = {
+require(k > 0, "K must be positive.")
+parameters.add(K, k)
+this
+  }
+
+  /** Sets the distance metric
+* @param metric the distance metric to calculate distance between two 
points
+*/
+  def setDistanceMetric(metric: DistanceMetric): KNN = {
+parameters.add(DistanceMetric, metric)
+this
+  }
+
+  /** Sets the number of data blocks/partitions
+* @param n the number of data blocks
+*/
+  def setBlocks(n: Int): KNN = {
+require(n > 0, "Number of blocks must be positive.")
+parameters.add(Blocks, n)
+this
+  }
+
+  /** Sets whether the QuadTree is used
+* @param useQuadTree true to use the QuadTree, false otherwise
+*/
+  def setUseQuadTree(useQuadTree: Boolean): KNN = {
+parameters.add(UseQuadTreeParam, useQuadTree)
+this
+  }
+
+}
+

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049560#comment-15049560
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166247
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is a lower bound: every point in the box
+ * is at distance at least minDist from queryPoint
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049567#comment-15049567
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166286
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is a lower bound: every point in the box
+ * is at distance at least minDist from queryPoint
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049569#comment-15049569
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166296
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is a lower bound: every point in the box
+ * is at distance at least minDist from queryPoint
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049607#comment-15049607
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47168358
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
--- End diff --

Later down the road, someone may want to define a quadtree for some other 
purpose that does not need minDist. Only the kNN query needs to enforce 
Euclidean/SquaredEuclidean, because it relies on the minDist function, which 
defines the distance between a point and a box.
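
For illustration, here is a minimal sketch of a point-to-box minDist of the 
kind described in this comment. It assumes axis-aligned boxes given by a 
center and a full width per dimension, as in the quoted Node class. The names 
MinDistSketch, Box and minDistSquared are hypothetical and are not taken from 
the pull request, and plain Scala vectors stand in for the FlinkML Vector type.

```scala
// Hypothetical standalone sketch; Box and its methods are illustrative names,
// not the QuadTree.Node API from the pull request.
object MinDistSketch {

  /** Axis-aligned box described by its center and full side lengths. */
  final case class Box(center: Vector[Double], width: Vector[Double]) {

    /** Squared Euclidean distance from `p` to the closest point of the box
      * (0.0 when `p` lies inside the box or on its boundary).
      */
    def minDistSquared(p: Vector[Double]): Double = {
      require(p.size == center.size, "dimension mismatch")
      p.indices.map { i =>
        val lo = center(i) - width(i) / 2
        val hi = center(i) + width(i) / 2
        // Per-coordinate gap between p(i) and the interval [lo, hi].
        val gap =
          if (p(i) < lo) lo - p(i)
          else if (p(i) > hi) p(i) - hi
          else 0.0
        gap * gap
      }.sum
    }

    /** Euclidean version: the square root of the squared lower bound. */
    def minDist(p: Vector[Double]): Double = math.sqrt(minDistSquared(p))
  }

  def main(args: Array[String]): Unit = {
    // Box spanning [-1, 1] x [-1, 1].
    val box = Box(Vector(0.0, 0.0), Vector(2.0, 2.0))
    println(box.minDistSquared(Vector(4.0, 5.0))) // per-coordinate gaps are 3 and 4
    println(box.minDist(Vector(4.0, 5.0)))
    println(box.minDist(Vector(0.5, -0.5)))       // point inside the box
  }
}
```

The per-coordinate gap construction only gives a valid lower bound for the 
Euclidean and squared Euclidean metrics handled here, which is why a sketch 
like this would need a metric check before being used for pruning in a kNN 
query.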




[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049549#comment-15049549
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166187
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is a lower bound: every point in the box
+ * is at distance at least minDist from queryPoint
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049546#comment-15049546
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166169
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import 
org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * 
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, 
maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 
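
The quoted diff breaks off inside `minDist`. As a reading aid only, and not the code from the pull request, here is a minimal sketch of the quantity the Scaladoc describes: the smallest distance from a query point to an axis-aligned box given by its center and its full side widths. The object name `BoxDistance`, the use of plain `Array[Double]` instead of FlinkML's `Vector`, and the split into a squared and an unsquared variant are assumptions made for illustration.

```scala
// Hypothetical helper, not part of the PR: names and signatures are assumptions.
object BoxDistance {

  /** Squared Euclidean distance from `query` to the closest point of the
    * axis-aligned box described by `center` and `width` (full side lengths).
    * Returns 0.0 when the query point lies inside the box.
    */
  def minDistSquared(
      query: Array[Double],
      center: Array[Double],
      width: Array[Double]): Double = {
    require(
      query.length == center.length && center.length == width.length,
      "query, center and width must have the same dimension")

    var sum = 0.0
    var i = 0
    while (i < query.length) {
      val lo = center(i) - width(i) / 2 // lower face of the box on axis i
      val hi = center(i) + width(i) / 2 // upper face of the box on axis i
      // Gap to the box along this axis; zero if the coordinate is inside [lo, hi].
      val gap =
        if (query(i) < lo) lo - query(i)
        else if (query(i) > hi) query(i) - hi
        else 0.0
      sum += gap * gap
      i += 1
    }
    sum
  }

  /** Unsquared variant, for use with a plain Euclidean metric. */
  def minDist(
      query: Array[Double],
      center: Array[Double],
      width: Array[Double]): Double =
    math.sqrt(minDistSquared(query, center, width))
}
```

For a box centered at the origin with width 2 on each side, `BoxDistance.minDist(Array(4.0, 5.0), Array(0.0, 0.0), Array(2.0, 2.0))` yields 5.0, since the per-axis gaps are 3 and 4.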

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049545#comment-15049545
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166163
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 
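
The `overlap` and `contains` methods quoted in this diff count the axes on which the query interval intersects the box and then compare the count with the dimension. A compact way to express the same per-axis test is sketched here, as an illustration rather than a proposed change: the object name `BoxOverlap` and the `Array[Double]` parameters are assumptions, and the strict inequalities are copied from the quoted code.

```scala
// Hypothetical helper, not part of the PR: names and signatures are assumptions.
object BoxOverlap {

  /** True when, on every axis, the interval [query - radius, query + radius]
    * strictly intersects the box interval [center - width/2, center + width/2].
    * This tests a cube of half-side `radius` around the query, not a sphere.
    */
  def overlap(
      query: Array[Double],
      radius: Double,
      center: Array[Double],
      width: Array[Double]): Boolean =
    query.indices.forall { i =>
      query(i) - radius < center(i) + width(i) / 2 &&
      query(i) + radius > center(i) - width(i) / 2
    }

  /** Containment is the zero-radius case; points exactly on a face are excluded
    * because the comparisons are strict.
    */
  def contains(
      query: Array[Double],
      center: Array[Double],
      width: Array[Double]): Boolean =
    overlap(query, 0.0, center, width)
}
```

Because the test is per axis, it is conservative for Euclidean queries: for a box centered at the origin with width 2, the point (1.9, 1.9) with radius 1.0 passes `overlap` even though its Euclidean distance to the nearest corner (1, 1) is larger than 1. That is presumably why the diff also offers `isNear`, which is based on `minDist`.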

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049544#comment-15049544
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166159
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 
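
The class Scaladoc says a box is split once it holds more than `maxPerBox` points and that the tree generalizes to n dimensions. The splitting code is not part of this excerpt, so the following is only a sketch of one common way to do it, assuming each box is halved along every axis, which gives 2^n children. The object name `BoxSplit` and both method names are hypothetical.

```scala
// Hypothetical helper, not part of the PR: names and signatures are assumptions.
object BoxSplit {

  /** Centers of the 2^n child boxes obtained by halving the parent on every axis.
    * Bit i of `mask` selects the upper (1) or lower (0) half along axis i.
    */
  def childCenters(center: Array[Double], width: Array[Double]): Seq[Array[Double]] = {
    require(center.length == width.length, "center and width must have the same dimension")
    val n = center.length
    (0 until (1 << n)).map { mask =>
      Array.tabulate(n) { i =>
        val sign = if (((mask >> i) & 1) == 1) 1.0 else -1.0
        // A child center sits a quarter of the parent width away from the parent center.
        center(i) + sign * width(i) / 4
      }
    }
  }

  /** Every child has half the parent's side length on each axis. */
  def childWidth(width: Array[Double]): Array[Double] = width.map(_ / 2)
}
```

In two dimensions this produces the usual four quadrants; in three dimensions it produces eight octants. The child count doubles with every added dimension, which limits how well this kind of tree scales to high-dimensional data.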

[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-12-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049557#comment-15049557
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user danielblazevski commented on a diff in the pull request:

https://github.com/apache/flink/pull/1220#discussion_r47166223
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/QuadTree.scala ---
@@ -0,0 +1,340 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.nn.util
+
+import org.apache.flink.ml.math.{Breeze, Vector}
+import Breeze._
+
+import org.apache.flink.ml.metrics.distances.{SquaredEuclideanDistanceMetric,
+EuclideanDistanceMetric, DistanceMetric}
+
+import scala.collection.mutable.ListBuffer
+import scala.collection.mutable.PriorityQueue
+
+/**
+ * n-dimensional QuadTree data structure; partitions
+ * spatial data for faster queries (e.g. KNN query)
+ * The skeleton of the data structure was initially
+ * based off of the 2D Quadtree found here:
+ * http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala
+ *
+ * Many additional methods were added to the class both for
+ * efficient KNN queries and generalizing to n-dim.
+ *
+ * @param minVec vector of the corner of the bounding box with smallest coordinates
+ * @param maxVec vector of the corner of the bounding box with largest coordinates
+ * @param distMetric metric, must be Euclidean or SquaredEuclidean
+ * @param maxPerBox threshold for number of points in each box before splitting a box
+ */
+class QuadTree(minVec: Vector, maxVec: Vector, distMetric: DistanceMetric, maxPerBox: Int){
+
+  class Node(center: Vector, width: Vector, var children: Seq[Node]) {
+
+val nodeElements = new ListBuffer[Vector]
+
+/** for testing purposes only; used in QuadTreeSuite.scala
+  *
+  * @return center and width of the box
+  */
+def getCenterWidth(): (Vector, Vector) = {
+  (center, width)
+}
+
+def contains(queryPoint: Vector): Boolean = {
+  overlap(queryPoint, 0.0)
+}
+
+/** Tests if queryPoint is within a radius of the node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def overlap(queryPoint: Vector, radius: Double): Boolean = {
+  var count = 0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) - radius < center(i) + width(i) / 2 &&
+  queryPoint(i) + radius > center(i) - width(i) / 2) {
+  count += 1
+}
+  }
+
+  if (count == queryPoint.size) {
+true
+  } else {
+false
+  }
+}
+
+/** Tests if queryPoint is near a node
+  *
+  * @param queryPoint
+  * @param radius
+  * @return
+  */
+def isNear(queryPoint: Vector, radius: Double): Boolean = {
+  if (minDist(queryPoint) < radius) {
+true
+  } else {
+false
+  }
+}
+
+/**
+ * used in error handling when computing minDist to make sure
+ * distMetric is Euclidean or SquaredEuclidean
+ * @param message
+ */
+case class metricException(message: String) extends Exception(message)
+
+/**
+ * minDist is defined so that every point in the box
+ * has distance to queryPoint greater than or equal to minDist
+ * (minDist adopted from "Nearest Neighbor Queries" by N. Roussopoulos et al.)
+ *
+ * @param queryPoint
+ * @return
+ */
+
+def minDist(queryPoint: Vector): Double = {
+  var minDist = 0.0
+  for (i <- 0 to queryPoint.size - 1) {
+if (queryPoint(i) < center(i) - width(i) / 2) {
+  minDist += math.pow(queryPoint(i) - center(i) + width(i) / 2, 
