[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2017-01-17 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826287#comment-15826287
 ] 

Daniel Blazevski commented on FLINK-1934:
-

The last update was that I made a pull request for this last summer:

https://github.com/apache/flink/pull/2050


> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-2144) Incremental count, average, and variance for windows

2016-09-09 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477990#comment-15477990
 ] 

Daniel Blazevski edited comment on FLINK-2144 at 9/9/16 8:49 PM:
-

Hi,

I am curious to know the status of this issue.  

I have worked on Flink-ML a bit, and recently contributed to Flink's 
documentation of its Window API.  

If possible, I would like to work on this.  I looked into the code and saw how 
"sum" is implemented, and I assume average and variance should be implemented 
in similar ways (with two versions, one taking a string and another an 
integer).  Count, of course, is simpler in that it counts the number of data 
points (not two different inputs).

[~aljoscha][~ggevay]
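For context, the incremental computation discussed above can be sketched in plain Scala, independent of Flink's Window API (names here are illustrative): count, average, and variance can all be maintained in O(1) per element using Welford's online algorithm.

```scala
// Minimal sketch of O(1)-per-element incremental count/average/variance
// (Welford's online algorithm); illustrative, not Flink API code.
final case class Stats(count: Long, mean: Double, m2: Double) {
  def add(x: Double): Stats = {
    val n = count + 1
    val delta = x - mean
    val newMean = mean + delta / n
    // m2 accumulates the sum of squared deviations from the running mean
    Stats(n, newMean, m2 + delta * (x - newMean))
  }
  // population variance of the elements seen so far
  def variance: Double = if (count > 0) m2 / count else 0.0
}

val stats = List(1.0, 2.0, 3.0, 4.0).foldLeft(Stats(0L, 0.0, 0.0))(_.add(_))
println(s"count=${stats.count} mean=${stats.mean} variance=${stats.variance}")
```

Keeping the second moment `m2` alongside the count and mean is what makes variance available without storing the window's elements.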



was (Author: danielblazevski):
Hi,

I am curious to know the status of this issue.  

I have worked on Flink-ML a bit, and recently contributed to Flink's 
documentation of its Window API.  

If possible, I would like to work on this.  I looked into the code and saw how 
"sum" is implemented, and I assume average and variance should be implemented 
in similar ways (with two versions, one taking a string and another an 
integer).  Count, of course, is simpler in that it counts the number of data 
points (not two different inputs).



> Incremental count, average, and variance for windows
> 
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)





[jira] [Commented] (FLINK-2144) Incremental count, average, and variance for windows

2016-09-09 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15477990#comment-15477990
 ] 

Daniel Blazevski commented on FLINK-2144:
-

Hi,

I am curious to know the status of this issue.  

I have worked on Flink-ML a bit, and recently contributed to Flink's 
documentation of its Window API.  

If possible, I would like to work on this.  I looked into the code and saw how 
"sum" is implemented, and I assume average and variance should be implemented 
in similar ways (with two versions, one taking a string and another an 
integer).  Count, of course, is simpler in that it counts the number of data 
points (not two different inputs).



> Incremental count, average, and variance for windows
> 
>
> Key: FLINK-2144
> URL: https://issues.apache.org/jira/browse/FLINK-2144
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> By count I mean the number of elements in the window.
> These can be implemented very efficiently building on FLINK-2143:
> Store: O(1)
> Evict: O(1)
> emitWindow: O(1)





[jira] [Commented] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction

2016-08-06 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410696#comment-15410696
 ] 

Daniel Blazevski commented on FLINK-3899:
-

[~aljoscha]

I see -- would it be enough to expand on the example "WindowFunction with 
Incremental Aggregation" here:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html

with specific examples for MyFoldFunction, MyReduceFunction, and 
MyWindowFunction?  

> Document window processing with Reduce/FoldFunction + WindowFunction
> 
>
> Key: FLINK-3899
> URL: https://issues.apache.org/jira/browse/FLINK-3899
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.1.0
>Reporter: Fabian Hueske
>
> The streaming documentation does not describe how windows can be processed 
> with FoldFunction or ReduceFunction and a subsequent WindowFunction. This 
> combination allows for eager window aggregation (only a single element is 
> kept in the window) and access of the Window object, e.g., to have access to 
> the window's start and end time.





[jira] [Comment Edited] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction

2016-08-03 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406319#comment-15406319
 ] 

Daniel Blazevski edited comment on FLINK-3899 at 8/3/16 6:50 PM:
-

[~fhueske]

I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML 
and have been learning/playing around with Flink DataStreams, so this could be 
a good way to dig a bit deeper into Flink's streaming APIs (and possibly later 
contribute code).

To clarify, the idea is simply to have an example with

myWindow.fold(...).keyBy(0).window(...)
and
myWindow.reduce(...).keyBy(0).window(...)

Is that correct?
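Conceptually, what the combination buys is eager aggregation: the window state holds a single pre-aggregated element, and the function applied at emission time also sees the window metadata. A minimal plain-Scala sketch of that idea (illustrative names, not Flink's actual API):

```scala
// Conceptual sketch of eager ("incremental") window aggregation.
// The window keeps one aggregated value; at fire time a window function
// combines it with window metadata (here, just start/end timestamps).
final case class Window(start: Long, end: Long)

def eagerAggregate[T, R](
    elements: Seq[T],
    window: Window,
    reduce: (T, T) => T,
    windowFn: (Window, T) => R): R = {
  // Only a single element is ever stored: the running aggregate.
  val agg = elements.reduce(reduce)
  windowFn(window, agg)
}

val result = eagerAggregate[Int, String](
  Seq(3, 1, 4), Window(0L, 1000L),
  _ + _,
  (w, sum) => s"[${w.start},${w.end}): sum=$sum")
println(result)
```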


was (Author: danielblazevski):
[~fhueske]

I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML 
and have been learning/playing around with Flink DataStreams, so this could be 
a good way to dig a bit deeper into Flink's streaming APIs (and possibly later 
contribute code).


> Document window processing with Reduce/FoldFunction + WindowFunction
> 
>
> Key: FLINK-3899
> URL: https://issues.apache.org/jira/browse/FLINK-3899
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.1.0
>Reporter: Fabian Hueske
>
> The streaming documentation does not describe how windows can be processed 
> with FoldFunction or ReduceFunction and a subsequent WindowFunction. This 
> combination allows for eager window aggregation (only a single element is 
> kept in the window) and access of the Window object, e.g., to have access to 
> the window's start and end time.





[jira] [Comment Edited] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction

2016-08-03 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406319#comment-15406319
 ] 

Daniel Blazevski edited comment on FLINK-3899 at 8/3/16 6:04 PM:
-

[~fhueske]

I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML 
and have been learning/playing around with Flink DataStreams, so this could be 
a good way to dig a bit deeper into Flink's streaming APIs (and possibly later 
contribute code).



was (Author: danielblazevski):
[~fhueske]

I'm happy to take this one if that's OK -- I've worked a bit on FlinkML and 
have been learning/playing around with Flink DataStreams, so this could be a 
good way to dig a bit deeper into Flink's streaming APIs (and possibly later 
contribute code).


> Document window processing with Reduce/FoldFunction + WindowFunction
> 
>
> Key: FLINK-3899
> URL: https://issues.apache.org/jira/browse/FLINK-3899
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.1.0
>Reporter: Fabian Hueske
>
> The streaming documentation does not describe how windows can be processed 
> with FoldFunction or ReduceFunction and a subsequent WindowFunction. This 
> combination allows for eager window aggregation (only a single element is 
> kept in the window) and access of the Window object, e.g., to have access to 
> the window's start and end time.





[jira] [Commented] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction

2016-08-03 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406319#comment-15406319
 ] 

Daniel Blazevski commented on FLINK-3899:
-

[~fhueske]

I'm happy to take this one if that's OK -- I've worked a bit on FlinkML and 
have been learning/playing around with Flink DataStreams, so this could be a 
good way to dig a bit deeper into Flink's streaming APIs (and possibly later 
contribute code).


> Document window processing with Reduce/FoldFunction + WindowFunction
> 
>
> Key: FLINK-3899
> URL: https://issues.apache.org/jira/browse/FLINK-3899
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.1.0
>Reporter: Fabian Hueske
>
> The streaming documentation does not describe how windows can be processed 
> with FoldFunction or ReduceFunction and a subsequent WindowFunction. This 
> combination allows for eager window aggregation (only a single element is 
> kept in the window) and access of the Window object, e.g., to have access to 
> the window's start and end time.





[jira] [Created] (FLINK-3996) Add addition, subtraction and multiply by scalar to DenseVector.scala and SparseVector.scala

2016-05-31 Thread Daniel Blazevski (JIRA)
Daniel Blazevski created FLINK-3996:
---

 Summary: Add addition, subtraction and multiply by scalar to 
DenseVector.scala and SparseVector.scala
 Key: FLINK-3996
 URL: https://issues.apache.org/jira/browse/FLINK-3996
 Project: Flink
  Issue Type: Improvement
Reporter: Daniel Blazevski
Assignee: Daniel Blazevski
Priority: Minor


Small change to add vector operations.  With this small change, one can now do 
things like:

val v1 = DenseVector(0.1, 0.1)
val v2 = DenseVector(0.2, 0.2)
val v3 = v1 + v2
instead of what currently has to be done:

val v1 = DenseVector(0.1, 0.1)
val v2 = DenseVector(0.2, 0.2)
val v3 = (v1.asBreeze + v2.asBreeze).fromBreeze
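For illustration, operators like these could be defined directly on a simple dense vector as follows (a sketch only; the actual FlinkML change delegates to Breeze via asBreeze/fromBreeze):

```scala
// Illustrative dense vector with element-wise +, -, and scalar *.
// Not the FlinkML implementation, which delegates to Breeze.
final case class DenseVec(data: Array[Double]) {
  private def zip(other: DenseVec)(op: (Double, Double) => Double): DenseVec = {
    require(data.length == other.data.length, "dimension mismatch")
    DenseVec(data.zip(other.data).map { case (a, b) => op(a, b) })
  }
  def +(other: DenseVec): DenseVec = zip(other)(_ + _)
  def -(other: DenseVec): DenseVec = zip(other)(_ - _)
  def *(s: Double): DenseVec = DenseVec(data.map(_ * s))
}

val v1 = DenseVec(Array(0.1, 0.1))
val v2 = DenseVec(Array(0.2, 0.2))
val v3 = v1 + v2
println(v3.data.mkString(","))
```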





[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2016-04-06 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228242#comment-15228242
 ] 

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 1:33 PM:
-

That all sounds good.

As of now I've only done simple tests to make sure that the algorithm is 
working correctly: e.g. put one test point much closer to a single training 
point than to any other, and make sure that it is returned as the nearest 
neighbor.  Indeed, more tests and benchmarks are needed -- for example to make 
sure that even if it doesn't compute the exact nearest neighbors, the 
neighbors it returns are "nearby".  
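That sanity check can be sketched as a brute-force search (illustrative, not the actual test code from the branch):

```scala
// Brute-force 1-nearest-neighbor sanity check (illustrative only):
// one training point is placed much closer to the test point than the
// others, and an exact search must return its index.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def nearest(training: Seq[Array[Double]], test: Array[Double]): Int =
  training.indices.minBy(i => squaredDist(training(i), test))

val training = Seq(Array(0.0, 0.0), Array(10.0, 10.0), Array(5.01, 5.01))
val testPoint = Array(5.0, 5.0)  // much closer to training point 2
val idx = nearest(training, testPoint)
println(s"nearest training index: $idx")
```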




was (Author: danielblazevski):
That all sounds good.

As of now I've only done simple tests to make sure that the algorithm is 
working correctly: e.g. put one test point much closer to a single training 
point than to any other, and make sure that it is returned as the nearest 
neighbor.  Indeed, more tests and benchmarks are needed -- for example to make 
sure that even if it doesn't compute the exact nearest neighbors, the 
neighbors it returns are "nearby".  



> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf





[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2016-04-06 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228242#comment-15228242
 ] 

Daniel Blazevski commented on FLINK-1934:
-

That all sounds good.

As of now I've only done simple tests to make sure that the algorithm is 
working correctly: e.g. put one test point much closer to a single training 
point than to any other, and make sure that it is returned as the nearest 
neighbor.  Indeed, more tests and benchmarks are needed -- for example to make 
sure that even if it doesn't compute the exact nearest neighbors, the 
neighbors it returns are "nearby".  



> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf





[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2016-04-05 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632
 ] 

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:07 AM:
-

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D the quadtree is worse, but no worries -- I took care of this, and the 
exact kNN decides when to use the quadtree or not).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time approx =   : 2s
Elapsed time exact =   : 27s

64,000 in 6D:  
Elapsed time approx =   : 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the 
run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
if that's possible.  Any thoughts?  Basically the idea is to hash the 
test/train set to 1D (I use the z-value hash based on [1]). 

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning 
about, and I'm not quite sure of the payoff.  One option could be that I clean 
up and optimize what I have now, since it's already performing well, and we 
open a new issue to do all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
E.g. I now convert the coordinates to integers and ignore the decimal part, so 
there are now lots of collisions in the z-value hash; normalizing the data and 
adding a fixed max number of bits to compute the z-value is needed, but will 
definitely not increase the complexity with respect to adding more 
test/training points (this is described towards the end of [3]).

Any thoughts?
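The z-value hash mentioned above amounts to bit-interleaving (Morton order); a sketch, assuming coordinates already normalized to non-negative integers with a fixed number of bits per coordinate as discussed in [3]:

```scala
// Z-value (Morton code) of a d-dimensional point: interleave the bits of
// the (normalized, non-negative integer) coordinates, most significant
// bit first. Illustrative sketch, not the code from the branch.
def zValue(coords: Array[Int], bitsPerCoord: Int): Long = {
  var z = 0L
  for (bit <- bitsPerCoord - 1 to 0 by -1; c <- coords.indices) {
    z = (z << 1) | ((coords(c) >> bit) & 1L)
  }
  z
}

// Points close in space tend to be close in z-order, so sorting by
// z-value lets a 1-D range scan find approximate nearest neighbors.
val z1 = zValue(Array(3, 5), 3)  // binary 011 and 101 interleave to 011011
println(z1)
```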


was (Author: danielblazevski):
[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D the quadtree is worse, but no worries -- I took care of this, and the 
exact kNN decides when to use the quadtree or not).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time approx =   : 2s
Elapsed time exact =   : 27s

64,000 in 6D:  
Elapsed time approx =   : 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the 
run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
if that's possible.  Any thoughts?  Basically the idea is to hash the 
test/train set to 1D (I use the z-value hash based on [1]). 

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning 
about, and I'm not quite sure of the payoff.  One option could be that I clean 
up and optimize what I have now, since it's already performing well, and we 
open a new issue to do all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
E.g. I now convert the coordinates to integers and ignore the decimal part, so 
there are now lots of collisions in the z-value hash; normalizing the data and 
adding a fixed max number of bits to compute the z-value is needed (this is 
described towards the end of [3]).

Any thoughts?

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski

[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2016-04-05 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632
 ] 

Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:05 AM:
-

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D the quadtree is worse, but no worries -- I took care of this, and the 
exact kNN decides when to use the quadtree or not).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points in 6D (i.e. 8,000 test points and 8,000 training points):
Elapsed time approx =   : 2s
Elapsed time exact =   : 27s

64,000 in 6D:  
Elapsed time approx =   : 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the 
run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
if that's possible.  Any thoughts?  Basically the idea is to hash the 
test/train set to 1D (I use the z-value hash based on [1]). 

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning 
about, and I'm not quite sure of the payoff.  One option could be that I clean 
up and optimize what I have now, since it's already performing well, and we 
open a new issue to do all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
E.g. I now convert the coordinates to integers and ignore the decimal part, so 
there are now lots of collisions in the z-value hash; normalizing the data and 
adding a fixed max number of bits to compute the z-value is needed (this is 
described towards the end of [3]).

Any thoughts?


was (Author: danielblazevski):
[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D the quadtree is worse, but no worries -- I took care of this, and the 
exact kNN decides when to use the quadtree or not).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points (i.e. 8,000 test points and 8,000 training points):
Elapsed time approx =   : 2s
Elapsed time exact =   : 27s

64,000:  
Elapsed time approx =   : 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the 
run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
if that's possible.  Any thoughts?  Basically the idea is to hash the 
test/train set to 1D (I use the z-value hash based on [1]). 

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning 
about, and I'm not quite sure of the payoff.  One option could be that I clean 
up and optimize what I have now, since it's already performing well, and we 
open a new issue to do all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
E.g. I now convert the coordinates to integers and ignore the decimal part, so 
there are now lots of collisions in the z-value hash; normalizing the data and 
adding a fixed max number of bits to compute the z-value is needed (this is 
described towards the end of [3]).

Any thoughts?

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> 

[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2016-04-05 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227632#comment-15227632
 ] 

Daniel Blazevski commented on FLINK-1934:
-

[~chiwanpark] [~till.rohrmann]

I have a Flink version -- still a bit preliminary -- of the approximate kNN up 
and running.  The exact kNN using a quadtree performs quite badly in 
moderate-to-high spatial dimensions (e.g. with 20,000 test and training points 
in 6D the quadtree is worse, but no worries -- I took care of this, and the 
exact kNN decides when to use the quadtree or not).  

https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala

A preliminary test shows good scaling as the number of test + training points 
is increased.  

8,000 points (i.e. 8,000 test points and 8,000 training points):
Elapsed time approx =   : 2s
Elapsed time exact =   : 27s

64,000:  
Elapsed time approx =   : 6s
(didn't run the exact version, since we know it's O(N^2))

I will have to clean things up, add edge cases, etc., which may slow down the 
run-time a bit, but will definitely not increase the complexity of the 
algorithm with respect to the number of test/training points.

This still uses a cross product, which I was hoping to avoid, but I'm not sure 
if that's possible.  Any thoughts?  Basically the idea is to hash the 
test/train set to 1D (I use the z-value hash based on [1]). 

I still have not implemented the ideas in [1] in full.  The full solution is 
quite complex.  They do a bunch of load balancing that I'm still learning 
about, and I'm not quite sure of the payoff.  One option could be that I clean 
up and optimize what I have now, since it's already performing well, and we 
open a new issue to do all the steps in [1].  

There are still many things to clean up, but any cleaning/edge cases will not 
add computational complexity with respect to the number of test points.  
E.g. I now convert the coordinates to integers and ignore the decimal part, so 
there are now lots of collisions in the z-value hash; normalizing the data and 
adding a fixed max number of bits to compute the z-value is needed (this is 
described towards the end of [3]).

Any thoughts?

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf





[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2015-11-02 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985273#comment-14985273
 ] 

Daniel Blazevski edited comment on FLINK-1934 at 11/2/15 2:20 PM:
--

[~till.rohrmann] , I have been looking into approximate kNN algorithms.  I 
found some more details about Ref. [1] in Ref. [3], which I added to the 
Description above.  Ref. [3] is cited several times in Ref. [1], and [3] 
mentions that the method of z-functions is best for dimensions < 10, and that 
locality sensitive hashing (LSH) is better for dimensions ~ 30 and above.  

How about I give Ref [1] a shot, and maybe later down the road add LSH?  If so, 
feel free to assign me to this issue.


was (Author: danielblazevski):
[~till.rohrmann] , I have been looking into approximate kNN algorithms.  I 
found more details in Ref. [3], which I added to the Description above.  
Ref. [3] is cited several times in Ref. [1], and [3] mentions that the method 
of z-functions is best for dimensions < 10, and that locality sensitive 
hashing (LSH) is better for dimensions ~ 30 and above.  

How about I give Ref [1] a shot, and maybe later down the road add LSH?  If so, 
feel free to assign me to this issue.

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf





[jira] [Updated] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2015-11-02 Thread Daniel Blazevski (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Blazevski updated FLINK-1934:

Description: 
kNN is still a widely used algorithm for classification and regression. 
However, due to the computational costs of an exact implementation, it does not 
scale well to large amounts of data. Therefore, it is worthwhile to also add an 
approximative kNN implementation as proposed in [1,2].

Resources:
[1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
[2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
[3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf

  was:
kNN is still a widely used algorithm for classification and regression. 
However, due to the computational costs of an exact implementation, it does not 
scale well to large amounts of data. Therefore, it is worthwhile to also add an 
approximative kNN implementation as proposed in [1,2].

Resources:
[1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
[2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf


> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-22 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902478#comment-14902478
 ] 

Daniel Blazevski commented on FLINK-1745:
-

[~till.rohrmann],
I should have mentioned that I took this 2D quadtree:
http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala

and used that as a guide to form an n-dimensional quadtree -- some 
generalizations were straightforward, others were a bit tricky/fun (like 
partitioning an n-dim box into equal sub-boxes).
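The n-dim box split mentioned above can be sketched as follows: each of the 2^n children of a box is indexed by a sign choice per dimension, picking the lower or upper half. This is plain Python for illustration only; the actual Scala quadtree code differs.

```python
from itertools import product

def split_box(center, half_width):
    """Split an n-dim box (given by its center and per-dimension half-widths)
    into 2^n equal sub-boxes; returns (child_center, child_half_width) pairs."""
    children = []
    for signs in product((-1, 1), repeat=len(center)):
        child_center = tuple(c + s * hw / 2
                             for c, s, hw in zip(center, signs, half_width))
        child_hw = tuple(hw / 2 for hw in half_width)
        children.append((child_center, child_hw))
    return children
```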

> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-21 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901837#comment-14901837
 ] 

Daniel Blazevski commented on FLINK-1745:
-

Hi [~chiwanpark] and [~till.rohrmann],
I have implemented a Quadtree data structure to put into the KNN algorithm -- 
R-Tree version may follow.  I will take each Block of the training DataSet in 
the existing KNN and make a quadtree before finding the KNN on each block.
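The per-block flow described above can be sketched as follows: each training block builds its own spatial index, the k nearest are found locally per block, and the global answer is the k best among the local candidates. This is plain Python with a brute-force stand-in for the quadtree query, not Flink code; all names are illustrative.

```python
import heapq

def local_knn(block, query, k):
    # stand-in for querying one training block's quadtree
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, block, key=dist)

def global_knn(blocks, query, k):
    # merge the per-block candidates, keeping the overall k nearest
    candidates = [p for block in blocks for p in local_knn(block, query, k)]
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, candidates, key=dist)
```

Since each local step returns its own k nearest, the union of candidates is guaranteed to contain the true global k nearest; the index only speeds up the local step.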

The quadtree class that I have is currently a prototype -- it uses only 
Array/ListBuffer[Double] types and no Flink data structures yet -- but before I 
port it to Flink, let me know if you'd like the quadtree class put in a 
specific directory; otherwise I will put it in the same directory as KNN.scala 
(or maybe create a sub-directory there).

Also let me know if there is anything specific you'd like to see done to the 
quadtree when going from basic Scala code to Flink; otherwise I will use other 
Flink code as a guide.  I have written a simple test of the quadtree, which I 
can expand and include if you'd like.

Cheers,
Dan

> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-15 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746552#comment-14746552
 ] 

Daniel Blazevski commented on FLINK-1745:
-

Thanks!

Pulled and built Chiwan's branch, and made some very minor changes to his test 
to familiarize myself more with DataSets.

Dan





> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2015-09-11 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740627#comment-14740627
 ] 

Daniel Blazevski commented on FLINK-1934:
-

Hi [~till.rohrmann], Flink seems like a neat project, looking forward to 
contributing!

I had indeed looked into H-zkNNJ in Ref [1].  As I just mentioned on the exact 
kNN issue, I think it will be wiser for me to try an R-tree implementation of 
the exact kNN first for my first Flink contribution -- so in the meantime, how 
about we don't assign this to me quite yet. 

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-11 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740584#comment-14740584
 ] 

Daniel Blazevski commented on FLINK-1745:
-

[~till.rohrmann], Nice, I am going to give the R-tree implementation a shot. As 
this is my first Flink contribution, I'm going to attempt this first before the 
approximate kNN (about to post another message there now)

> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-10 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740076#comment-14740076
 ] 

Daniel Blazevski edited comment on FLINK-1745 at 9/11/15 3:07 AM:
--

In fact, if what [~chiwanpark] has is thought to be a good base for a kNN 
algorithm, one option could be that I help to make it more efficient by 
modifying what he has and incorporate an R-tree ( 
https://en.wikipedia.org/wiki/R-tree ) ?


was (Author: danielblazevski):
In fact, if what [~chiwanpark] has is thought to be a good base for a kNN 
algorithm, one option could be that I help to make it more efficient by 
modifying what he has and incorporate an R-tree 
(https://en.wikipedia.org/wiki/R-tree)?

> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-09-10 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740076#comment-14740076
 ] 

Daniel Blazevski commented on FLINK-1745:
-

In fact, if what [~chiwanpark] has is thought to be a good base for a kNN 
algorithm, one option could be that I help to make it more efficient by 
modifying what he has and incorporate an R-tree 
(https://en.wikipedia.org/wiki/R-tree)?

> Add exact k-nearest-neighbours algorithm to machine learning library
> 
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a means to classify data and to do regression. This issue 
> focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
> proposed in [2].
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2015-09-10 Thread Daniel Blazevski (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739437#comment-14739437
 ] 

Daniel Blazevski commented on FLINK-1934:
-

Hello,

I have been thinking about implementing a distributed, scalable implementation 
of kNN and would like to know the current progress of anyone working on it 
before diving into the project.

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)