[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826287#comment-15826287 ] Daniel Blazevski commented on FLINK-1934: - Last update was that I made a pull request for this last summer https://github.com/apache/flink/pull/2050 > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. Reference [3] > is cited a few times in [1], and gives necessary background on the z-value > approach. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-2144) Incremental count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477990#comment-15477990 ] Daniel Blazevski edited comment on FLINK-2144 at 9/9/16 8:49 PM: - Hi, I am curious to know the status of this issue. I have worked on Flink-ML a bit, and recently contributed to Flink's documentation of its Window API. If possible, I would like to work on this. I looked into the code and saw how "sum" is implemented, and I assume average and variance should be implemented in similar ways (with two versions: one taking a string, another an integer). Count, of course, is simpler in that it counts the number of data points (not two different inputs). [~aljoscha][~ggevay] was (Author: danielblazevski): Hi, I am curious to know the status of this issue. I have worked on Flink-ML a bit, and recently contributed to Flink's documentation of its Window API. If possible, I would like to work on this. I looked into the code and saw how "sum" is implemented, and I assume average and variance should be implemented in similar ways (with 2 versions, one taking a string another an integer). Count, of course, is more simple in that it counts the number of data points (not two different inputs) > Incremental count, average, and variance for windows > > > Key: FLINK-2144 > URL: https://issues.apache.org/jira/browse/FLINK-2144 > Project: Flink > Issue Type: New Feature > Components: Streaming >Reporter: Gabor Gevay >Priority: Minor > Labels: statistics > > By count I mean the number of elements in the window. > These can be implemented very efficiently building on FLINK-2143: > Store: O(1) > Evict: O(1) > emitWindow: O(1) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2144) Incremental count, average, and variance for windows
[ https://issues.apache.org/jira/browse/FLINK-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477990#comment-15477990 ] Daniel Blazevski commented on FLINK-2144: - Hi, I am curious to know the status of this issue. I have worked on Flink-ML a bit, and recently contributed to Flink's documentation of its Window API. If possible, I would like to work on this. I looked into the code and saw how "sum" is implemented, and I assume average and variance should be implemented in similar ways (with two versions: one taking a string, another an integer). Count, of course, is simpler in that it counts the number of data points (not two different inputs). > Incremental count, average, and variance for windows > > > Key: FLINK-2144 > URL: https://issues.apache.org/jira/browse/FLINK-2144 > Project: Flink > Issue Type: New Feature > Components: Streaming >Reporter: Gabor Gevay >Priority: Minor > Labels: statistics > > By count I mean the number of elements in the window. > These can be implemented very efficiently building on FLINK-2143: > Store: O(1) > Evict: O(1) > emitWindow: O(1) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
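The O(1) Store/Evict/emitWindow behavior described in the issue can be sketched with a running aggregate that maintains a count, a sum, and a sum of squares. This is a minimal standalone Scala sketch, not Flink code; the class and method names are illustrative only:

```scala
// Incremental count, average, and variance over a window.
// Maintaining count, sum, and sum of squares makes adding (store) and
// removing (evict) an element O(1) each, as is emitting the statistics.
// Note: the sum-of-squares formulation is simple but can lose precision
// for large values; Welford's algorithm is more stable for add-only streams.
class IncrementalStats {
  private var count: Long = 0L
  private var sum: Double = 0.0
  private var sumSq: Double = 0.0

  def store(x: Double): Unit = { count += 1; sum += x; sumSq += x * x }
  def evict(x: Double): Unit = { count -= 1; sum -= x; sumSq -= x * x }

  def size: Long = count
  def average: Double = sum / count
  // population variance: E[x^2] - (E[x])^2
  def variance: Double = sumSq / count - average * average
}
```

For example, after storing 1, 2, 3 the average is 2 and the variance is 2/3; evicting 1 and storing 4 shifts the window to 2, 3, 4 without touching the other elements.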
[jira] [Commented] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction
[ https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410696#comment-15410696 ] Daniel Blazevski commented on FLINK-3899: - [~aljoscha] I see, would it be enough to expand on the example "WindowFunction with Incremental Aggregation" here: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/windows.html with specific examples for MyFoldFunction, MyReduceFunction, MyWindowFunction? > Document window processing with Reduce/FoldFunction + WindowFunction > > > Key: FLINK-3899 > URL: https://issues.apache.org/jira/browse/FLINK-3899 > Project: Flink > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.1.0 >Reporter: Fabian Hueske > > The streaming documentation does not describe how windows can be processed > with FoldFunction or ReduceFunction and a subsequent WindowFunction. This > combination allows for eager window aggregation (only a single element is > kept in the window) and access of the Window object, e.g., to have access to > the window's start and end time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
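The eager window aggregation being documented here can be illustrated with a plain-Scala stand-in (hypothetical types, not Flink's actual API): the window keeps a single accumulator instead of buffering all elements, and on firing a window-function-like callback sees the accumulator together with the window metadata (e.g. start and end time).

```scala
// Hypothetical stand-in for Reduce/FoldFunction + WindowFunction:
// each arriving element is folded into one accumulator (so only a single
// element of state is kept), and the emit callback receives the Window
// object, giving access to the window's start and end time.
case class Window(start: Long, end: Long)

class EagerWindow[T, ACC](init: ACC,
                          fold: (ACC, T) => ACC,
                          emit: (Window, ACC) => Unit) {
  private var acc: ACC = init
  def add(element: T): Unit = { acc = fold(acc, element) } // eager aggregation
  def fire(w: Window): Unit = { emit(w, acc); acc = init } // window metadata + result
}
```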
[jira] [Comment Edited] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction
[ https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406319#comment-15406319 ] Daniel Blazevski edited comment on FLINK-3899 at 8/3/16 6:50 PM: - [~fhueske] I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML and have been learning/playing around with Flink DataStreams, this could be a good way to dig a bit deeper into Flink's streaming APIs (and possibly later contribute code) To clarify, the idea is simply to have an example with myWindow.fold(...).keyBy(0).window(...) and myWindow.reduce(...).keyBy(0).window(...) Is that correct? was (Author: danielblazevski): [~fhueske] I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML and have been learning/playing around with Flink DataStreams, this could be a good way to dig a bit deeper into Flink's streaming APIs (and possibly later contribute code) > Document window processing with Reduce/FoldFunction + WindowFunction > > > Key: FLINK-3899 > URL: https://issues.apache.org/jira/browse/FLINK-3899 > Project: Flink > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.1.0 >Reporter: Fabian Hueske > > The streaming documentation does not describe how windows can be processed > with FoldFunction or ReduceFunction and a subsequent WindowFunction. This > combination allows for eager window aggregation (only a single element is > kept in the window) and access of the Window object, e.g., to have access to > the window's start and end time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction
[ https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406319#comment-15406319 ] Daniel Blazevski edited comment on FLINK-3899 at 8/3/16 6:04 PM: - [~fhueske] I'm happy to take this one if that's OK -- I've contributed a bit to FlinkML and have been learning/playing around with Flink DataStreams, this could be a good way to dig a bit deeper into Flink's streaming APIs (and possibly later contribute code) was (Author: danielblazevski): [~fhueske] I'm happy to take this one if that's OK -- I've worked a bit on FlinkML and have been learning/playing around with Flink DataStreams, this could be a good way to dig a bit deeper into Flink's streaming APIs (and possibly later contribute code) > Document window processing with Reduce/FoldFunction + WindowFunction > > > Key: FLINK-3899 > URL: https://issues.apache.org/jira/browse/FLINK-3899 > Project: Flink > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.1.0 >Reporter: Fabian Hueske > > The streaming documentation does not describe how windows can be processed > with FoldFunction or ReduceFunction and a subsequent WindowFunction. This > combination allows for eager window aggregation (only a single element is > kept in the window) and access of the Window object, e.g., to have access to > the window's start and end time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3899) Document window processing with Reduce/FoldFunction + WindowFunction
[ https://issues.apache.org/jira/browse/FLINK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406319#comment-15406319 ] Daniel Blazevski commented on FLINK-3899: - [~fhueske] I'm happy to take this one if that's OK -- I've worked a bit on FlinkML and have been learning/playing around with Flink DataStreams, this could be a good way to dig a bit deeper into Flink's streaming APIs (and possibly later contribute code) > Document window processing with Reduce/FoldFunction + WindowFunction > > > Key: FLINK-3899 > URL: https://issues.apache.org/jira/browse/FLINK-3899 > Project: Flink > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.1.0 >Reporter: Fabian Hueske > > The streaming documentation does not describe how windows can be processed > with FoldFunction or ReduceFunction and a subsequent WindowFunction. This > combination allows for eager window aggregation (only a single element is > kept in the window) and access of the Window object, e.g., to have access to > the window's start and end time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-3996) Add addition, subtraction and multiply by scalar to DenseVector.scala and SparseVector.scala
Daniel Blazevski created FLINK-3996: --- Summary: Add addition, subtraction and multiply by scalar to DenseVector.scala and SparseVector.scala Key: FLINK-3996 URL: https://issues.apache.org/jira/browse/FLINK-3996 Project: Flink Issue Type: Improvement Reporter: Daniel Blazevski Assignee: Daniel Blazevski Priority: Minor Small change to add vector operations. With this small change, can now do things like: val v1 = DenseVector(0.1, 0.1) val v2 = DenseVector(0.2, 0.2) val v3 = v1 + v2 instead of what now has to be done: val v1 = DenseVector(0.1, 0.1) val v2 = DenseVector(0.2, 0.2) val v3 = (v1.asBreeze + v2.asBreeze).fromBreeze -- This message was sent by Atlassian JIRA (v6.3.4#6332)
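The proposed convenience methods can be illustrated with a plain-Scala sketch. This is a hypothetical stand-in class, not Flink's actual DenseVector (which delegates to Breeze):

```scala
// Minimal stand-in for DenseVector showing element-wise addition,
// subtraction, and multiplication by a scalar.
case class Vec(data: Array[Double]) {
  private def zipWith(other: Vec)(f: (Double, Double) => Double): Vec = {
    require(data.length == other.data.length, "dimension mismatch")
    Vec(data.zip(other.data).map { case (a, b) => f(a, b) })
  }
  def +(other: Vec): Vec = zipWith(other)(_ + _)
  def -(other: Vec): Vec = zipWith(other)(_ - _)
  def *(s: Double): Vec = Vec(data.map(_ * s))
}

val v1 = Vec(Array(0.1, 0.1))
val v2 = Vec(Array(0.2, 0.2))
val v3 = v1 + v2 // element-wise sum, no round-trip through Breeze
```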
[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228242#comment-15228242 ] Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 1:33 PM: - That all sounds good. As of now I've only done simple tests to make sure that algorithm is working correctly: e.g. put one test point much closer to a single training point and make sure that is the nearest neighbor. Indeed, more tests and benchmarks are needed -- for example to make sure that even if it doesn't compute the exact nearest neighbors, that the neighbors are "nearby". was (Author: danielblazevski): That all sounds good. As of now I've only done simple tests to make sure that algorithm is working correctly: e.g. put one test point much closer to a single training point and make sure that is the nearest neighbor. Indeed, more tests and benchmarks are needed -- for example to make sure that even it doesn't compute the exact nearest neighbors, that the neighbors are "nearby". > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. Reference [3] > is cited a few times in [1], and gives necessary background on the z-value > approach. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228242#comment-15228242 ] Daniel Blazevski commented on FLINK-1934: - That all sounds good. As of now I've only done simple tests to make sure that the algorithm is working correctly: e.g. put one test point much closer to a single training point and make sure that is the nearest neighbor. Indeed, more tests and benchmarks are needed -- for example to make sure that even if it doesn't compute the exact nearest neighbors, the neighbors are "nearby". > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. Reference [3] > is cited a few times in [1], and gives necessary background on the z-value > approach. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227632#comment-15227632 ] Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:07 AM: - [~chiwanpark] [~till.rohrmann] I have a Flink version -- still a bit preliminary -- of the approximate knn up and running. The exact knn using a quadtree performs quite badly in moderate-to-high spatial dimensions (e.g. with 20,000 test and training points in 6D the quadtree is worse, but no worries, I took care of this and the exact version decides when to use the quadtree or not). https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala A preliminary test shows good scaling as the number of test + training points is increased. 8,000 points in 6D (i.e. 8,000 test points and 8,000 training points): Elapsed time approx: 2s Elapsed time exact: 27s 64,000 in 6D: Elapsed time approx: 6s (didn't run the exact version, we know it's O(N^2)) I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still uses a cross product, which I was hoping to avoid, but I'm not sure if that's possible. Any thoughts? Basically the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full. The full solution is quite complex. They do a bunch of load balancing that I'm still learning, and I'm not quite sure of the payoff. One option could be that I clean up what I have now and optimize, since it's already performing well, and we open a new issue to do all the steps in [1]. There are still many things to clean up, but any cleaning/edge cases will not add computational complexity with respect to the number of test points. e.g. 
I now convert the coordinates to integers and ignore the decimal part and there are now lots of collisions in the z-value hash, normalizing the data and adding a fixed max number of bits to compute the z-value is needed, but will definitely not increase the complexity with respect to adding more test/training points (this is described towards the end of [3]) Any thoughts? was (Author: danielblazevski): [~chiwanpark] [~till.rohrmann] I have a Flink version -- still a bit preliminary -- of the approximate knn up and running. The exact knn using a quadtree performs quite bad in moderate-to-high spatial dimension (e.g 20,000 test and training points in 6D, the quadtree is worse, but no worries I took care of this and the exact decides when to use quadtree or not). https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala A preliminary test shows good scaling with the number when the test + training points are increased. 8,000 points in 6D (i.e. 8,000 test points and 8,000 training points): Elapsed time approx = : 2s Elapsed time exact = : 27s 64,000 in 6D: Elapsed time approx = : 6s (didn't run the exact version, we know it's O(N^2)) I will have to clean things up, add edge cases, etc which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still use a cross product, which I was hoping to avoid, but not sure if that's possible. Any thoughts? Basically the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full. The full solution is quite complex. They do a bunch of load balancing that I'm still learning, and not quite sure of the payoff. One option could be that I clean up what I have now and optimize since it's already performing well, and we open a new issue for to do all the steps in [1]. 
There are still many things to clean up, but any cleaning/edge cases will not add in computational complexity with respect to the number of test points. e.g. I now convert the coordinates to integers and ignore the decimal part and there are now lots of collisions in the z-value hash, normalizing the data and adding a fixed max number of bits to compute the z-value (this is described towards the end of [3]) Any thoughts? > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel
[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227632#comment-15227632 ] Daniel Blazevski edited comment on FLINK-1934 at 4/6/16 3:05 AM: - [~chiwanpark] [~till.rohrmann] I have a Flink version -- still a bit preliminary -- of the approximate knn up and running. The exact knn using a quadtree performs quite badly in moderate-to-high spatial dimensions (e.g. with 20,000 test and training points in 6D the quadtree is worse, but no worries, I took care of this and the exact version decides when to use the quadtree or not). https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala A preliminary test shows good scaling as the number of test + training points is increased. 8,000 points in 6D (i.e. 8,000 test points and 8,000 training points): Elapsed time approx: 2s Elapsed time exact: 27s 64,000 in 6D: Elapsed time approx: 6s (didn't run the exact version, we know it's O(N^2)) I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still uses a cross product, which I was hoping to avoid, but I'm not sure if that's possible. Any thoughts? Basically the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full. The full solution is quite complex. They do a bunch of load balancing that I'm still learning, and I'm not quite sure of the payoff. One option could be that I clean up what I have now and optimize, since it's already performing well, and we open a new issue to do all the steps in [1]. There are still many things to clean up, but any cleaning/edge cases will not add computational complexity with respect to the number of test points. e.g. 
I now convert the coordinates to integers and ignore the decimal part and there are now lots of collisions in the z-value hash, normalizing the data and adding a fixed max number of bits to compute the z-value (this is described towards the end of [3]) Any thoughts? was (Author: danielblazevski): [~chiwanpark] [~till.rohrmann] I have a Flink version -- still a bit preliminary -- of the approximate knn up and running. The exact knn using a quadtree performs quite bad in moderate-to-high spatial dimension (e.g 20,000 test and training points in 6D, the quadtree is worse, but no worries I took care of this and the exact decides when to use quadtree or not). https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala A preliminary test shows good scaling with the number when the test + training points are increased. 8,000 points (i.e. 8,000 test points and 8,000 training points): Elapsed time approx = : 2s Elapsed time exact = : 27s 64,000: Elapsed time approx = : 6s (didn't run the exact version, we know it's O(N^2)) I will have to clean things up, add edge cases, etc which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still use a cross product, which I was hoping to avoid, but not sure if that's possible. Any thoughts? Basically the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full. The full solution is quite complex. They do a bunch of load balancing that I'm still learning, and not quite sure of the payoff. One option could be that I clean up what I have now and optimize since it's already performing well, and we open a new issue for to do all the steps in [1]. 
There are still many things to clean up, but any cleaning/edge cases will not add in computational complexity with respect to the number of test points. e.g. I now convert the coordinates to integers and ignore the decimal part and there are now lots of collisions in the z-value hash, normalizing the data and adding a fixed max number of bits to compute the z-value (this is described towards the end of [3]) Any thoughts? > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML > > kNN is still a widely used algorithm for classification and regression. >
[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227632#comment-15227632 ] Daniel Blazevski commented on FLINK-1934: - [~chiwanpark] [~till.rohrmann] I have a Flink version -- still a bit preliminary -- of the approximate knn up and running. The exact knn using a quadtree performs quite badly in moderate-to-high spatial dimensions (e.g. with 20,000 test and training points in 6D the quadtree is worse, but no worries, I took care of this and the exact version decides when to use the quadtree or not). https://github.com/danielblazevski/flink/blob/FLINK-1934/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn/zknn.scala A preliminary test shows good scaling as the number of test + training points is increased. 8,000 points (i.e. 8,000 test points and 8,000 training points): Elapsed time approx: 2s Elapsed time exact: 27s 64,000: Elapsed time approx: 6s (didn't run the exact version, we know it's O(N^2)) I will have to clean things up, add edge cases, etc., which may slow down the run-time a bit, but will definitely not increase the complexity of the algorithm with respect to the number of test/training points. This still uses a cross product, which I was hoping to avoid, but I'm not sure if that's possible. Any thoughts? Basically the idea is to hash the test/train set to 1D (I use the z-value hash based on [1]). I still have not implemented the ideas in [1] in full. The full solution is quite complex. They do a bunch of load balancing that I'm still learning, and I'm not quite sure of the payoff. One option could be that I clean up what I have now and optimize, since it's already performing well, and we open a new issue to do all the steps in [1]. There are still many things to clean up, but any cleaning/edge cases will not add computational complexity with respect to the number of test points. e.g. 
I now convert the coordinates to integers and ignore the decimal part and there are now lots of collisions in the z-value hash, normalizing the data and adding a fixed max number of bits to compute the z-value (this is described towards the end of [3]) Any thoughts? > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. Reference [3] > is cited a few times in [1], and gives necessary background on the z-value > approach. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
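The z-value hash discussed in this thread is a Morton code: the bits of each coordinate are interleaved so that points close in space tend to land close together in the resulting 1-D order. A simplified Scala sketch, assuming coordinates have already been normalized to non-negative integers using a fixed number of bits (the normalization step the comment says is still needed):

```scala
// Z-value (Morton code) of a d-dimensional point: interleave the bits of
// the coordinates into a single Long. Coordinate i's bit b lands at
// position b*d + i of the result. Requires bits * coords.length <= 63.
def zValue(coords: Array[Long], bits: Int): Long = {
  val d = coords.length
  var z = 0L
  for (b <- 0 until bits; i <- 0 until d) {
    val bit = (coords(i) >> b) & 1L
    z |= bit << (b * d + i)
  }
  z
}
```

Because z-ordering only approximates spatial locality, [1] repeats the hash over several randomly shifted copies of the data; this sketch shows a single hash only.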
[jira] [Comment Edited] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985273#comment-14985273 ] Daniel Blazevski edited comment on FLINK-1934 at 11/2/15 2:20 PM: -- [~till.rohrmann] , I have been looking into approximate knn algorithms. I found some more details about Ref [1] in Ref. [3], which I added in the Description above. Ref. [3] is cited several times in Ref. [1], and [3] mentions that the method of z-functions is best for dimensions < 10, and that locality sensitive hashing (LSH) is better for dimensions ~ 30 and above. How about I give Ref [1] a shot, and maybe later down the road add LSH? If so, feel free to assign me to this issue. was (Author: danielblazevski): [~till.rohrmann] , I have been looking to approximate knn algorithms. I found here in Ref. [3] that I added in the Description above. Ref. [3] is cited several times in Ref. [1], and [3] mentions that the method of z-functions is best for dimensions < 10, and that locality sensitive hashing (LSH) is better for dimensions ~ 30 and above. How about I give Ref [1] a shot, and maybe later down the road add LSH? If so, feel free to assign me to this issue. > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Raghav Chalapathy > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. Reference [3] > is cited a few times in [1], and gives necessary background on the z-value > approach. 
> Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Blazevski updated FLINK-1934: Description: kNN is still a widely used algorithm for classification and regression. However, due to the computational costs of an exact implementation, it does not scale well to large amounts of data. Therefore, it is worthwhile to also add an approximative kNN implementation as proposed in [1,2]. Resources: [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf was: kNN is still a widely used algorithm for classification and regression. However, due to the computational costs of an exact implementation, it does not scale well to large amounts of data. Therefore, it is worthwhile to also add an approximative kNN implementation as proposed in [1,2]. Resources: [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Raghav Chalapathy > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf > [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902478#comment-14902478 ] Daniel Blazevski commented on FLINK-1745: - [~till.rohrmann], I should have mentioned that I took this 2D quadtree: http://www.cs.trinity.edu/~mlewis/CSCI1321-F11/Code/src/util/Quadtree.scala and used that as a guide to form an n-dimensional quadtree -- some generalizations were straightforward, others were a bit tricky/fun (like partitioning an n-dim box into equal sub-boxes) > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
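The n-dimensional box split mentioned above can be sketched in a few lines. This is an illustrative Python sketch of the general idea, not the Scala quadtree under discussion; the function name `subdivide` and the center/half-width box representation are choices made for the example.

```python
from itertools import product

def subdivide(center, half_width):
    """Split an n-dim axis-aligned box (given by its center and half-width
    per side) into 2**n equal sub-boxes, returning each child's center."""
    n = len(center)
    children = []
    # one child per sign combination, e.g. 4 quadrants in 2D, 8 octants in 3D
    for signs in product((-1, 1), repeat=n):
        children.append([c + s * half_width / 2 for c, s in zip(center, signs)])
    return children

# A 2D box centered at the origin with half-width 1 splits into 4 quadrants:
print(subdivide([0.0, 0.0], 1.0))
# → [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]]
```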
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901837#comment-14901837 ] Daniel Blazevski commented on FLINK-1745: - Hi [~chiwanpark] and [~till.rohrmann], I have implemented a quadtree data structure to put into the KNN algorithm -- an R-tree version may follow. I will take each block of the training DataSet in the existing KNN and make a quadtree before finding the KNN on each block. The quadtree class that I have is currently a prototype -- using only Arrays/ListBuffer[Double] types and not yet any Flink data structures -- but before I transfer it to Flink, let me know if you'd like the quadtree class put in a specific directory -- otherwise I will just put it in the same directory as KNN.scala (or maybe create a sub-directory there). Also, let me know if there is anything specific you'd like to see done to the quadtree when porting it from basic Scala code to Flink; otherwise I will look at other Flink code as a guide. I have made a simple test of the quadtree, which I can also expand upon and include if you'd like. Cheers, Dan > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
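To illustrate the idea of building a spatial index per block and then answering kNN queries against it, here is a toy 2D point-region quadtree with a best-first kNN search. This is a minimal Python sketch of the general technique, not the Scala quadtree class discussed here; the class name, leaf capacity, and center/half-width box representation are all invented for the example.

```python
import heapq

class QuadTree:
    """Toy 2D point-region quadtree: leaves hold up to `capacity` points."""

    def __init__(self, cx, cy, half, capacity=4):
        self.cx, self.cy, self.half = cx, cy, half   # box center and half-width
        self.capacity = capacity
        self.points = []
        self.children = None  # four sub-trees once this node splits

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity and self.half > 1e-9:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        h = self.half / 2
        # child order by (sx, sy): (-1,-1), (-1,+1), (+1,-1), (+1,+1)
        self.children = [QuadTree(self.cx + sx * h, self.cy + sy * h, h, self.capacity)
                         for sx in (-1, 1) for sy in (-1, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        # index matches the child order above
        return self.children[(2 if p[0] >= self.cx else 0) + (1 if p[1] >= self.cy else 0)]

    def _box_dist2(self, q):
        # squared distance from q to this node's box (0 if q lies inside it)
        dx = max(abs(q[0] - self.cx) - self.half, 0.0)
        dy = max(abs(q[1] - self.cy) - self.half, 0.0)
        return dx * dx + dy * dy

    def knn(self, q, k):
        """Best-first search: visit nodes in order of box distance to q,
        stopping once the nearest unexplored box is farther than the k-th hit."""
        best = []   # max-heap via negated squared distances, size <= k
        frontier = [(0.0, id(self), self)]
        while frontier:
            d2, _, node = heapq.heappop(frontier)
            if len(best) == k and d2 > -best[0][0]:
                break  # no closer point can exist in the remaining boxes
            if node.children is None:
                for p in node.points:
                    pd2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                    heapq.heappush(best, (-pd2, p))
                    if len(best) > k:
                        heapq.heappop(best)
            else:
                for c in node.children:
                    heapq.heappush(frontier, (c._box_dist2(q), id(c), c))
        return [p for _, p in sorted(best, reverse=True)]
```

The box-distance pruning is what makes the tree pay off over a per-block brute-force scan: whole sub-boxes are skipped once k candidates closer than the box can be guaranteed.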
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746552#comment-14746552 ] Daniel Blazevski commented on FLINK-1745: - Thanks! I pulled and built Chiwan's branch, and made some very minor changes to his test to familiarize myself with DataSets. Dan > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Daniel Blazevski > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740627#comment-14740627 ] Daniel Blazevski commented on FLINK-1934: - Hi [~till.rohrmann], Flink seems like a neat project, looking forward to contributing! I had indeed looked into H-zkNNJ in Ref [1]. As I just mentioned on the exact kNN issue, I think it would be wiser for me to try an R-tree implementation of the exact kNN first as my first Flink contribution -- so in the meantime, how about we don't assign this to me quite yet. > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Raghav Chalapathy > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
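For context, the z-value approach behind H-zkNNJ maps points onto a space-filling (Morton) curve so that approximate neighbours can be found by scanning a small window of a z-sorted list. Below is a minimal single-machine Python sketch of the idea, assuming non-negative integer coordinates; the actual algorithm in [1] also repeats the scan over several randomly shifted copies of the data to improve accuracy, which is omitted here, and all function names are invented for the example.

```python
import bisect

def z_value(point, bits=16):
    """Interleave the bits of the integer coordinates into one Morton code."""
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(point):
            z |= ((coord >> bit) & 1) << (bit * len(point) + dim)
    return z

def approx_knn(query, data, k, bits=16):
    """Sort points by z-value, take the ~2k z-order neighbours of the query
    as candidates, and rank those few candidates by true distance."""
    keyed = sorted((z_value(p, bits), p) for p in data)
    zs = [z for z, _ in keyed]
    i = bisect.bisect_left(zs, z_value(query, bits))
    lo, hi = max(0, i - k), min(len(keyed), i + k)
    candidates = [p for _, p in keyed[lo:hi]]
    candidates.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
    return candidates[:k]
```

Because the Morton code mostly preserves spatial locality, points adjacent in z-order are usually (though not always) close in space, which is what makes the result approximate rather than exact.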
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740584#comment-14740584 ] Daniel Blazevski commented on FLINK-1745: - [~till.rohrmann], Nice, I am going to give the R-tree implementation a shot. As this is my first Flink contribution, I'm going to attempt this first before the approximate kNN (about to post another message there now) > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740076#comment-14740076 ] Daniel Blazevski edited comment on FLINK-1745 at 9/11/15 3:07 AM: -- In fact, if what [~chiwanpark] has is thought to be a good base for a kNN algorithm, one option could be that I help make it more efficient by modifying what he has and incorporating an R-tree ( https://en.wikipedia.org/wiki/R-tree )? was (Author: danielblazevski): In fact, if what [~chiwanpark] has is thought to be a good base for a kNN algorithm, one option could be that I help to make it more efficient by modifying what he has and incorporate an R-tree (https://en.wikipedia.org/wiki/R-tree)? > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740076#comment-14740076 ] Daniel Blazevski commented on FLINK-1745: - In fact, if what [~chiwanpark] has is thought to be a good base for a kNN algorithm, one option could be that I help make it more efficient by modifying what he has and incorporating an R-tree (https://en.wikipedia.org/wiki/R-tree)? > Add exact k-nearest-neighbours algorithm to machine learning library > > > Key: FLINK-1745 > URL: https://issues.apache.org/jira/browse/FLINK-1745 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann > Labels: ML, Starter > > Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial > it is still used as a mean to classify data and to do regression. This issue > focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as > proposed in [2]. > Could be a starter task. > Resources: > [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] > [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
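For reference, the H-BNLJ scheme named in this issue is essentially a block nested-loop join: split the training set into blocks, compute each query's k nearest neighbours within every block (the map side), then merge the per-block candidate lists into a global top-k (the reduce side). A minimal single-machine Python sketch of that structure follows; the function and parameter names are invented for the example, and in Flink the blocks would be DataSet partitions rather than list slices.

```python
from heapq import nsmallest

def h_bnlj_knn(test_points, train_points, k, num_blocks=3):
    """Exact kNN in the H-BNLJ style: local top-k per (query, block) pair,
    then a global merge of the partial candidate lists."""
    # "Map" side: split the training data into blocks
    blocks = [train_points[i::num_blocks] for i in range(num_blocks)]
    partial = {q: [] for q in test_points}
    for block in blocks:
        for q in partial:
            # brute-force squared distances within this block, keep local top-k
            dists = [(sum((a - b) ** 2 for a, b in zip(q, p)), p) for p in block]
            partial[q].extend(nsmallest(k, dists))
    # "Reduce" side: merge the per-block candidates into the global top-k
    return {q: [p for _, p in nsmallest(k, cands)] for q, cands in partial.items()}
```

The result is exact because every training point is compared against every query in some block; the spatial-index variants discussed above (quadtree, R-tree) aim to cut down the brute-force inner loop.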
[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739437#comment-14739437 ] Daniel Blazevski commented on FLINK-1934: - Hello, I have been thinking about a distributed, scalable implementation of kNN and would like to know the current progress of anyone working on it before diving into the project. > Add approximative k-nearest-neighbours (kNN) algorithm to machine learning > library > -- > > Key: FLINK-1934 > URL: https://issues.apache.org/jira/browse/FLINK-1934 > Project: Flink > Issue Type: New Feature > Components: Machine Learning Library >Reporter: Till Rohrmann >Assignee: Raghav Chalapathy > Labels: ML > > kNN is still a widely used algorithm for classification and regression. > However, due to the computational costs of an exact implementation, it does > not scale well to large amounts of data. Therefore, it is worthwhile to also > add an approximative kNN implementation as proposed in [1,2]. > Resources: > [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf > [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)