[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893508#comment-15893508 ] Yun Ni commented on SPARK-19771: [~merlin] What you are suggesting is to hash each AND hash vector into a single integer, which I don't think makes sense. It does little to improve running time, since SparkSQL does a hash join and the number of vector comparisons is already close to minimal. It reduces the memory cost of each transformed row from O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of an increased false positive rate, especially when NumHashFunctions is large. From a user experience perspective, hiding the actual hash values from users is bad practice, because users need to run their own algorithms on top of the hash values. Besides that, we expect users to increase the number of hash functions when they want to lower the false positive rate; hashing the vector would raise the false positive rate again, which users would not expect. > Support OR-AND amplification in Locality Sensitive Hashing (LSH) > > > Key: SPARK-19771 > URL: https://issues.apache.org/jira/browse/SPARK-19771 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Yun Ni > > The current LSH implementation only supports AND-OR amplification. We need to > discuss the following questions before we go to implementation: > (1) Whether we should support OR-AND amplification > (2) What API changes we need for OR-AND amplification > (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
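For context, here is a minimal plain-Python sketch (my illustration, not Spark's implementation) of why the per-row hash output is O(NumHashFunctions*NumHashTables): with AND-OR amplification, each of the L hash tables keeps K MinHash values, and two items are candidates only if all K values agree in at least one table. Collapsing each K-vector to one integer would keep only L values per row, but distinct K-vectors can then collide, which is the false-positive increase discussed above. The constants K, L, and PRIME below are arbitrary choices for the sketch.

```python
import random

random.seed(42)

# Illustrative only: MinHash over sets of integer element ids, with
# AND-OR amplification (AND within a table, OR across tables).
K, L = 3, 4          # NumHashFunctions per table, NumHashTables
PRIME = 2038074743   # a large prime; base hash is h(x) = (a*x + b) % PRIME

coeffs = [[(random.randrange(1, PRIME), random.randrange(PRIME))
           for _ in range(K)] for _ in range(L)]

def signature(elems):
    """Return L vectors of K MinHash values each (the 'Array of Vector' output)."""
    return [[min((a * e + b) % PRIME for e in elems) for a, b in table]
            for table in coeffs]

def is_candidate(sig1, sig2):
    # OR across tables, AND across the K values inside each table.
    return any(t1 == t2 for t1, t2 in zip(sig1, sig2))

a = signature({1, 2, 3, 4, 5, 6})
b = signature({1, 2, 3, 4, 5, 7})   # Jaccard similarity ~0.71 with a
print(is_candidate(a, b))
```

Keeping the full K-by-L structure in the transformed row is what lets `is_candidate` (and user code) apply the AND step exactly, per-table, instead of on lossy integer digests.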
[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097 ] Yun Ni edited comment on SPARK-19771 at 3/2/17 9:55 PM: [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I am not sure what N refers to in the memory overhead. (2) The hash values are not necessarily 0, 1, -1. (3) If we really want a hash function of Vector, why not use Vector.hashCode? was (Author: yunn): [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I am not sure what N refers to in the memory overhead. (2) The hash values are not necessarily {0, 1, -1}. (3) If we really want a hash function of Vector, why not use Vector.hashCode?
[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
[ https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097 ] Yun Ni commented on SPARK-19771: [~merlin] (1) The computation cost is NumHashFunctions because we go through each index only once. I am not sure what N refers to in the memory overhead. (2) The hash values are not necessarily {0, 1, -1}. (3) If we really want a hash function of Vector, why not use Vector.hashCode?
[jira] [Created] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)
Yun Ni created SPARK-19771: -- Summary: Support OR-AND amplification in Locality Sensitive Hashing (LSH) Key: SPARK-19771 URL: https://issues.apache.org/jira/browse/SPARK-19771 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Yun Ni The current LSH implementation only supports AND-OR amplification. We need to discuss the following questions before we go to implementation: (1) Whether we should support OR-AND amplification (2) What API changes we need for OR-AND amplification (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.
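To make the AND-OR vs OR-AND distinction concrete, the standard textbook amplification formulas can be sketched as below (this is background math, not Spark code). Assume a base LSH family whose single-hash collision probability for a given pair is p (for MinHash, p is the pair's Jaccard similarity), with K = NumHashFunctions and L = NumHashTables.

```python
# AND-OR: AND over K hash functions inside each table, OR over L tables.
def and_or(p: float, k: int, l: int) -> float:
    return 1.0 - (1.0 - p ** k) ** l

# OR-AND: OR over L hash functions inside each band, AND over K bands.
def or_and(p: float, k: int, l: int) -> float:
    return (1.0 - (1.0 - p) ** l) ** k

# AND-OR pushes the collision probability toward 0 for dissimilar pairs
# (low p); OR-AND pushes it toward 1 for similar pairs (high p).
for p in (0.2, 0.8):
    print(p, round(and_or(p, 4, 8), 4), round(or_and(p, 4, 8), 4))
```

Supporting both schemes therefore changes which side of the precision/recall trade-off users can tune, which is why question (2) above (API changes) matters.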
[jira] [Comment Edited] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874956#comment-15874956 ] Yun Ni edited comment on SPARK-18454 at 2/21/17 9:39 PM: - [~josephkb] [~sethah] [~mlnick] [~karlhigley] I opened a gDoc for discussion. Feel free to comment and add new sections: https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit was (Author: yunn): [~josephkb] [~sethah] [~mlnick] I opened a gDoc for discussion. Feel free to comment and add new sections: https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit > Changes to improve Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > a full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What the issues are and how we should change the current Nearest Neighbor > implementation
[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874956#comment-15874956 ] Yun Ni commented on SPARK-18454: [~josephkb] [~sethah] [~mlnick] I opened a gDoc for discussion. Feel free to comment and add new sections: https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit
[jira] [Resolved] (SPARK-18286) Add Scala/Java/Python examples for MinHash and RandomProjection
[ https://issues.apache.org/jira/browse/SPARK-18286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni resolved SPARK-18286. Resolution: Fixed Fix Version/s: 2.2.0 > Add Scala/Java/Python examples for MinHash and RandomProjection > --- > > Key: SPARK-18286 > URL: https://issues.apache.org/jira/browse/SPARK-18286 > Project: Spark > Issue Type: Improvement > Components: Examples, ML >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.2.0 > > > Add Scala/Java/Python examples for MinHash and RandomProjection
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867366#comment-15867366 ] Yun Ni commented on SPARK-18392: I agree with Seth. We need to first finish SPARK-18080 and SPARK-18450 before implementing new LSH subclasses. (SPARK-18454 could be a prerequisite as well but I am not sure) [~merlin] Once the prerequisites are satisfied, you can take either BitSampling or SignRandomProjection and I can take the other one. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. 
> ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
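The hashDistance item above (average of indicators of hash collisions, per SPARK-18334) can be sketched in a few lines of plain Python; this is my reading of the proposal, not the merged Spark code. For MinHash, the fraction of positions on which two signatures collide is the "variant with many hash functions" estimator of Jaccard similarity from the Wikipedia reference, so one minus that fraction estimates the Jaccard distance.

```python
# Hypothetical sketch: hashDistance as 1 minus the average collision
# indicator over the positions of two MinHash signatures.
def hash_distance(sig_x, sig_y):
    if len(sig_x) != len(sig_y):
        raise ValueError("signatures must have equal length")
    matches = sum(1 for x, y in zip(sig_x, sig_y) if x == y)
    return 1.0 - matches / len(sig_x)

print(hash_distance([1, 2, 3, 4], [1, 2, 3, 9]))  # -> 0.25 (3 of 4 positions collide)
```

Unlike a minimum-over-tables distance, this average degrades smoothly as signatures diverge, which makes it usable as a ranking key in approxNearestNeighbors.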
[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487 ] Yun Ni edited comment on SPARK-18392 at 1/20/17 10:29 PM: -- Yes, checking whether the hash signatures are equal is faster than computing the distance between the key and every instance. But you are right: it would be better to compare against fewer records. I will think about it further and see if there is a better solution. was (Author: yunn): Yes, checking whether the hash signatures are equal is faster than computing the distance between the key and every instance.
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487 ] Yun Ni commented on SPARK-18392: Yes, checking whether the hash signatures are equal is faster than computing the distance between the key and every instance.
[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404 ] Yun Ni edited comment on SPARK-18392 at 1/20/17 9:15 PM: - Hi David, Thanks for the question. I did not group the records by their hash signatures for 2 reasons: (1) A large number of instances could have the same hash signature. In that case, the group-by operation could cause shuffle spill. See http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes (2) Since Spark Datasets are lazily evaluated, tagging every instance with its hash signature and comparing with every instance will run in the same stage. (Even more steps may run in the same stage.) Thus it should not hurt the running time. If you see any performance issues, please let us know. was (Author: yunn): Hi David, Thanks for the question. I did not group the records by their hash signatures for 2 reasons: (1) A large number of instances could have the same hash signature. In that case, the group-by operation could cause OOM or spill. See http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes (2) Since Spark Datasets are lazily evaluated, tagging every instance with its hash signature and comparing with every instance will run in the same stage. (Even more steps may run in the same stage.) Thus it should not hurt the running time. If you see any performance issues, please let us know.
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404 ] Yun Ni commented on SPARK-18392: Hi David, Thanks for the question. I did not group the records by their hash signatures for 2 reasons: (1) A large number of instances could have the same hash signature. In that case, the group-by operations could cause OOM or spill. See http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes (2) Since Spark Datasets are lazy evaluated, tagging every instance with hash signature and comparing with every instance will run in the same staging. (Even more steps may run the same stage.) Thus it should not hurt the running time. If you see any performance issues, please let us know. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. 
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. 
> * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
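The AND/OR terminology in the description above can be illustrated with a minimal, self-contained sketch. This is plain Python, not Spark's implementation; the hash family, bucket count, and all names here are hypothetical, chosen only to show how K base hash functions form a compound hash function {{g}} (AND-amplification) and L compound functions form a hash table {{H}} (OR-amplification):

```python
import random

PRIME = 2038074743  # arbitrary large prime for the toy hash family

def make_base_hash(rng):
    """h_i: a random base hash function mapping an integer key to a bucket id."""
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME % 1000

def make_compound_hash(rng, k):
    """g = (h_0, ..., h_{K-1}): AND-amplification; g(key) is vector-valued."""
    hs = [make_base_hash(rng) for _ in range(k)]
    return lambda x: tuple(h(x) for h in hs)

def is_candidate_pair(x, y, tables):
    """OR-amplification over H = (g_0, ..., g_{L-1}): x and y are candidate
    neighbors if they collide on *all* K values of *any* single g_j."""
    return any(g(x) == g(y) for g in tables)

rng = random.Random(42)
K, L = 3, 5
tables = [make_compound_hash(rng, K) for _ in range(L)]
print(is_candidate_pair(7, 7, tables))  # identical keys always collide -> True
```

Raising K makes each {{g_j}} stricter (fewer false positives); raising L gives more chances to collide (fewer false negatives), which is the trade-off the tickets below discuss.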
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the original paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current MultiProbe implementation to make it align with the MultiProbe NN Search from the original paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current Nearest Neighbor implementation to make > it align with the MultiProbe NN Search from the original paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
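The agreed improvement above — picking the {{hashDistance}} threshold from an approximate quantile rather than fully sorting the dataset — can be sketched in plain Python. This is an illustrative stand-in, not Spark's `approxQuantile`; the sample size, seed, and toy distances are made-up assumptions:

```python
import random

def approx_threshold(distances, k, sample_size=100, seed=0):
    """Estimate a distance threshold for the k nearest candidates by sorting
    only a random sample, instead of fully sorting the whole dataset."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(distances, min(sample_size, len(distances))))
    q = k / len(distances)                        # target quantile
    idx = min(int(q * len(sample)), len(sample) - 1)
    return sample[idx]

distances = [i / 1000.0 for i in range(1000)]     # toy hashDistance values
t = approx_threshold(distances, k=50)
candidates = [d for d in distances if d <= t]     # filter, then rank only these
print(0 < len(candidates) < len(distances))       # prints: True
```

The full sort costs O(n log n) over the whole dataset, while sorting a fixed-size sample is effectively constant; the price is that the filtered candidate set is only approximately of size k.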
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) What are the issues and how we should change the current Nearest Neighbor implementation was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Summary: Changes to fix Nearest Neighbor Search for LSH (was: Changes to fix Multi-Probe Nearest Neighbor Search for LSH) > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current MultiProbe implementation to make it > align with the MultiProbe NN Search from the original paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668372#comment-15668372 ] Yun Ni commented on SPARK-18392: Reorganized the ticket dependencies a bit. PTAL. > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. 
> * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. 
> Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18454) Changes to fix Multi-Probe Nearest Neighbor Search for LSH
Yun Ni created SPARK-18454: -- Summary: Changes to fix Multi-Probe Nearest Neighbor Search for LSH Key: SPARK-18454 URL: https://issues.apache.org/jira/browse/SPARK-18454 Project: Spark Issue Type: Improvement Reporter: Yun Ni We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current MultiProbe implementation to make it align with the MultiProbe NN Search from the original paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing
Yun Ni created SPARK-18450: -- Summary: Add AND-amplification to Locality Sensitive Hashing Key: SPARK-18450 URL: https://issues.apache.org/jira/browse/SPARK-18450 Project: Spark Issue Type: Improvement Reporter: Yun Ni We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. Once the change is applied, we can add AND-amplification to the LSH implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18408: --- Description: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private - Make Multi-Probe NN Search and {{hashDistance}} private for future discussion was: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private > - Make Multi-Probe NN Search and {{hashDistance}} private for future > discussion -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18408: --- Description: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private was: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use {{numHashTables}} as the dimension of {{Array}} and > {{numHashFunctions}} as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18408) API Improvements for LSH
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18408: --- Description: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to {{Array of Vector}} instead of {{Vectors}} - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} as the dimension of {{Vector}} - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} to {{MinHashLSH}} - Make randUnitVectors/randCoefficients private was: As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to `Array of Vector` instead of `Vectors` - Use `numHashTables` as the dimension of `Array` and `numHashFunctions` as the dimension of `Vector` - Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH` - Make randUnitVectors/randCoefficients private > API Improvements for LSH > > > Key: SPARK-18408 > URL: https://issues.apache.org/jira/browse/SPARK-18408 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > As the first improvements to current LSH Implementations, we are planning to > do the followings: > - Change output schema to {{Array of Vector}} instead of {{Vectors}} > - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} > as the dimension of {{Vector}} > - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, > {{MinHash}} to {{MinHashLSH}} > - Make randUnitVectors/randCoefficients private -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18408) API Improvements for LSH
Yun Ni created SPARK-18408: -- Summary: API Improvements for LSH Key: SPARK-18408 URL: https://issues.apache.org/jira/browse/SPARK-18408 Project: Spark Issue Type: Improvement Reporter: Yun Ni As the first improvements to current LSH Implementations, we are planning to do the followings: - Change output schema to `Array of Vector` instead of `Vectors` - Use `numHashTables` as the dimension of `Array` and `numHashFunctions` as the dimension of `Vector` - Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH` - Make randUnitVectors/randCoefficients private -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
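The proposed output schema — an `Array of Vector` whose outer length is `numHashTables` and whose inner length is `numHashFunctions` — can be pictured with a toy sketch. Plain Python, purely illustrative; the hash family below is an invented stand-in, not Spark's transform:

```python
import random

def transform(key, num_hash_tables, num_hash_functions, seed=0):
    """Toy transform returning an 'Array of Vector': the outer list indexes
    hash tables (OR-amplification); each inner list holds the outputs of the
    AND-amplified base hash functions for one table."""
    rng = random.Random(seed)
    return [
        [(rng.randrange(1, 97) * key + rng.randrange(97)) % 97
         for _ in range(num_hash_functions)]
        for _ in range(num_hash_tables)
    ]

hashes = transform(key=12345, num_hash_tables=3, num_hash_functions=2)
print(len(hashes), len(hashes[0]))  # prints: 3 2
```

Keeping the raw per-function values (rather than collapsing each inner vector to a single integer) is what lets callers implement either amplification scheme, or their own algorithms, on top of the transform output.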
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654822#comment-15654822 ] Yun Ni commented on SPARK-18392: I think your summary is very good in general. For class names, I think `BucketedRandomProjectionLSH` and `SignRandomProjectionLSH` look intuitive and won't cause confusion. If we are breaking into subtasks, let us prioritize the output schema first because it will be a big change that affects most parts of the current implementation. If there is no objection, shall I open a subtask and make a PR for that? > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. 
This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. > ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506 ] Yun Ni edited comment on SPARK-18334 at 11/7/16 9:25 PM: - [~sethah] [~josephkb] Will send a PR to fix shortly. was (Author: yunn): [~sethah][~josephkb] > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Priority: Trivial > > MinHash currently is using the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is not relevant to the absolute distance of their hash > buckets indices. > This bug could affect accuracy of multi probing NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506 ] Yun Ni edited comment on SPARK-18334 at 11/7/16 9:25 PM: - [~sethah] [~josephkb] I will send a PR to fix shortly. was (Author: yunn): [~sethah] [~josephkb] Will send a PR to fix shortly. > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Priority: Trivial > > MinHash currently is using the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is not relevant to the absolute distance of their hash > buckets indices. > This bug could affect accuracy of multi probing NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18334) MinHash should use binary hash distance
[ https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506 ] Yun Ni commented on SPARK-18334: [~sethah][~josephkb] > MinHash should use binary hash distance > --- > > Key: SPARK-18334 > URL: https://issues.apache.org/jira/browse/SPARK-18334 > Project: Spark > Issue Type: Bug >Reporter: Yun Ni >Priority: Trivial > > MinHash currently is using the same `hashDistance` function as > RandomProjection. This does not make sense for MinHash because the Jaccard > distance of two sets is not relevant to the absolute distance of their hash > buckets indices. > This bug could affect accuracy of multi probing NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18334) MinHash should use binary hash distance
Yun Ni created SPARK-18334: -- Summary: MinHash should use binary hash distance Key: SPARK-18334 URL: https://issues.apache.org/jira/browse/SPARK-18334 Project: Spark Issue Type: Bug Reporter: Yun Ni Priority: Trivial MinHash currently is using the same `hashDistance` function as RandomProjection. This does not make sense for MinHash because the Jaccard distance of two sets is not relevant to the absolute distance of their hash buckets indices. This bug could affect accuracy of multi probing NN search for MinHash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
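The distinction behind this bug can be sketched in plain Python (illustrative only; the signatures below are made up): for MinHash, distance between two signatures should be driven by collisions, i.e. exact equality of hash values, because the fraction of colliding entries estimates the Jaccard similarity — how numerically close two bucket indices are carries no information.

```python
def binary_hash_distance(u, v):
    """Proposed MinHash hashDistance: fraction of non-colliding entries.
    (collisions / length) estimates Jaccard similarity, so this estimates
    1 - Jaccard similarity."""
    collisions = sum(1 for a, b in zip(u, v) if a == b)
    return 1.0 - collisions / len(u)

# Two toy MinHash signatures: entries 0 and 2 collide, entries 1 and 3 do not.
sig_a = [3, 17, 5, 9]
sig_b = [3, 42, 5, 8]
print(binary_hash_distance(sig_a, sig_b))  # 2 of 4 entries collide -> 0.5

# An absolute-difference distance would treat 9 vs 8 as "close", even though
# two distinct MinHash values say nothing about the sets' similarity.
```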
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637493#comment-15637493 ] Yun Ni commented on SPARK-18081: This is super helpful. Thanks! > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637102#comment-15637102 ] Yun Ni commented on SPARK-18081: Sorry, I was really overloaded this week. I will try my best to send a PR before EoW. BTW, when I was writing ml-lsh.md, I could not find a place to get a preview. Do you have any guidelines for previewing the Spark docs? > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581416#comment-15581416 ] Yun Ni commented on SPARK-5992: --- Yes, I have implemented cosine, Jaccard, Euclidean, and Hamming distances. Please check: https://github.com/apache/spark/pull/15148 > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581412#comment-15581412 ] Yun Ni commented on SPARK-5992: --- Yes, I do have a comparison with a full scan. Usually kNN with one hash function does not work well, so I enabled OR-amplification with multiple hash functions. You can set the number of hash functions using "outputDim". > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581345#comment-15581345 ] Yun Ni commented on SPARK-5992: --- Some tests on the WEX Open Dataset, with steps and results: https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502493#comment-15502493 ] Yun Ni commented on SPARK-5992:
---
Hi Joseph,

I have made an initial PR based on the design doc: https://github.com/apache/spark/pull/15148

Please take a look, and I will test on the open dataset in the meantime.

Thanks very much,
Yun Ni
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478135#comment-15478135 ] Yun Ni commented on SPARK-5992:
---
Thank you very much for reviewing it, Joseph! I will work on the first commit according to the doc and send a PR soon. :)
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453980#comment-15453980 ] Yun Ni commented on SPARK-5992:
---
Thanks very much for reviewing, Joseph! Based on your comments, I have made the following major changes:

(1) Changed the API design to use Estimator instead of UnaryTransformer.
(2) Added a Testing section to elaborate on the design of the unit tests and scale tests.

Please let me know if you have any further comments, so that we can move forward.
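The Estimator-instead-of-UnaryTransformer change above follows the ML Pipelines pattern: an Estimator's fit() produces a fitted Model, which is the actual transformer. A minimal sketch of that pattern in plain Python (not the Spark API; class and parameter names here are hypothetical, chosen only to illustrate why LSH fits the Estimator shape — the random hash family is "fitted" state):

```python
import random

class RandomHyperplaneLSH:
    """Estimator: holds parameters only; fit() draws the random hash family."""
    def __init__(self, input_dim, output_dim, seed=0):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.seed = seed

    def fit(self, dataset):
        # dataset is unused in this sketch; a real estimator may inspect it.
        rng = random.Random(self.seed)
        planes = [[rng.gauss(0.0, 1.0) for _ in range(self.input_dim)]
                  for _ in range(self.output_dim)]
        return RandomHyperplaneLSHModel(planes)

class RandomHyperplaneLSHModel:
    """Model: the fitted transformer produced by the estimator."""
    def __init__(self, planes):
        self.planes = planes

    def transform(self, rows):
        # Append the hash signature to each row, the way a transformer
        # would add an output column to a DataFrame.
        return [(x, [1 if sum(p * v for p, v in zip(plane, x)) >= 0 else 0
                     for plane in self.planes])
                for x in rows]
```

The design point: a UnaryTransformer has no fit() step, so the randomly drawn hyperplanes would have to live in the transformer itself; modeling LSH as an Estimator makes the sampled hash family explicit fitted state that can be saved and reused.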
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186 ] Yun Ni edited comment on SPARK-5992 at 8/23/16 5:50 AM:
---
Hi,

We are engineers from Uber. Here is our design doc for LSH: https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

Thanks,
Yun Ni