[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893508#comment-15893508
 ] 

Yun Ni commented on SPARK-19771:


[~merlin] What you are suggesting is to hash each AND hash vector into a single 
integer, which I don't think makes sense. It does little to improve running 
time, since SparkSQL does a hash join and the number of vector comparisons is 
already close to minimal. It reduces the memory cost of each transformed row 
from O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of 
a higher false positive rate, especially when NumHashFunctions is large.

From a user experience perspective, hiding the actual hash values from users 
is bad practice, because users need to run their own algorithms on top of the 
hash values. Besides that, we expect users to increase the number of hash 
functions when they want to lower the false positive rate. Hashing the vector 
would raise the false positive rate again, which is not what they would expect.
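For a rough numerical feel for this tradeoff, the standard amplification 
formulas can be sketched in a few lines. This is illustration only, not Spark 
code; `p` stands for the collision probability of a single base hash function 
on a given pair of points:

```python
def p_and_or(p, k, l):
    """AND-OR (what Spark LSH currently does): a pair collides if, in at
    least one of the l hash tables, all k hash functions agree."""
    return 1.0 - (1.0 - p ** k) ** l

def p_or_and(p, k, l):
    """OR-AND (the proposed addition): a pair collides if, for every one of
    the k positions, at least one of the l hash functions agrees."""
    return (1.0 - (1.0 - p) ** l) ** k

# Amplification sharpens the collision curve: near pairs (high p) stay
# likely to collide, far pairs (low p) become much less likely to.
for p in (0.9, 0.5, 0.1):
    print(p, round(p_and_or(p, k=4, l=8), 4), round(p_or_and(p, k=4, l=8), 4))
```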

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097
 ] 

Yun Ni edited comment on SPARK-19771 at 3/2/17 9:55 PM:


[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what N stands for in the memory overhead.
(2) The hash values are not necessarily 0, 1, -1.
(3) If we really want a hash function of a Vector, why not use Vector.hashCode?



was (Author: yunn):
[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what's N in the memory overhead?
(2) The hash values are not necessarily {0, 1, -1}.
(3) If we really want a hash function of Vector, why not use Vector.hashCode?





[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097
 ] 

Yun Ni commented on SPARK-19771:


[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what N stands for in the memory overhead.
(2) The hash values are not necessarily {0, 1, -1}.
(3) If we really want a hash function of a Vector, why not use Vector.hashCode?
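A toy sketch of the compression under discussion (plain Python with made-up 
numbers; the built-in `hash` stands in for something like Vector.hashCode): 
collapsing each per-table signature to one integer shrinks per-row storage 
from NumHashFunctions*NumHashTables values to NumHashTables, but distinct 
signatures can collide on the same integer, which is exactly the extra 
false-positive risk:

```python
K, L = 4, 8  # hash functions per table (AND), number of tables (OR)

# A transformed row: L signatures, each a tuple of K hash values.
# The values here are arbitrary, just to have something concrete.
raw_signatures = [tuple((17 * t + 31 * i) % 101 for i in range(K))
                  for t in range(L)]

# Proposed compression: one integer per table, analogous to Vector.hashCode.
compressed = [hash(sig) for sig in raw_signatures]

print(sum(len(s) for s in raw_signatures))  # K*L values stored before
print(len(compressed))                      # only L values stored after
# ...but hash() is not injective: two different signatures may map to the
# same integer, so equality of compressed values no longer implies equality
# of the underlying signatures.
```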





[jira] [Created] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-02-28 Thread Yun Ni (JIRA)
Yun Ni created SPARK-19771:
--

 Summary: Support OR-AND amplification in Locality Sensitive 
Hashing (LSH)
 Key: SPARK-19771
 URL: https://issues.apache.org/jira/browse/SPARK-19771
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0
Reporter: Yun Ni


The current LSH implementation only supports AND-OR amplification. We need to 
discuss the following questions before we go to implementation:
(1) Whether we should support OR-AND amplification
(2) What API changes we need for OR-AND amplification
(3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.






[jira] [Comment Edited] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-21 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874956#comment-15874956
 ] 

Yun Ni edited comment on SPARK-18454 at 2/21/17 9:39 PM:
-

[~josephkb] [~sethah] [~mlnick] [~karlhigley] I opened a gDoc for discussions. 
Feel free to make comments and add any new sections:

https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit


was (Author: yunn):
[~josephkb] [~sethah] [~mlnick] I opened a gDoc for discussions. Feel free to 
make comments and add any new sections:

https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874956#comment-15874956
 ] 

Yun Ni commented on SPARK-18454:


[~josephkb] [~sethah] [~mlnick] I opened a gDoc for discussions. Feel free to 
make comments and add any new sections:

https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit




[jira] [Resolved] (SPARK-18286) Add Scala/Java/Python examples for MinHash and RandomProjection

2017-02-16 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni resolved SPARK-18286.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Add Scala/Java/Python examples for MinHash and RandomProjection
> ---
>
> Key: SPARK-18286
> URL: https://issues.apache.org/jira/browse/SPARK-18286
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 2.2.0
>
>
> Add Scala/Java/Python examples for MinHash and RandomProjection






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867366#comment-15867366
 ] 

Yun Ni commented on SPARK-18392:


I agree with Seth. We need to first finish SPARK-18080 and SPARK-18450 before 
implementing new LSH subclasses. (SPARK-18454 could be a prerequisite as well 
but I am not sure)

[~merlin] Once the prerequisites are satisfied, you can take either BitSampling 
or SignRandomProjection and I can take the other one.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]




[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487
 ] 

Yun Ni edited comment on SPARK-18392 at 1/20/17 10:29 PM:
--

Yes, checking whether the hash signatures are equal is faster than computing 
the distance between the key and every instance.

But you are right, it is better to compare against fewer records. I will think 
about it further and see if there is a better solution.


was (Author: yunn):
Yes, checking whether the hash signatures are equal is faster than computing 
the distance between the key and every instance.



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487
 ] 

Yun Ni commented on SPARK-18392:


Yes, checking whether the hash signatures are equal is faster than computing 
the distance between the key and every instance.




[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404
 ] 

Yun Ni edited comment on SPARK-18392 at 1/20/17 9:15 PM:
-

Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for two reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operation could cause shuffle spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with its 
hash signature and comparing it with every other instance runs in the same 
stage. (Even more steps may run in the same stage.) Thus it should not hurt 
the running time.

If you see any performance issues, please let us know.


was (Author: yunn):
Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for two reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operation could cause OOM or spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with its 
hash signature and comparing it with every other instance runs in the same 
stage. (Even more steps may run in the same stage.) Thus it should not hurt 
the running time.

If you see any performance issues, please let us know.


[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404
 ] 

Yun Ni commented on SPARK-18392:


Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for two reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operation could cause OOM or spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with its 
hash signature and comparing it with every other instance runs in the same 
stage. (Even more steps may run in the same stage.) Thus it should not hurt 
the running time.

If you see any performance issues, please let us know.
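The tag-and-compare approach described above can be imitated outside Spark 
with a hedged, plain-Python sketch on toy data: each instance is exploded into 
(table index, hash value) keys and matched through an inverted index, 
analogous to how an equality join can find candidate pairs without ever 
materializing one huge group per signature:

```python
from collections import defaultdict

# Toy "transformed" data: instance id -> list of per-table hash values.
rows = {0: [7, 2], 1: [7, 9], 2: [4, 2]}

# Explode each instance into (table_index, hash_value) keys and index them,
# instead of grouping all rows that share a full signature into one record.
index = defaultdict(list)
for rid, sig in rows.items():
    for t, hv in enumerate(sig):
        index[(t, hv)].append(rid)

# Candidate pairs = ids sharing at least one (table, hash) key; a hash join
# over this index emits pairs one at a time rather than building big groups.
candidates = set()
for ids in index.values():
    for i in ids:
        for j in ids:
            if i < j:
                candidates.add((i, j))

print(sorted(candidates))  # [(0, 1), (0, 2)]
```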

> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
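To make the terminology in the quoted description concrete, here is a minimal pure-Python sketch (not Spark's implementation; the names and the sign-random-projection base hash are chosen only for illustration) of a compound hash function {{g = (h_0,...,h_{K-1})}} with AND-amplification and a hash table {{H = (g_0,...,g_{L-1})}} with OR-amplification:

```python
import random

random.seed(42)
K, L, DIM = 3, 4, 8   # AND width, OR width (tables), input dimension

def make_base_hash():
    # sign-random-projection style base hash h_i: h(x) = sign(r . x)
    r = [random.gauss(0, 1) for _ in range(DIM)]
    return lambda x: 1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else 0

# hash table H = (g_0, ..., g_{L-1}); each compound hash g_j = (h_0, ..., h_{K-1})
H = [[make_base_hash() for _ in range(K)] for _ in range(L)]

def hash_table_value(x):
    # H(key): an array of L vectors, each the K-dimensional value of one g_j
    return [[h(x) for h in g] for g in H]

def is_candidate_pair(x, y):
    # OR over tables of AND over base hashes: x and y are candidates if
    # any compound hash value g_j(x) matches g_j(y) exactly
    hx, hy = hash_table_value(x), hash_table_value(y)
    return any(gx == gy for gx, gy in zip(hx, hy))

x = [1.0] * DIM
print(is_candidate_pair(x, x))   # identical points always collide: True
```

Raising K lowers the false positive rate of each table; raising L recovers recall, which is why the two parameters need to be exposed separately.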

[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH

2016-11-17 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18454:
---
Description: 
We all agree to do the following improvement to Multi-Probe NN Search:
(1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
full sort on the whole dataset

Currently we are still discussing the following:
(1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
(2) How we should change the current Nearest Neighbor implementation to make it 
align with the MultiProbe NN Search from the original paper: 
http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf

  was:
We all agree to do the following improvement to Multi-Probe NN Search:
(1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
full sort on the whole dataset

Currently we are still discussing the following:
(1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
(2) How we should change the current MultiProbe implementation to make it align 
with the MultiProbe NN Search from the original paper: 
http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf


> Changes to fix Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) How we should change the current Nearest Neighbor implementation to make 
> it align with the MultiProbe NN Search from the original paper: 
> http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
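Improvement (1) above can be sketched outside Spark as follows: instead of fully sorting every hashDistance just to find the cutoff for the k nearest candidates, estimate the threshold from a quantile and filter with it. This toy uses a plain random sample as the estimator; in Spark the column would feed DataFrame.approxQuantile (a Greenwald-Khanna sketch). All data and names here are made up:

```python
import random

random.seed(0)

# made-up hashDistance values for every row in the dataset
hash_distances = [random.random() for _ in range(100_000)]
k = 50   # number of nearest neighbours requested

# baseline: a full sort of the whole dataset just to find the k-th distance
exact_threshold = sorted(hash_distances)[k - 1]

# approximation: estimate the k/n quantile from a sample and filter with it;
# only the (much smaller) candidate set then needs an exact pass
sample = sorted(random.sample(hash_distances, 5_000))
q = k / len(hash_distances)
approx_threshold = sample[max(0, int(q * len(sample)) - 1)]

candidates = [d for d in hash_distances if d <= approx_threshold]
print(len(candidates) < len(hash_distances))   # filtered set is smaller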






[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH

2016-11-17 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18454:
---
Description: 
We all agree to do the following improvement to Multi-Probe NN Search:
(1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
full sort on the whole dataset

Currently we are still discussing the following:
(1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
(2) What are the issues and how we should change the current Nearest Neighbor 
implementation

  was:
We all agree to do the following improvement to Multi-Probe NN Search:
(1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
full sort on the whole dataset

Currently we are still discussing the following:
(1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
(2) How we should change the current Nearest Neighbor implementation to make it 
align with the MultiProbe NN Search from the original paper: 
http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf


> Changes to fix Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH

2016-11-17 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18454:
---
Summary: Changes to fix Nearest Neighbor Search for LSH  (was: Changes to 
fix Multi-Probe Nearest Neighbor Search for LSH)

> Changes to fix Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) How we should change the current MultiProbe implementation to make it 
> align with the MultiProbe NN Search from the original paper: 
> http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-15 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668372#comment-15668372
 ] 

Yun Ni commented on SPARK-18392:


Re-organized the ticket dependencies a little bit. PTAL.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]






[jira] [Created] (SPARK-18454) Changes to fix Multi-Probe Nearest Neighbor Search for LSH

2016-11-15 Thread Yun Ni (JIRA)
Yun Ni created SPARK-18454:
--

 Summary: Changes to fix Multi-Probe Nearest Neighbor Search for LSH
 Key: SPARK-18454
 URL: https://issues.apache.org/jira/browse/SPARK-18454
 Project: Spark
  Issue Type: Improvement
Reporter: Yun Ni


We all agree to do the following improvement to Multi-Probe NN Search:
(1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
full sort on the whole dataset

Currently we are still discussing the following:
(1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
(2) How we should change the current MultiProbe implementation to make it align 
with the MultiProbe NN Search from the original paper: 
http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf






[jira] [Created] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing

2016-11-15 Thread Yun Ni (JIRA)
Yun Ni created SPARK-18450:
--

 Summary: Add AND-amplification to Locality Sensitive Hashing
 Key: SPARK-18450
 URL: https://issues.apache.org/jira/browse/SPARK-18450
 Project: Spark
  Issue Type: Improvement
Reporter: Yun Ni


We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. 
Once the change is applied, we can add AND-amplification to the LSH implementation.






[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-13 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18408:
---
Description: 
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private
 - Make Multi-Probe NN Search and {{hashDistance}} private for future discussion

  was:
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private


> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to current LSH Implementations, we are planning to 
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion
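For illustration, the proposed output shape can be sketched in plain Python (the base hashes and values below are invented; only the Array-of-Vector nesting mirrors the proposal): each transformed row becomes an array of numHashTables vectors, each of length numHashFunctions.

```python
# toy shape check for the proposed transform output schema
num_hash_tables = 3      # outer Array dimension (OR-amplification)
num_hash_functions = 2   # inner Vector dimension (AND-amplification)

# each table is a list of base hashes; here: bucket the j-th coordinate,
# shifted by the table index i so the tables differ
hash_tables = [
    [lambda p, i=i, j=j: int(p[j] * 10) + i for j in range(num_hash_functions)]
    for i in range(num_hash_tables)
]

def transform_row(point):
    # one row of the transform output: Array (per table) of Vector
    # (per base hash function), i.e. no aggregation for amplification
    return [[h(point) for h in table] for table in hash_tables]

row = transform_row((0.35, 0.82))
print(len(row), len(row[0]))   # 3 2
```

Keeping the raw per-function values (rather than collapsing each vector into one integer) is what lets similarity join and NN search implement amplification themselves.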






[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-10 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18408:
---
Description: 
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private

  was:
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private


> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to current LSH Implementations, we are planning to 
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private






[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-10 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18408:
---
Description: 
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private

  was:
As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to `Array of Vector` instead of `Vectors`
 - Use `numHashTables` as the dimension of `Array` and `numHashFunctions` as 
the dimension of `Vector`
 - Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to 
`MinHashLSH`
 - Make randUnitVectors/randCoefficients private


> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to current LSH Implementations, we are planning to 
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use `numHashTables` as the dimension of {{Array}} and {{numHashFunctions}} 
> as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private






[jira] [Created] (SPARK-18408) API Improvements for LSH

2016-11-10 Thread Yun Ni (JIRA)
Yun Ni created SPARK-18408:
--

 Summary: API Improvements for LSH
 Key: SPARK-18408
 URL: https://issues.apache.org/jira/browse/SPARK-18408
 Project: Spark
  Issue Type: Improvement
Reporter: Yun Ni


As the first improvements to current LSH Implementations, we are planning to do 
the following:
 - Change output schema to `Array of Vector` instead of `Vectors`
 - Use `numHashTables` as the dimension of `Array` and `numHashFunctions` as 
the dimension of `Vector`
 - Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to 
`MinHashLSH`
 - Make randUnitVectors/randCoefficients private






[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-10 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654822#comment-15654822
 ] 

Yun Ni commented on SPARK-18392:


I think your summary is very good in general. For class names, I think 
`BucketedRandomProjectionLSH` and `SignRandomProjectionLSH` look intuitive and 
won't cause confusion.

If we are breaking this into subtasks, let us prioritize the output schema first, 
because it will be a big change that affects most parts of the current 
implementation. If there is no objection, shall I open a subtask and make a PR for 
that?

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]





[jira] [Comment Edited] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506
 ] 

Yun Ni edited comment on SPARK-18334 at 11/7/16 9:25 PM:
-

[~sethah] [~josephkb] Will send a PR to fix shortly.


was (Author: yunn):
[~sethah][~josephkb]

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets is unrelated to the absolute distance between their hash 
> bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.






[jira] [Comment Edited] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506
 ] 

Yun Ni edited comment on SPARK-18334 at 11/7/16 9:25 PM:
-

[~sethah] [~josephkb] I will send a PR to fix shortly.


was (Author: yunn):
[~sethah] [~josephkb] Will send a PR to fix shortly.

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets is unrelated to the absolute distance between their hash 
> bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.






[jira] [Commented] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645506#comment-15645506
 ] 

Yun Ni commented on SPARK-18334:


[~sethah][~josephkb]

> MinHash should use binary hash distance
> ---
>
> Key: SPARK-18334
> URL: https://issues.apache.org/jira/browse/SPARK-18334
> Project: Spark
>  Issue Type: Bug
>Reporter: Yun Ni
>Priority: Trivial
>
> MinHash currently uses the same `hashDistance` function as 
> RandomProjection. This does not make sense for MinHash because the Jaccard 
> distance of two sets is unrelated to the absolute distance between their hash 
> bucket indices.
> This bug could affect the accuracy of multi-probe NN search for MinHash.






[jira] [Created] (SPARK-18334) MinHash should use binary hash distance

2016-11-07 Thread Yun Ni (JIRA)
Yun Ni created SPARK-18334:
--

 Summary: MinHash should use binary hash distance
 Key: SPARK-18334
 URL: https://issues.apache.org/jira/browse/SPARK-18334
 Project: Spark
  Issue Type: Bug
Reporter: Yun Ni
Priority: Trivial


MinHash currently uses the same `hashDistance` function as 
RandomProjection. This does not make sense for MinHash because the Jaccard 
distance of two sets is unrelated to the absolute distance between their hash 
bucket indices.

This bug could affect the accuracy of multi-probe NN search for MinHash.
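A minimal sketch of the proposed direction (assuming the "average of indicators of hash collisions" definition discussed in SPARK-18392; the prime and affine coefficients below are illustrative, not Spark's exact code): a binary hashDistance over MinHash signatures estimates 1 minus the Jaccard similarity, which is exactly the quantity the bucket-index distance fails to capture.

```python
import random

random.seed(7)

PRIME = 2038074743   # a large prime for the affine base hashes
NUM_HASHES = 200     # number of base hash functions in the signature

# random affine base hashes: h_i(x) = (a_i * x + b_i) mod PRIME
coeffs = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def minhash_signature(elems):
    # one MinHash value per base hash: the minimum over the set's elements
    return [min((a * x + b) % PRIME for x in elems) for a, b in coeffs]

def hash_distance(sig1, sig2):
    # binary hash distance: 1 minus the average of collision indicators,
    # which estimates (1 - Jaccard similarity)
    collisions = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return 1.0 - collisions / len(sig1)

a = set(range(0, 100))
b = set(range(50, 150))   # true Jaccard similarity = 50 / 150 = 1/3
d = hash_distance(minhash_signature(a), minhash_signature(b))
print(abs(d - 2 / 3) < 0.15)   # the estimate should land near 1 - J
```

More base hashes shrink the variance of the estimate, which is why the signature length matters for probing accuracy.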






[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637493#comment-15637493
 ] 

Yun Ni commented on SPARK-18081:


This is super helpful. Thanks!

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>







[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637102#comment-15637102
 ] 

Yun Ni commented on SPARK-18081:


Sorry, I was really overloaded this week. I will try my best to send a PR 
before EoW.

BTW, when I was writing ml-lsh.md, I could not find a place to get a preview. 
Do you have any guidelines for previewing the Spark docs?

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>







[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581416#comment-15581416
 ] 

Yun Ni commented on SPARK-5992:
---

Yes, I have implemented cosine, Jaccard, Euclidean, and Hamming distances. Please 
check:
https://github.com/apache/spark/pull/15148
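For readers of this thread, the four distance measures can be sketched in plain Python. This is only an illustrative sketch, not Spark's implementation (Spark operates on ml.linalg vectors):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes both vectors are non-zero
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def jaccard_distance(a, b):
    # a and b are sets; 1 - |intersection| / |union|
    return 1.0 - len(a & b) / len(a | b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    # number of positions at which the two sequences differ
    return sum(x != y for x, y in zip(a, b))
```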

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581412#comment-15581412
 ] 

Yun Ni commented on SPARK-5992:
---

Yes, I do have a comparison with a full scan. kNN with a single hash function 
usually does not work well, so I enabled OR-amplification with multiple hash 
functions. You can set the number of hash functions using "outputDim".
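As a rough illustration of OR-amplification, here is a pure-Python sketch using random-hyperplane hashing for cosine similarity (the helper names are made up for this example and are not part of the Spark API). A point becomes a candidate neighbor if it collides with the query on at least one hash function, so adding hash functions raises recall at the cost of more false positives:

```python
import random

def make_hash_fns(dim, num_hashes, seed=42):
    # Each hash function is a random hyperplane: a 1-bit sign-of-dot-product
    # hash, which approximately preserves cosine similarity.
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_hashes)]

    def hashes(v):
        return [int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                for p in planes]
    return hashes

def or_amplified_candidates(query, points, hashes):
    # OR-amplification: keep a point if it collides with the query on
    # AT LEAST ONE of the hash functions.
    qh = hashes(query)
    return [p for p in points
            if any(h == q for h, q in zip(hashes(p), qh))]
```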







[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581345#comment-15581345
 ] 

Yun Ni commented on SPARK-5992:
---

Some tests on the WEX Open Dataset, with steps and results:
https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro







[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-19 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502493#comment-15502493
 ] 

Yun Ni commented on SPARK-5992:
---

Hi Joseph,

I have made an initial PR based on the design doc: 
https://github.com/apache/spark/pull/15148
Please take a look; in the meantime, I will test it on the open dataset.

Thanks very much,
Yun Ni







[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-09-09 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478135#comment-15478135
 ] 

Yun Ni commented on SPARK-5992:
---

Thank you very much for reviewing it, Joseph!

I will work on the first commit according to the doc and send a PR soon. :)







[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-31 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453980#comment-15453980
 ] 

Yun Ni commented on SPARK-5992:
---

Thanks very much for reviewing, Joseph!

Based on your comments, I have made the following major changes:
(1) Changed the API design to use Estimator instead of UnaryTransformer.
(2) Added a Testing section that elaborates on the design of the unit tests 
and scale tests.

Please let me know if you have any further comments, so that we can go forward.
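For context, the difference between the two API styles is that an Estimator learns state from the training data in fit() and returns a fitted Model (itself a Transformer), whereas a UnaryTransformer is a stateless column-to-column mapping. A toy sketch of the pattern in plain Python (illustrative only; these are not the actual Spark ML classes):

```python
class Transformer:
    def transform(self, dataset):
        raise NotImplementedError

class Estimator:
    def fit(self, dataset):
        raise NotImplementedError

class MinMaxScaler(Estimator):
    # Toy version: learns min/max from the training data in fit(); the
    # returned Model carries that learned state and applies it in transform().
    def fit(self, xs):
        lo, hi = min(xs), max(xs)

        class Model(Transformer):
            def transform(self, ys):
                return [(y - lo) / (hi - lo) for y in ys]

        return Model()
```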







[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni edited comment on SPARK-5992 at 8/23/16 5:50 AM:


Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know whether this meets your requirements.

Thanks,
Yun Ni








[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni commented on SPARK-5992:
---

Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know whether this meets your requirements.



