[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-05-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16966
  
@MLnick @jkbradley @sethah Could you take a look? Thanks!





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-05-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17092
  
@MLnick @jkbradley @sethah Could you take a look? Thanks!





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-04-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17092
  
Ping.





[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-04-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16966
  
Ping.





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-03-09 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17092
  
@jkbradley @sethah Please take a look when you have time. Thanks!





[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-03-09 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16966
  
@MLnick @jkbradley Please take a look when you have time. Thanks!





[GitHub] spark issue #17104: [MINOR][ML] Fix comments in LSH Examples and Python API

2017-02-28 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17104
  
@srowen The full name works. I just want to make the comments shorter so that
they're easier to read.





[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-28 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/17092
  
@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous one!

@merlintang I am happy to continue our discussion at
https://issues.apache.org/jira/browse/SPARK-19771, as OR-AND amplification
requires many more changes than SPARK-18450.





[GitHub] spark pull request #17104: [MINOR][ML] Fix comments in LSH Examples and Pyth...

2017-02-28 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/17104

[MINOR][ML] Fix comments in LSH Examples and Python API

## What changes were proposed in this pull request?
Remove `org.apache.spark.examples.` from the run commands in the comments of
the LSH examples.
Add a slash in one of the Python docs.

## How was this patch tested?
Run examples using the commands in the comments.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark yunn_minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17104









[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r103361528
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,196 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A column "distCol" is
+ added to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each pair of rows. Use
+"distCol" as default value if it's not specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+ "datasetA" and "datasetB", and a column "distCol" is 
added to show the distance
+ between each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash values in the same
+dimension are calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
<https://en.wikipedia.org/wiki/Locality-sensitive_hashing#

[GitHub] spark pull request #17092: [SPARK-18450][ML] Scala API Change for LSH AND-am...

2017-02-27 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/17092

[SPARK-18450][ML] Scala API Change for LSH AND-amplification

## What changes were proposed in this pull request?
Implemented a new Param, numHashFunctions, as the dimension of
AND-amplification for Locality Sensitive Hashing. The hash of each feature in
LSH is now an array of size numHashTables, where each element of the array is a
vector of size numHashFunctions.

Two features are in the same hash bucket iff ANY pair of their vectors is equal
(OR-amplification). Two vectors are equal iff ALL pairs of their entries are
equal (AND-amplification).

Will create follow-up PRs for Python API and Doc/Examples.
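
To make the amplification semantics concrete, here is a minimal plain-Scala
sketch of the proposed hash layout and bucket test; the names and values are
illustrative and this is not the actual Spark implementation:

```
// Minimal sketch of the proposed hash layout (illustrative only).
object AmplificationSketch {
  // One entry per hash table (numHashTables); each entry holds numHashFunctions values.
  type HashArray = Array[Vector[Double]]

  // OR across hash tables: ANY table whose vectors match puts the pair in one bucket.
  // AND within a table: the vectors match only if ALL entries are equal.
  def sameBucket(a: HashArray, b: HashArray): Boolean =
    a.zip(b).exists { case (va, vb) =>
      va.zip(vb).forall { case (x, y) => x == y }
    }

  def main(args: Array[String]): Unit = {
    val h1: HashArray = Array(Vector(-2.0, -2.0, 3.0), Vector(0.0, -3.0, -1.0))
    val h2: HashArray = Array(Vector(-2.0, -2.0, 3.0), Vector(1.0, -3.0, -1.0))
    println(sameBucket(h1, h2)) // true: the first table matches on every entry
  }
}
```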

## How was this patch tested?
By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark SPARK-18450

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17092.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17092


commit e6f9f9541f0b00c14b7c5a201b22aeb400eb9f19
Author: Yun Ni <y...@uber.com>
Date:   2017-02-16T20:54:22Z

Scala API Change for AND-amplification

commit 010acb2caf69ca0822db6aeb866cce21cdfcce4b
Author: Yunni <euler57...@gmail.com>
Date:   2017-02-27T03:43:21Z

Merge branch 'SPARK-18450' of https://github.com/Yunni/spark into 
SPARK-18450

commit 83a155699df4b15f1ab1fc427730613b63f7d1d6
Author: Yunni <euler57...@gmail.com>
Date:   2017-02-27T04:04:37Z

Fix typos in unit tests

commit 9dd87ba21a025939df7020ff1491a2c6c29f2d93
Author: Yunni <euler57...@gmail.com>
Date:   2017-02-28T02:04:10Z

Merge branch 'master' of https://github.com/apache/spark into SPARK-18450







[GitHub] spark pull request #16965: [SPARK-18450][ML] Scala API Change for LSH AND-am...

2017-02-27 Thread Yunni
Github user Yunni closed the pull request at:

https://github.com/apache/spark/pull/16965





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
Looks like the rebase is making it even worse. I will open a new PR.





[GitHub] spark issue #16966: [SPARK-18409][ML]LSH approxNearestNeighbors should use a...

2017-02-26 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16966
  
@MLnick I did some experiments with WEX datasets. I have put the results in 
the description.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-26 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
The number of rows would be O(LN). The memory usage will be different because
the size of each row changes before and after the explode. Also, the Catalyst
Optimizer may apply projections during the join, which can also change the size
of each row.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang Not exactly. Each row will explode into L rows, where L is the
number of hash tables, like the following:
```
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[[-10.0,-10.0],Wr...|    0|[-2.0,-2.0,3.0,-2.0]|
|[[-10.0,-10.0],Wr...|    1|[0.0,-3.0,-1.0,-2.0]|
|[[-10.0,-9.0],Wra...|    0|[-2.0,-2.0,3.0,-2.0]|
|[[-10.0,-9.0],Wra...|    1|[0.0,-3.0,-1.0,-2.0]|
|[[-10.0,-8.0],Wra...|    0|[-2.0,-2.0,3.0,-1.0]|
|[[-10.0,-8.0],Wra...|    1| [0.0,-3.0,0.0,-2.0]|
```
You can look at the code here: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala#L238
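
A rough sketch of how such an explode can be expressed with Spark SQL functions
(column names follow the table above; this is a simplification for illustration,
not the actual LSH.scala code):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ExplodeSketch").getOrCreate()
    import spark.implicits._

    // Pretend "hashes" is the LSH output column: one hash value per hash table (L = 2 here),
    // with the vectors simplified to strings to keep the sketch self-contained.
    val df = Seq(
      (0, Seq("[-2.0,-2.0,3.0,-2.0]", "[0.0,-3.0,-1.0,-2.0]")),
      (1, Seq("[-2.0,-2.0,3.0,-2.0]", "[0.0,-3.0,0.0,-2.0]"))
    ).toDF("id", "hashes")

    // struct(col("*")) keeps the original row; posexplode emits one (entry, hashValue)
    // pair per hash table, so each input row becomes L output rows.
    df.select(struct(col("*")).as("datasetA"),
              posexplode(col("hashes")).as(Seq("entry", "hashValue")))
      .show(truncate = false)

    spark.stop()
  }
}
```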





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang 
(1) `hashDistance` is only used for multi-probe NN search. The terms
`numHashTables` and `numHashFunctions` are very hard to interpret in OR-AND
cases.

(2) For similarity join, we actually first do explode and then join. The join
key would be of vector type.

(3) Yes. However, in order to get rows using hashes, we would need to intersect
large sets of rows, while in the AND-OR case we take the union of small sets of
rows, which is more efficient.

I also suggest we limit the scope to the implementation of 
AND-amplification here. We can open other tickets to discuss memory issues, etc.
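
As a hedged illustration of point (2), the join step on already-exploded data
might look roughly like this (simplified: the hash values are strings here
rather than vectors, and the column and variable names are made up for the
example, not taken from the actual code):

```
import org.apache.spark.sql.SparkSession

object ExplodedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ExplodedJoinSketch").getOrCreate()
    import spark.implicits._

    // Already-exploded datasets: one row per (original row, hash table) pair.
    val explodedA = Seq((0, 0, "[-2.0,-2.0]"), (0, 1, "[0.0,-3.0]"))
      .toDF("idA", "entry", "hashValue")
    val explodedB = Seq((3, 0, "[-2.0,-2.0]"), (4, 1, "[1.0,-3.0]"))
      .toDF("idB", "entry", "hashValue")

    // Rows of A and B that share a hash value in the same hash table become
    // candidate pairs; distances are then computed only for these pairs.
    explodedA.join(explodedB, Seq("entry", "hashValue"))
      .select("idA", "idB").distinct().show()

    spark.stop()
  }
}
```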





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang Sorry, I still don't quite get why we need to support OR-AND when
the effective threshold is low. My understanding is that we can always tune
numHashTables and numHashFunctions for AND-OR to make the probability as good
as OR-AND. Please correct me if I am wrong.
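
For intuition, the two amplification curves can be compared directly with the
standard formulas; a small sketch (the parameter values below are illustrative,
not numbers from this PR):

```
// p is the collision probability of a single hash function for a given pair,
// b = numHashTables, r = numHashFunctions (standard LSH amplification formulas).
object AmplificationProbabilities {
  // AND within a table (r functions), then OR across b tables.
  def andOr(p: Double, r: Int, b: Int): Double = 1.0 - math.pow(1.0 - math.pow(p, r), b)
  // OR across b functions first, then AND over r groups.
  def orAnd(p: Double, r: Int, b: Int): Double = math.pow(1.0 - math.pow(1.0 - p, b), r)

  def main(args: Array[String]): Unit = {
    val (r, b) = (4, 8)
    for (p <- Seq(0.2, 0.5, 0.8))
      println(f"p=$p%.1f  AND-OR=${andOr(p, r, b)}%.3f  OR-AND=${orAnd(p, r, b)}%.3f")
  }
}
```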

My concerns about supporting OR-AND are the following:
(1) We probably need some backward-incompatible API changes. `Array[Vector]`,
numHashTables, and numHashFunctions seem to make less sense for OR-AND.
(2) To avoid a broadcast join, we will need a very different and complicated
mechanism for the join step in approxSimilarityJoin for OR-AND.
(3) I am thinking about building an index to improve performance for nearest
neighbor search
(https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit).
Supporting OR-AND will make the index less efficient when we retrieve records
for given hash buckets.

@jkbradley @sethah @MLnick Any thoughts?





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang We use AND-OR in both approxNearestNeighbor and 
approxSimilarityJoin, and it's more difficult for approxSimilarityJoin to adopt 
OR-AND than AND-OR.

My understanding: for a (d1, d2, p1, p2)-sensitive hash family, AND-OR can
increase p1 and decrease p2 just like OR-AND does. What are the use cases for
OR-AND rather than AND-OR?





[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-21 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
Hi @e-m-m, I think the Python API will be included in Spark 2.2.





[GitHub] spark pull request #16966: [SPARK-18409][ML]LSH approxNearestNeighbors shoul...

2017-02-20 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16966#discussion_r102065786
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
 modelSubsetWithDistCol.sort(distCol).limit(numNearestNeighbors)
   }
 
+  private[feature] def approxNearestNeighbors(
+  dataset: Dataset[_],
+  key: Vector,
+  numNearestNeighbors: Int,
+  singleProbe: Boolean,
+  distCol: String): Dataset[_] = {
+approxNearestNeighbors(dataset, key, numNearestNeighbors, singleProbe, 
distCol, 0.05)
--- End diff --

Just an empirical relative error for approxQuantile.





[GitHub] spark pull request #16966: [SPARK-18409][ML]LSH approxNearestNeighbors shoul...

2017-02-17 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16966#discussion_r101832855
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -147,6 +148,15 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
 modelSubsetWithDistCol.sort(distCol).limit(numNearestNeighbors)
   }
 
+  private[feature] def approxNearestNeighbors(
+  dataset: Dataset[_],
+  key: Vector,
+  numNearestNeighbors: Int,
+  singleProbe: Boolean,
+  distCol: String): Dataset[_] = {
+approxNearestNeighbors(dataset, key, numNearestNeighbors, singleProbe, 
distCol, 0.05)
--- End diff --

Let me know if the added Scaladoc makes sense to you.





[GitHub] spark pull request #16966: [SPARK-18409][ML]LSH approxNearestNeighbors shoul...

2017-02-16 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/16966

[SPARK-18409][ML]LSH approxNearestNeighbors should use approxQuantile 
instead of sort

## What changes were proposed in this pull request?
In the previous implementation of LSH approxNearestNeighbors, we used sorting
to get the hash threshold. By moving to approxQuantile, we can get results as
good as the sort-based implementation while improving the running time
substantially.
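
A hedged sketch of the idea (column and variable names are illustrative, not
the actual LSH.scala change): instead of sorting the whole hash-distance column
to find the k-th smallest value, ask for the k/n quantile via
DataFrame.stat.approxQuantile. The 0.05 relative error matches the value
mentioned elsewhere in this thread.

```
import org.apache.spark.sql.SparkSession

object ApproxQuantileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ApproxQuantileSketch").getOrCreate()
    import spark.implicits._

    val hashDistances = Seq(0.9, 0.1, 0.4, 0.3, 0.7, 0.2, 0.8, 0.6, 0.5, 0.0)
      .toDF("hashDist")
    val numNearestNeighbors = 3
    val quantile = numNearestNeighbors.toDouble / hashDistances.count()

    // Sort-based threshold: requires a full sort of the column.
    val sorted = hashDistances.sort("hashDist").take(numNearestNeighbors)
    println(s"sort-based threshold: ${sorted.last.getDouble(0)}")

    // approxQuantile-based threshold: single pass with bounded relative error.
    val Array(threshold) =
      hashDistances.stat.approxQuantile("hashDist", Array(quantile), 0.05)
    println(s"approxQuantile threshold: $threshold")

    spark.stop()
  }
}
```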

## How was this patch tested?
By running unit tests BucketedRandomProjectionLSHSuite and MinHashLSHSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark SPARK-18409

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16966


commit a3c6c0f86da47e0efa32f3d79846a69a4451517b
Author: Yun Ni <y...@uber.com>
Date:   2017-02-16T22:12:35Z

LSH approxNearestNeighbors should use approxQuantile instead of sort







[GitHub] spark pull request #16965: [Spark-18450][ML] Scala API Change for LSH AND-am...

2017-02-16 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/16965

[Spark-18450][ML] Scala API Change for LSH AND-amplification

## What changes were proposed in this pull request?
Implemented a new Param, numHashFunctions, as the dimension of
AND-amplification for Locality Sensitive Hashing. The hash of each feature in
LSH is now an array of size numHashTables, where each element of the array is a
vector of size numHashFunctions.

Two features are in the same hash bucket iff ANY pair of their vectors is equal
(OR-amplification). Two vectors are equal iff ALL pairs of their entries are
equal (AND-amplification).

## How was this patch tested?
By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark SPARK-18450

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16965


commit e6f9f9541f0b00c14b7c5a201b22aeb400eb9f19
Author: Yun Ni <y...@uber.com>
Date:   2017-02-16T20:54:22Z

Scala API Change for AND-amplification







[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-16 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
Sure. Will do.





[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-14 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
@sethah Really appreciate your detailed code review and comments. :)
@MLnick @yanboliang Thank you for the help as well. Please let me know if 
you guys have any other comments.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-14 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r101089800
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -222,17 +222,18 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   }
 
   /**
-   * Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+   * Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
* the threshold. If the [[outputCol]] is missing, the method will 
transform the data; if the
* [[outputCol]] exists, it will use the [[outputCol]]. This allows 
caching of the transformed
* data when necessary.
*
* @param datasetA One of the datasets to join.
* @param datasetB Another dataset to join.
* @param threshold The threshold for the distance of row pairs.
-   * @param distCol Output column for storing the distance between each 
result row and the key.
+   * @param distCol Output column for storing the distance between each 
pair of rows.
* @return A joined dataset containing pairs of rows. The original rows 
are in columns
-   * "datasetA" and "datasetB", and a distCol is added to show the 
distance of each pair.
+   * "datasetA" and "datasetB", and a distCol is added to show the 
distance between each
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-14 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r101089807
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +43,45 @@ object MinHashLSHExample {
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
 
 val mh = new MinHashLSH()
-  .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = mh.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the 
column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 0.6).show()
-model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then 
perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the 
already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
+println("Approximately joining dfA and dfB on Jaccard distance smaller 
than 0.6:")
+model.approxSimilarityJoin(dfA, dfB, 0.6)
+  .select(col("datasetA.id").alias("idA"),
+col("datasetB.id").alias("idB"),
+col("distCol").alias("JaccardDistance")).show()
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-14 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r101089762
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +945,103 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])` 
means there are 10 elements
+in the space. This set contains elements 2, 3, and 5. Also, any input 
vector must have at
+least 1 non-zero index, and all non-zero values are treated as binary 
"1" values.
+
+.. seealso:: `Wikipedia on MinHash 
<https://en.wikipedia.org/wiki/MinHash>`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> from pyspark.sql.functions import col
+>>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["id", "features"])
+>>> mh = MinHashLSH(inputCol="features", outputCol="hashes", 
seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
hashes=[DenseVector([-1638925...
+>>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["id", "features"])
+>>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
+>>> model.approxNearestNeighbors(df2, key, 1).collect()
+[Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), 
hashes=[DenseVector([-163892...
+>>> model.approxSimilarityJoin(df, df2, 0.6, 
distCol="JaccardDistance").select(
+... col("datasetA.id").alias("idA"),
+... col("datasetB.id").alias("idB"),
+... col("JaccardDistance")).show()
++---+---+---------------+
+|idA|idB|JaccardDistance|
++---+---+---------------+
+|  1|  4|            0.5|
+|  0|  5|            0.5|
++---+---+---------------+
+...
+>>> mhPath = temp_path + "/mh"
+>>> mh.save(mhPath)
+>>> mh2 = MinHashLSH.load(mhPath)
+>>> mh2.getOutputCol() == mh.getOutputCol()
+True
+>>> modelPath = temp_path + "/mh-model"
+>>> model.save(modelPath)
+>>> model2 = MinHashLSHModel.load(modelPath)
--- End diff --

Added.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966548
  
--- Diff: docs/ml-features.md ---
@@ -1558,6 +1558,15 @@ for more details on the API.
 
 {% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
 
+
+
+
+Refer to the [BucketedRandomProjectionLSH Python 
docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
+for more details on the API.
+
+{% include_example python/ml/bucketed_random_projection_lsh.py %}
--- End diff --

Sorry, I forgot to retest after renaming the Python examples. Thanks for the
information.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966555
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966541
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +38,44 @@ object MinHashLSHExample {
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
 
 val mh = new MinHashLSH()
-  .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = mh.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the 
column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 0.6).show()
-model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then 
perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the 
already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
+println("Approximately joining dfA and dfB on Jaccard distance smaller 
than 0.6:")
+model.approxSimilarityJoin(dfA, dfB, 0.6)
+  .select(col("datasetA.id").alias("idA"),
+col("datasetB.id").alias("idB"),
+col("distCol").alias("JaccardDistance")).show()
 
-// Approximate nearest neighbor search
+// Compute the locality sensitive hashes for the input rows, then 
perform approximate nearest
+// neighbor search.
+// We could avoid computing hashes by passing in the 
already-transformed dataset, e.g.
+// `model.approxNearestNeighbors(transformedA, key, 2)`
+// It may return less than 2 rows because of lack of elements in the 
hash buckets.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966552
  
--- Diff: 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,81 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+from pyspark.sql.functions import col
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
--- End diff --

Added in 4 places.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966561
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala
 ---
@@ -38,40 +39,45 @@ object BucketedRandomProjectionLSHExample {
   (1, Vectors.dense(1.0, -1.0)),
   (2, Vectors.dense(-1.0, -1.0)),
   (3, Vectors.dense(-1.0, 1.0))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
   (4, Vectors.dense(1.0, 0.0)),
   (5, Vectors.dense(-1.0, 0.0)),
   (6, Vectors.dense(0.0, 1.0)),
   (7, Vectors.dense(0.0, -1.0))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.dense(1.0, 0.0)
 
 val brp = new BucketedRandomProjectionLSH()
   .setBucketLength(2.0)
   .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = brp.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the 
column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 1.5).show()
-model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < 
datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then 
perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the 
already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
+println("Approximately joining dfA and dfB on Euclidean distance 
smaller than 1.5:")
+model.approxSimilarityJoin(dfA, dfB, 1.5)
+  .select(col("datasetA.id").alias("idA"),
+col("datasetB.id").alias("idB"),
+col("distCol").alias("EuclideanDistance")).show()
--- End diff --

Done in 6 places.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966545
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java
 ---
@@ -35,6 +35,8 @@
 import org.apache.spark.sql.types.Metadata;
 import org.apache.spark.sql.types.StructField;
 import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.functions.*;
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966554
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
--- End diff --

Fixed.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966534
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
--- End diff --

Removed in 4 places.
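
As context for the mixin quoted above, a short sketch of calling
approxNearestNeighbors, assuming the BucketedRandomProjectionLSH estimator from
this PR; the data, key and k=2 are illustrative assumptions:

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, Vectors.dense([1.0, 1.0]),),
                                (1, Vectors.dense([1.0, -1.0]),),
                                (2, Vectors.dense([-1.0, -1.0]),)],
                               ["id", "keys"])

    brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values",
                                      bucketLength=2.0, numHashTables=3)
    model = brp.fit(df)

    # Returns at most 2 rows closest to the key; the distance is in "distCol".
    key = Vectors.dense([1.0, 0.0])
    model.approxNearestNeighbors(df, key, 2).show()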


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966539
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala
 ---
@@ -111,8 +111,8 @@ class BucketedRandomProjectionLSHModel private[ml](
  * Euclidean distance metrics.
  *
  * The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
- * distance space. The output will be vectors of configurable dimension. 
Hash values in the
- * same dimension are calculated by the same hash function.
+ * distance space. The output will be vectors of configurable dimension. 
Hash values in the same
--- End diff --

Reverted
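
To illustrate the sentence being discussed, a small sketch of what the transformed
output looks like, assuming the Python wrapper from this PR; the data and parameter
values are illustrative. Each row gets one hash vector per hash table, and values
at the same position across rows come from the same hash function:

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.dense([1.0, 1.0]),),
                                (Vectors.dense([1.0, -1.0]),)], ["keys"])

    # With numHashTables=3, the output column holds a list of 3 hash vectors per row.
    brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="hashes",
                                      bucketLength=2.0, numHashTables=3)
    print(brp.fit(df).transform(df).head().hashes)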


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966530
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966537
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
--- End diff --

Done.
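
To make the return value described above concrete, a sketch of unpacking the joined
result, assuming the struct columns "datasetA" and "datasetB" and the default
"distCol" named in the docstring; the "id" field and the 0.8 threshold are
illustrative assumptions:

    from pyspark.ml.feature import MinHashLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    dfA = spark.createDataFrame([(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),)],
                                ["id", "keys"])
    dfB = spark.createDataFrame([(1, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)],
                                ["id", "keys"])

    model = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3).fit(dfA)

    # Each result row keeps the original rows as structs plus the computed distance.
    joined = model.approxSimilarityJoin(dfA, dfB, 0.8)
    joined.select(col("datasetA.id").alias("idA"),
                  col("datasetB.id").alias("idB"),
                  col("distCol")).show()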


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-02-08 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100199037
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
<https://en.wikipedia.o

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100198559
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
<https://en.wikipedia.o

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192059
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193058
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
--- End diff --

Fixed
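
Since the flagged line uses the Scala `Array[(index, value)]` notation inside a
Python docstring, a tiny sketch of the equivalent Python construction with
pyspark.ml.linalg.Vectors; the 10-element space and the set {2, 3, 5} mirror the
example in the docstring:

    from pyspark.ml.linalg import Vectors

    # A 10-element space where the set contains elements 2, 3 and 5; MinHashLSH
    # treats all non-zero values as binary "1" values.
    v = Vectors.sparse(10, [2, 3, 5], [1.0, 1.0, 1.0])
    print(v)  # (10,[2,3,5],[1.0,1.0,1.0])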


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193020
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+Also, any input vector must have at least 1 non-zero index, and all 
non-zero values
+are treated as binary "1" values.
+
+.. seealso:: `MinHash <https://en.wikipedia.org/wiki/MinHash>`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["keys"])
+>>> mh = MinHashLSH(inputCol="keys", outputCol="values", seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(keys=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
values=[DenseVector([-1638925712.0])])
+>>> data2 = [(Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["keys"])
+>>> key = Vectors.sparse(6, [1], [1.0])
+>>> model.approxNearestNeighbors(df2, key, 
1).select("distCol").head()[0]
+0.6...
+>>> model.approxSimilarityJoin(df, df2, 
1.0).select("distCol").head()[0]
+0.5
+>>> mhPath = temp_path + "/mh"
+>>> mh.save(mhPath)
+>>> mh2 = MinHashLSH.load(mhPath)
+>>> mh2.getOutputCol() == mh.getOutputCol()
+True
+>>> modelPath = temp_path + "/mh-model"
+>>> model.save(modelPath)
+>>> model2 = MinHashLSHModel.load(modelPath)
+
+.. versionadded:: 2.2.0
+"""
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+__init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+"""
+super(MinHashLSH, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.MinHashLSH", self.uid)
+self._setDefault(numHashTables=1)
+kwargs = self.__init__._input_kwargs
+self.setParams(**kwargs)
+
+@keyword_only
+@since("2.2.0")
+def setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+Sets params for this MinHashLSH.
+"""
+kwargs = self.setParams._input_kwargs
+return self._set(**kwargs)
+
+def _create_model(self, java_model):
+return MinHashLSHModel(java_model)
+
+
+class MinHashLSHModel(JavaModel, LSHModel, JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Model produced by :py:class:`MinHashLSH`, where multiple hash 
functions are stored. Each
+hash function is picked from the following family of hash functions, 
where :math:`a_i` and
+:math:`b_i` are randomly chosen integers less than prime:
+:math:`h_i(x) = ((x \cdot a_i + b_i) \mod prime)` This hash family is 
approximately min-wise
+independent according to the reference.
+
+.. seealso:: Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise 
independent linear \
+permutations." Electronic Journal of Combinatorics 7 (2000): R26.
+
+.. versionadded:: 2.2.0
+"""
+
+@property
+@since("2.2.0")
+def randCoefficients(self):
--- End diff --

Removed
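
As a plain-Python sketch of the hash family in the docstring above,
h_i(x) = ((x * a_i + b_i) mod prime): the prime, the random coefficients and the
example sets below are illustrative assumptions, not the values Spark generates.

    import random

    prime = 2147483647  # 2**31 - 1, a Mersenne prime, used here only for illustration
    random.seed(12345)
    num_tables = 3
    # One (a_i, b_i) pair per hash table, each randomly chosen below the prime.
    coeffs = [(random.randint(1, prime - 1), random.randint(0, prime - 1))
              for _ in range(num_tables)]

    def min_hash_signature(elements):
        """MinHash signature of a set of non-negative integer elements."""
        return [min((x * a + b) % prime for x in elements) for (a, b) in coeffs]

    # Similar sets tend to agree on some signature positions.
    print(min_hash_signature({0, 1, 2}))
    print(min_hash_signature({0, 2, 4}))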


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193043
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192985
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
--- End diff --

That was a mistake. Sorry about it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192933
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values", 
bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done for Scala/Java/Python Examples.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192881
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192685
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
+
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+# Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+
+# Approximate nearest neighbor search
+model.approxNearestNeighbors(dfA, key, 2).show()
--- End diff --

Increased the number of HashTables.
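
For readers following the thread: approxNearestNeighbors only considers rows whose
hashes collide with the key, so with very few hash tables it can return fewer rows
than requested. A hedged sketch of the mitigation mentioned above, with illustrative
data and an assumed numHashTables value:

    from pyspark.ml.feature import MinHashLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dfA = spark.createDataFrame([(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
                                 (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
                                 (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)],
                                ["id", "keys"])
    key = Vectors.sparse(6, [1, 3], [1.0, 1.0])

    # More hash tables lower the false negative rate, so the search is more likely
    # to find the 2 requested neighbors, at the cost of extra hashing work.
    model = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=5).fit(dfA)
    model.approxNearestNeighbors(dfA, key, 2).show()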


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192402
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
--- End diff --

Done in Scala/Java doc as well.


---
If your project is set up for it, you can reply to this email and have yo

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192347
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192333
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192298
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192314
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192074
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192026
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
--- End diff --

It's not alphabetized here because the declaration order matters for the 
PySpark shell.
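
A tiny illustration of the ordering constraint: in Python a mixin must be defined
before any class that lists it as a base, so alphabetizing the declarations would
break the module at import time. The names below are stand-ins, not the real
pyspark classes:

    # This order works: the mixin already exists when the subclass is defined.
    class LSHModelMixin(object):
        pass

    class MinHashModel(LSHModelMixin):
        pass

    # Swapping the two definitions would raise
    #   NameError: name 'LSHModelMixin' is not defined
    # as soon as the module is imported.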


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-02-06 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
@yanboliang, just a friendly reminder: please don't forget to review the PR 
when you have time. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-01-28 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
Thanks very much, @yanboliang ~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-01-26 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16715
  
@yanboliang @jkbradley Please take a look. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-01-26 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/16715

[Spark-18080][ML] Python API & Examples for Locality Sensitive Hashing

## What changes were proposed in this pull request?
This pull request includes the Python API and examples for LSH. The API changes 
are based on @yanboliang's PR #15768, with merge conflicts resolved and updates to 
match the Scala API changes. The examples are consistent with the Scala examples of 
MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
The API and examples were tested using spark-submit:
bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark spark-18080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16715


commit 85d22c37d3fe0b907f2eaf892729d087f9efb76c
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-04T14:22:23Z

Locality Sensitive Hashing (LSH) Python API.

commit cdeca1cdd8ed61274137c3012ba49ff57d459190
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-11-04T14:44:52Z

Fix typos.

commit 66d308bb6d5d254057b0de9217f87391f269aaed
Author: Yun Ni <y...@uber.com>
Date:   2017-01-25T21:11:57Z

Merge branch 'spark-18080' of https://github.com/yanboliang/spark into 
spark-18080

commit d62a2d0d6cdd1e4cb0626bacfe389274db42a11c
Author: Yun Ni <y...@uber.com>
Date:   2017-01-26T00:59:15Z

Merge branch 'master' of https://github.com/apache/spark into spark-18080

commit dafc4d120c0606ccd2be892fb2618a1df676ccd3
Author: Yun Ni <y...@uber.com>
Date:   2017-01-26T01:23:53Z

Changes to fix LSH Python API

commit ac1f4f7190192a3ee6fd8a311a0036e1546e4592
Author: Yunni <euler57...@gmail.com>
Date:   2017-01-26T05:08:47Z

Merge branch 'spark-18080' of https://github.com/Yunni/spark into 
spark-18080

commit 3a21f2666c907d6d520771b4343af7d877d689bb
Author: Yunni <euler57...@gmail.com>
Date:   2017-01-26T07:20:12Z

Fix examples and class definition

commit 65dab3ec32f423936f2cb310bbfbc312ece8ac54
Author: Yun Ni <y...@uber.com>
Date:   2017-01-26T20:19:22Z

Add python examples and updated the user guide




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15795: [SPARK-18081][ML][DOCS] Add user guide for Locality Sens...

2016-12-02 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15795
  
@MLnick @jkbradley I have changed the examples so that there is one example per 
algorithm, each demonstrating transform, approxNearestNeighbors, and 
approxSimilarityJoin. PTAL.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736878
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and a false negative if the two features are close in distance but 
are not hashed into the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 
1.0)))` means there are 10 elements in the space. This set contains elements 2, 
3 and 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHas
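
As a plain-Scala aside on the hash function quoted above, h(x) = floor((x · v) / r) reduces to one dot product and one division per hash table; the vector values and bucket length below are made-up illustration numbers, not defaults from `spark.ml`.

val x = Array(1.0, 2.0, 3.0)                        // input feature vector
val v = Array(0.6, 0.0, 0.8)                        // random unit vector (0.6^2 + 0.8^2 = 1)
val r = 2.0                                         // user-defined bucket length

val dot = x.zip(v).map { case (a, b) => a * b }.sum // x . v = 0.6 + 0.0 + 2.4 = 3.0
val bucket = math.floor(dot / r).toInt              // floor(3.0 / 2.0) = 1

// Points whose projections land in the same length-r interval share a bucket,
// which is why a larger bucket length makes collisions more likely.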

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736862
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736883
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("ApproxSimilarityJoinExample")
+  .getOrCreate()
+
+// $example on$
+val dfA = spark.createDataFrame(Seq(
+  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0,
+  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0,
+  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val dfB = spark.createDataFrame(Seq(
+  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0,
+  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0,
+  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val mh = new MinHashLSH()
+  .setNumHashTables(5)
+  .setInputCol("keys")
+  .setOutputCol("values")
+
+val model = mh.fit(dfA)
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+// Cache the transformed columns
--- End diff --

Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
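
A small follow-up sketch to the quoted example: the DataFrame returned by `approxSimilarityJoin` nests the matched rows under `datasetA` and `datasetB` and reports the distance in `distCol` (the default distance column name), so the pairs can be flattened for inspection. This reuses `model`, `dfA`, and `dfB` from the example above.

import org.apache.spark.sql.functions.col

model.approxSimilarityJoin(dfA, dfB, 0.6)
  .select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("distCol"))
  .show()
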



[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736839
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736831
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736852
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736546
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736531
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and a false negative if the two features are close in distance but 
are not hashed into the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 
1.0)))` means there are 10 elements in the space. This set contains elements 2, 
3 and 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
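
To make the set encoding discussed in the quoted passage concrete, the sketch below builds the vectors MinHash expects, reusing the guide's own example of the set {2, 3, 5} in a universe of 10 elements.

import org.apache.spark.ml.linalg.Vectors

// Sparse encoding: indices mark which elements are present,
// and any non-zero value is read as a binary "1".
val sparseSet = Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))

// Equivalent dense encoding (accepted, but usually wasteful for large element spaces).
val denseSet = Vectors.dense(0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0)

// A vector with no non-zero entries would represent the empty set and,
// per the note above, cannot be transformed by MinHash.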

[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736515
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and a false negative if the two features are close in distance but 
are not hashed into the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
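
For the Euclidean-distance family discussed in the quoted hunk, a minimal end-to-end sketch of fitting the model and looking at the resulting hash buckets is below; it uses the renamed BucketedRandomProjectionLSH API, and the data and parameter values are illustrative.

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object BucketedRandomProjectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BucketedRandomProjectionSketch").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 1.0)),
      (1, Vectors.dense(1.0, -1.0)),
      (2, Vectors.dense(-1.0, -1.0))
    )).toDF("id", "features")

    val brp = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)      // larger bucket length -> more collisions per bucket
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = brp.fit(df)
    model.transform(df).show(truncate = false) // appends the "hashes" column

    spark.stop()
  }
}
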



[GitHub] spark pull request #15795: [SPARK-18081][ML][DOCS] Add user guide for Locali...

2016-12-02 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90736506
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15795: [SPARK-18081] Add user guide for Locality Sensitive Hash...

2016-11-28 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15795
  
@sethah I think so. I have made changes for the docs but I haven't made 
changes to the examples. Please take a look when you get a chance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-27 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
@jkbradley If you don't have more comments, can we merge this because I 
need to change the examples in #15795 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711243
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711247
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
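
For reference, the corrected inequalities (as they read in the later revision quoted earlier in this thread) bound the collision probability from below by `p1` for close pairs and from above by `p2`, not `p1`, for distant pairs:

\[
\forall p, q \in M,\\
d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
\]
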



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711255
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExampl

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711233
  
--- Diff: docs/ml-features.md ---

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711231
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
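
For example (an illustration added here, not a statement about any particular implementation): 
for a MinHash function drawn from a truly random permutation, the collision probability equals 
the Jaccard similarity, `Pr(h(A)=h(B)) = 1 - d(A, B)`, so with `r1 = 0.2` and `r2 = 0.6` 
the family is `(0.2, 0.6, 0.8, 0.4)`-sensitive.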
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash are represented as vectors whose dimension equals 
the total number of elements in the space. Each dimension of the vector 
represents the status of an element: a zero value means the element is not in 
the set; a non-zero value means the set contains the corresponding element. For 
example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least one non-zero index.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExampl

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711207
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
is an important class of hashing techniques, which is commonly used in 
clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash are represented as vectors whose dimension equals 
the total number of elements in the space. Each dimension of the vector 
represents the status of an element: a zero value means the element is not in 
the set; a non-zero value means the set contains the corresponding element. For 
example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least one non-zero index.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExampl

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711204
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
is an important class of hashing techniques, which is commonly used in 
clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash are represented as vectors whose dimension equals 
the total number of elements in the space. Each dimension of the vector 
represents the status of an element: a zero value means the element is not in 
the set; a non-zero value means the set contains the corresponding element. For 
example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least one non-zero index.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExampl

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711166
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
is an important class of hashing techniques, which is commonly used in 
clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash are represented as vectors whose dimension equals 
the total number of elements in the space. Each dimension of the vector 
represents the status of an element: a zero value means the element is not in 
the set; a non-zero value means the set contains the corresponding element. For 
example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least one non-zero index.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apac

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711162
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
is an important class of hashing techniques, which is commonly used in 
clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711165
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
is an important class of hashing techniques, which is commonly used in 
clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same bucket even though they are far apart in 
distance, and a false negative if the two features are close in distance but 
are hashed into different buckets.
+
+## Random Projection for Euclidean Distance
+**Note:** This is different from the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the 
projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means a higher probability for features to 
be hashed into the same bucket.
+
+The input features in Euclidean space are represented as vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash are represented as vectors whose dimension equals 
the total number of elements in the space. Each dimension of the vector 
represents the status of an element: a zero value means the element is not in 
the set; a non-zero value means the set contains the corresponding element. For 
example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

Done. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
Thanks @sethah ! Your comment was very helpful and detailed :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
@sethah PTAL


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89215405
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala 
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are 
stored. Each hash function is
- * a perfect hash function:
- *`h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from 
`Z_prime^*`
+ * Model produced by [[MinHashLSH]], where multiple hash functions are 
stored. Each hash function is
+ * picked from a hash family for a specific set `S` with cardinality equal 
to `numEntries`:
+ *`h_i(x) = ((x \cdot a_i + b_i) \mod prime) \mod numEntries`
+ *
+ * This hash family is approximately min-wise independent according to the 
reference.
  *
  * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on 
Perfect Hash Function]]
+ * 
[[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.8215&rep=rep1&type=pdf
 Min-wise
+ * independent permutations]]
  *
- * @param numEntries The number of entries of the hash functions.
- * @param randCoefficients An array of random coefficients, each used by 
one hash function.
+ * @param randCoefficients Pairs of random coefficients. Each pair is used 
by one hash function.
  */
 @Experimental
 @Since("2.1.0")
-class MinHashModel private[ml] (
+class MinHashLSHModel private[ml](
 override val uid: String,
-@Since("2.1.0") val numEntries: Int,
-@Since("2.1.0") val randCoefficients: Array[Int])
-  extends LSHModel[MinHashModel] {
+private[ml] val randCoefficients: Array[(Int, Int)])
+  extends LSHModel[MinHashLSHModel] {
 
   @Since("2.1.0")
-  override protected[ml] val hashFunction: Vector => Vector = {
-elems: Vector =>
+  override protected[ml] val hashFunction: Vector => Array[Vector] = {
+elems: Vector => {
   require(elems.numNonzeros > 0, "Must have at least 1 non zero 
entry.")
   val elemsList = elems.toSparse.indices.toList
-  val hashValues = randCoefficients.map({ randCoefficient: Int =>
-  elemsList.map({elem: Int =>
-(1 + elem) * randCoefficient.toLong % MinHash.prime % 
numEntries
-  }).min.toDouble
+  val hashValues = randCoefficients.map({ case (a: Int, b: Int) =>
+elemsList.map { elem: Int =>
+  ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
+}.min.toDouble
   })
-  Vectors.dense(hashValues)
+  // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
+  hashValues.grouped(1).map(Vectors.dense).toArray
--- End diff --

I see. It's `dense(firstValue: Double, otherValues: Double*)`.
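
For reference, a small sketch of the two overloads in question (illustrative only, using 
`org.apache.spark.ml.linalg.Vectors`):

```scala
import org.apache.spark.ml.linalg.Vectors

val hashValues = Array(9.0, 2.0)

// Varargs overload: dense(firstValue: Double, otherValues: Double*)
val v1 = Vectors.dense(9.0, 2.0)

// Array overload: dense(values: Array[Double])
val v2 = Vectors.dense(hashValues)

// One single-element vector per hash value, as in the snippet above.
val perFunction = hashValues.grouped(1).map(Vectors.dense).toArray
```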


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89215190
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala 
---
@@ -112,25 +116,26 @@ class MinHash(override val uid: String) extends 
LSH[MinHashModel] with HasSeed {
   override def setOutputCol(value: String): this.type = 
super.setOutputCol(value)
 
   @Since("2.1.0")
-  override def setOutputDim(value: Int): this.type = 
super.setOutputDim(value)
+  override def setNumHashTables(value: Int): this.type = 
super.setNumHashTables(value)
 
   @Since("2.1.0")
   def this() = {
-this(Identifiable.randomUID("min hash"))
+this(Identifiable.randomUID("mh-lsh"))
   }
 
   /** @group setParam */
   @Since("2.1.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
   @Since("2.1.0")
-  override protected[ml] def createRawLSHModel(inputDim: Int): 
MinHashModel = {
-require(inputDim <= MinHash.prime / 2,
-  s"The input vector dimension $inputDim exceeds the threshold 
${MinHash.prime / 2}.")
+  override protected[ml] def createRawLSHModel(inputDim: Int): 
MinHashLSHModel = {
+require(inputDim <= MinHashLSH.HASH_PRIME,
+  s"The input vector dimension $inputDim exceeds the threshold 
${MinHashLSH.HASH_PRIME}.")
 val rand = new Random($(seed))
-val numEntry = inputDim * 2
-val randCoofs: Array[Int] = Array.fill($(outputDim))(1 + 
rand.nextInt(MinHash.prime - 1))
-new MinHashModel(uid, numEntry, randCoofs)
+val randCoefs: Array[(Int, Int)] = Array.fill(2 * $(numHashTables)) {
--- End diff --

Unit tests added in LSHTest.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89215142
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala ---
@@ -97,12 +118,31 @@ class MinHashSuite extends SparkFunSuite with 
MLlibTestSparkContext with Default
   (0 until 100).filter(_.toString.contains("1")).map((_, 1.0)))
 
 val (precision, recall) = LSHTest.calculateApproxNearestNeighbors(mh, 
dataset, key, 20,
-  singleProbing = true)
+  singleProbe = true)
 assert(precision >= 0.7)
 assert(recall >= 0.7)
   }
 
-  test("approxSimilarityJoin for minhash on different dataset") {
+  test("approxNearestNeighbors for numNeighbors <= 0") {
+val mh = new MinHashLSH()
+  .setNumHashTables(20)
+  .setInputCol("keys")
+  .setOutputCol("values")
+  .setSeed(12345)
+
+val key: Vector = Vectors.sparse(100,
+  (0 until 100).filter(_.toString.contains("1")).map((_, 1.0)))
+
+val model = mh.fit(dataset)
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175604
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala
 ---
@@ -43,70 +43,73 @@ class RandomProjectionSuite
   }
 
   test("params") {
-ParamsSuite.checkParams(new RandomProjection)
-val model = new RandomProjectionModel("rp", randUnitVectors = 
Array(Vectors.dense(1.0, 0.0)))
+ParamsSuite.checkParams(new BucketedRandomProjectionLSH)
+val model = new BucketedRandomProjectionLSHModel(
+  "brp", randUnitVectors = Array(Vectors.dense(1.0, 0.0)))
 ParamsSuite.checkParams(model)
   }
 
-  test("RandomProjection: default params") {
-val rp = new RandomProjection
-assert(rp.getOutputDim === 1.0)
+  test("BucketedRandomProjectionLSH: default params") {
+val brp = new BucketedRandomProjectionLSH
+assert(brp.getNumHashTables === 1.0)
   }
 
   test("read/write") {
-def checkModelData(model: RandomProjectionModel, model2: 
RandomProjectionModel): Unit = {
+def checkModelData(
+  model: BucketedRandomProjectionLSHModel,
+  model2: BucketedRandomProjectionLSHModel
+): Unit = {
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175438
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   }
 
   /**
-   * Overloaded method for approxNearestNeighbors. Use Single Probing as 
default way to search
-   * nearest neighbors and "distCol" as default distCol.
+   * Given a large dataset and an item, approximately find at most k items 
which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method 
will transform the data; if
+   * the [[outputCol]] exists, it will use the [[outputCol]]. This allows 
caching of the
+   * transformed data when necessary.
+   *
+   * NOTE: This method is experimental and will likely change behavior in 
the next release.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175473
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   }
 
   /**
-   * Overloaded method for approxNearestNeighbors. Use Single Probing as 
default way to search
-   * nearest neighbors and "distCol" as default distCol.
+   * Given a large dataset and an item, approximately find at most k items 
which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method 
will transform the data; if
+   * the [[outputCol]] exists, it will use the [[outputCol]]. This allows 
caching of the
+   * transformed data when necessary.
+   *
+   * NOTE: This method is experimental and will likely change behavior in 
the next release.
+   *
+   * @param dataset the dataset to search for nearest neighbors of the key
+   * @param key Feature vector representing the item to search for
+   * @param numNearestNeighbors The maximum number of nearest neighbors
+   * @param distCol Output column for storing the distance between each 
result row and the key
+   * @return A dataset containing at most k items closest to the key. A 
distCol is added to show
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175497
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala 
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are 
stored. Each hash function is
- * a perfect hash function:
- *`h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from 
`Z_prime^*`
+ * Model produced by [[MinHashLSH]], where multiple hash functions are 
stored. Each hash function is
+ * picked from a hash family for a specific set `S` with cardinality equal 
to `numEntries`:
+ *`h_i(x) = ((x \cdot a_i + b_i) \mod prime) \mod numEntries`
--- End diff --

Fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175448
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   }
 
   /**
-   * Overloaded method for approxNearestNeighbors. Use Single Probing as 
default way to search
-   * nearest neighbors and "distCol" as default distCol.
+   * Given a large dataset and an item, approximately find at most k items 
which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method 
will transform the data; if
+   * the [[outputCol]] exists, it will use the [[outputCol]]. This allows 
caching of the
+   * transformed data when necessary.
+   *
+   * NOTE: This method is experimental and will likely change behavior in 
the next release.
+   *
+   * @param dataset the dataset to search for nearest neighbors of the key
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-22 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r89175401
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala 
---
@@ -31,36 +31,38 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are 
stored. Each hash function is
- * a perfect hash function:
- *`h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from 
`Z_prime^*`
+ * Model produced by [[MinHashLSH]], where multiple hash functions are 
stored. Each hash function is
+ * picked from a hash family for a specific set `S` with cardinality equal 
to `numEntries`:
+ *`h_i(x) = ((x \cdot a_i + b_i) \mod prime) \mod numEntries`
+ *
+ * This hash family is approximately min-wise independent according to the 
reference.
  *
  * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on 
Perfect Hash Function]]
+ * 
[[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.8215&rep=rep1&type=pdf
 Min-wise
+ * independent permutations]]
  *
- * @param numEntries The number of entries of the hash functions.
- * @param randCoefficients An array of random coefficients, each used by 
one hash function.
+ * @param randCoefficients Pairs of random coefficients. Each pair is used 
by one hash function.
  */
 @Experimental
 @Since("2.1.0")
-class MinHashModel private[ml] (
+class MinHashLSHModel private[ml](
 override val uid: String,
-@Since("2.1.0") val numEntries: Int,
-@Since("2.1.0") val randCoefficients: Array[Int])
-  extends LSHModel[MinHashModel] {
+private[ml] val randCoefficients: Array[(Int, Int)])
+  extends LSHModel[MinHashLSHModel] {
 
   @Since("2.1.0")
-  override protected[ml] val hashFunction: Vector => Vector = {
-elems: Vector =>
+  override protected[ml] val hashFunction: Vector => Array[Vector] = {
+elems: Vector => {
   require(elems.numNonzeros > 0, "Must have at least 1 non zero 
entry.")
   val elemsList = elems.toSparse.indices.toList
-  val hashValues = randCoefficients.map({ randCoefficient: Int =>
-  elemsList.map({elem: Int =>
-(1 + elem) * randCoefficient.toLong % MinHash.prime % 
numEntries
-  }).min.toDouble
+  val hashValues = randCoefficients.map({ case (a: Int, b: Int) =>
+elemsList.map { elem: Int =>
+  ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
+}.min.toDouble
   })
-  Vectors.dense(hashValues)
+  // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
+  hashValues.grouped(1).map(Vectors.dense).toArray
--- End diff --

Vectors.dense takes an array instead of a single number.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
Hi @sethah, grouping into a number of buckets does not really affect the 
independence since p is a much larger prime. For example, in 
http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf, they use "mod b".

Since we don't care about the hash universe here, I am OK with changing to 
`(ax + b mod p)` if you think that makes more sense?
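
As a rough sketch of that family (illustrative only; the coefficients, prime and bucket 
count below are arbitrary values, not anything from the PR):

```scala
import scala.util.Random

// 2-independent family: h(x) = ((a * x + b) mod p) mod numBuckets
val p = 2147483647L                           // a large prime (2^31 - 1)
val numBuckets = 1 << 20
val rand = new Random(12345)
val a = 1L + rand.nextInt(Int.MaxValue - 1)   // a in [1, p - 1]
val b = rand.nextInt(Int.MaxValue).toLong     // b in [0, p - 1]

// Assumes x is small enough that a * x does not overflow a Long.
def h(x: Long): Long = ((a * x + b) % p) % numBuckets
```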


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
@jkbradley Awesome, thanks so much! :) Now that the API is finalized, I 
will work on the User Doc


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15874
  
Hi @jkbradley,

**MinHash** 
Yes, I agree that I shouldn't have said it's perfect hashing. 
Theoretically, it should be a Min-wise Independent Permutation Family. What we 
use here is a 2-independent (or 2-universal) hash family, which is 
approximately min-wise independent.
Reference: http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf

**approxNearestNeighbors**
I still think that in the case of OR-amplification, the only way is to scan a 
number of candidates equal to k times the average bucket size. I would like to 
understand more about what you proposed. I have left a note in the scaladoc 
so we can discuss this further in future releases.

**AND-amplification**
I've opened a ticket, SPARK-18450, for AND-amplification. I am wondering if 
we are including it in 2.1.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r88569303
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala
 ---
@@ -43,70 +43,72 @@ class RandomProjectionSuite
   }
 
   test("params") {
-ParamsSuite.checkParams(new RandomProjection)
-val model = new RandomProjectionModel("rp", randUnitVectors = 
Array(Vectors.dense(1.0, 0.0)))
+ParamsSuite.checkParams(new BucketedRandomProjectionLSH)
+val model = new BucketedRandomProjectionLSHModel(
+  "brp", randUnitVectors = Array(Vectors.dense(1.0, 0.0)))
 ParamsSuite.checkParams(model)
   }
 
-  test("RandomProjection: default params") {
-val rp = new RandomProjection
-assert(rp.getOutputDim === 1.0)
+  test("BucketedRandomProjectionLSH: default params") {
+val brp = new BucketedRandomProjectionLSH
+assert(brp.getNumHashTables === 1.0)
   }
 
   test("read/write") {
-def checkModelData(model: RandomProjectionModel, model2: 
RandomProjectionModel): Unit = {
+def checkModelData(
+  model: BucketedRandomProjectionLSHModel,
+  model2: BucketedRandomProjectionLSHModel
+): Unit = {
   model.randUnitVectors.zip(model2.randUnitVectors)
 .foreach(pair => assert(pair._1 === pair._2))
 }
-val mh = new RandomProjection()
+val mh = new BucketedRandomProjectionLSH()
 val settings = Map("inputCol" -> "keys", "outputCol" -> "values", 
"bucketLength" -> 1.0)
 testEstimatorAndModelReadWrite(mh, dataset, settings, checkModelData)
   }
 
   test("hashFunction") {
 val randUnitVectors = Array(Vectors.dense(0.0, 1.0), 
Vectors.dense(1.0, 0.0))
-val model = new RandomProjectionModel("rp", randUnitVectors)
+val model = new BucketedRandomProjectionLSHModel("brp", 
randUnitVectors)
 model.set(model.bucketLength, 0.5)
 val res = model.hashFunction(Vectors.dense(1.23, 4.56))
-assert(res.equals(Vectors.dense(9.0, 2.0)))
+assert(res(0).equals(Vectors.dense(9.0)))
--- End diff --

Added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


