[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395065
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395345
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394053
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394630
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("ApproxSimilarityJoinExample")
+  .getOrCreate()
+
+// $example on$
+val dfA = spark.createDataFrame(Seq(
+  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0,
+  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0,
+  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val dfB = spark.createDataFrame(Seq(
+  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0,
+  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0,
+  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val mh = new MinHashLSH()
+  .setNumHashTables(5)
+  .setInputCol("keys")
+  .setOutputCol("values")
+
+val model = mh.fit(dfA)
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+// Cache the transformed columns
+val transformedA = model.transform(dfA)
+val transformedB = model.transform(dfB)
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+// Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
--- End diff --

Just a note - will `approxSimilarityJoin` return duplicates? We should 
think about removing them automatically then?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395495
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395294
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395451
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394871
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object MinHashLSHExample {
--- End diff --

This and the min hash transformation example are almost the same. Perhaps 
we can remove the transformation example, and adjust the user guide section to 
refer (and link) to the code examples for `MinHashLSH` and 
`BucketedRandomProjectionLSH`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393584
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
--- End diff --

Should be updated to `MinHashLSH`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395459
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the 
vector indices represent the elements themselves and the non-zero values in the 
vector represent the presence of that element in the set. While both dense and 
sparse vectors are supported, typically sparse vectors are recommended for 
efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 
1.0)])` means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394571
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("ApproxSimilarityJoinExample")
+  .getOrCreate()
+
+// $example on$
+val dfA = spark.createDataFrame(Seq(
+  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0,
+  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0,
+  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val dfB = spark.createDataFrame(Seq(
+  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0,
+  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0,
+  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0
+)).toDF("id", "keys")
+
+val mh = new MinHashLSH()
+  .setNumHashTables(5)
+  .setInputCol("keys")
+  .setOutputCol("values")
+
+val model = mh.fit(dfA)
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+// Cache the transformed columns
--- End diff --

This mentions caching but doesn't cache. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393279
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
--- End diff --

Same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393263
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an 
important class of hashing techniques, which is commonly used in clustering, 
approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
--- End diff --

This Scaladoc link should be for `BucketedRandomProjection` now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711243
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711247
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711255
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+
 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711233
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+
 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711231
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+
 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711207
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+
 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711204
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+
 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711166
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711162
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-27 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r89711165
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

Done. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87116504
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
   

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113033
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
--- End diff --

p2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113839
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

Actually, I would at least mention that `d` is a distance function. It is 
important in the context of LSH.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87113728
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
--- End diff --

You should either link to the definition of a metric space, or explain what 
M and d are.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87116350
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random 
projection. A larger bucket lowers the false negative rate but usually 
increases running time and false positive rate.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random [perfect hash 
function](https://en.wikipedia.org/wiki/Perfect_hash_function) `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

You should mention this property, because it is not very intuitive and 
forms the basis of using MinHash to approximate Jacquard distance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87116745
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
   

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87116485
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
   

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87165516
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
--- End diff --

Perhaps we can say something like "also referred to as p-stable 
distributions" or similar?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87164912
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
--- End diff --

`Locality Sensitive Hashing(LSH)` will show up twice here - I think you can 
just keep the link one 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175662
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87173480
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
--- End diff --

"... non-zero entry" perhaps


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175435
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87168841
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
--- End diff --

This is a little unclear - should we say "RandomProjection accepts 
arbitrary vectors as input features, and supports both sparse and dense 
vectors". 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87169463
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

Perhaps we can clarify here - "The input sets for MinHash are represented 
as _binary_ vectors, where the vector indices represent the elements themselves 
and the _non-zero_ values in the vector represent the presence of that element 
in the set. While both dense and sparse vectors are supported, typically sparse 
vectors are recommended for efficiency. For example, ..."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175248
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175398
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175604
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any 
input vector must have at least 1 non-zero indices.
+
+
+
+
+Refer to the [MinHash Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashExample.scala 
%}
+
+
+
+
+Refer to the [MinHash Java 
docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaMinHashExample.java %}
+

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87175944
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` 
that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length. The bucket length can be used to control the average size of 
hash buckets. A larger bucket length means higher probability for features to 
be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random perfect hash function `g` to 
each elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

Do we require (check) in `MinHash` that the input vectors are binary? Or do 
we just treat any non-zero value as 1 effectively? Maybe mention it whichever 
it is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: 

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-09 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r87164985
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,149 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) 
Locality Sensitive Hashing(LSH) is an important class of hashing techniques, 
which is commonly used in clustering and outlier detection with large datasets. 
--- End diff --

You could mention approximate nearest neighbour search as a common use case 
too?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86889880
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/RandomProjectionExample.scala
 ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.RandomProjection
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object RandomProjectionExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("MinHashExample")
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86889863
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.RandomProjection;
+import org.apache.spark.ml.feature.RandomProjectionModel;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaRandomProjectionExample {
+public static void main(String[] args) {
+SparkSession spark = SparkSession
+.builder()
+.appName("JavaMinHashExample")
--- End diff --

Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86889844
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random 
projection. A larger bucket lowers the false negative rate but usually 
increases running time and false positive rate.
--- End diff --

Fixed. PTAL


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86889702
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.RandomProjection;
+import org.apache.spark.ml.feature.RandomProjectionModel;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaRandomProjectionExample {
+public static void main(String[] args) {
+SparkSession spark = SparkSession
+.builder()
--- End diff --

Used 2 space as in other java examples.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86889774
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
--- End diff --

Rephrased. PTAL


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86877678
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
--- End diff --

No, that is tracked in #18082.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread bravo-zhang
Github user bravo-zhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86795061
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.RandomProjection;
+import org.apache.spark.ml.feature.RandomProjectionModel;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaRandomProjectionExample {
+public static void main(String[] args) {
+SparkSession spark = SparkSession
+.builder()
+.appName("JavaMinHashExample")
--- End diff --

JavaRandomProjectionExample


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread bravo-zhang
Github user bravo-zhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86795179
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/RandomProjectionExample.scala
 ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.RandomProjection
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object RandomProjectionExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("MinHashExample")
--- End diff --

RandomProjectionExample


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86742539
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.RandomProjection;
+import org.apache.spark.ml.feature.RandomProjectionModel;
+import org.apache.spark.ml.linalg.VectorUDT;
+import org.apache.spark.ml.linalg.Vectors;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaRandomProjectionExample {
+public static void main(String[] args) {
+SparkSession spark = SparkSession
+.builder()
--- End diff --

Fix the indentation in many of the Java examples -- 4 space for 
continuation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86744662
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random 
projection. A larger bucket lowers the false negative rate but usually 
increases running time and false positive rate.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in 
`spark.ml` for Jaccard distance where input features are sets of natural 
numbers. Jaccard distance of two sets is defined by the cardinality of their 
intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap 
\mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random [perfect hash 
function](https://en.wikipedia.org/wiki/Perfect_hash_function) `g` to each 
elements in the set and take the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals 
the total number of elements in the space. Each dimension of the vectors 
represents the status of an elements: zero value means the elements is not in 
the set; non-zero value means the set contains the corresponding elements. For 
example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there 
are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
--- End diff --

Same sort of comment: isn't the intuition that MinHash approximates the 
Jaccard similarity without actually computing it completely?

This may again reduce to my different understanding of the technique, but 
MinHash isn't a hashing technique per se. it relies on a given family of hash 
functions to approximate set similarity. I'm finding it a little hard to view 
it as an 'LSH family'


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86744734
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random 
projection. A larger bucket lowers the false negative rate but usually 
increases running time and false positive rate.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/RandomProjectionExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java %}
+
+
+
+## MinHash for Jaccard Distance
--- End diff --

Similarity, not distance, right? it's higher when they overlap more.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86746611
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
--- End diff --

Just an aside, but this random projection + cosine distance technique is 
the main thing I think of when I think of "LSH". Is that not implemented?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86743892
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
--- End diff --

Despite the opening sentence of the wikipedia article, I wouldn't class LSH 
as a dimensionality reduction technique? It's a set of hashing techniques where 
the hash preserves some properties. Maybe it's just my taste. But the rest of 
the text talks about the output as hash values.

What does "machine-learned ranking" refer to here? as this isn't a ranking 
technique per se.

I think this is missing a broad summary statement that indicates why LSH is 
even of interest: it provides a hash function where hashed values are in some 
sense close when the input values are close according to some metric. And then 
the variations below plug in different definitions of "close" and "input".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-07 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r86744459
  
--- Diff: docs/ml-features.md ---
@@ -1396,3 +1396,134 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive 
Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a 
class of dimension reduction hash families, which can be used as both feature 
transformation and machine-learned ranking. Difference distance metric has its 
own LSH family class in `spark.ml`, which can transform feature columns to hash 
values as new columns. Besides feature transforming, `spark.ml` also 
implemented approximate nearest neighbor algorithm and approximate similarity 
join algorithm using LSH.
+
+In this section, we call a pair of input features a false positive if the 
two features are hashed into the same hash bucket but they are far away in 
distance, and we define false negative as the pair of features when their 
distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection 
for cosine 
distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divide the 
projected results to hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is user-defined 
bucket length.
+
+The input features in Euclidean space are represented in vectors. Both 
sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random 
projection. A larger bucket lowers the false negative rate but usually 
increases running time and false positive rate.
--- End diff --

Isn't the point here that near vectors end up in nearby buckets? I feel 
like this is skipping the intuition of why you would care about this technique 
or when you'd use it. Not like this needs a whole paragraph, just a few 
sentences of pointers?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-06 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/15795

[SPARK-18081] Add user guide for Locality Sensitive Hashing(LSH)

## What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java 
examples in spark-examples.

## How was this patch tested?
Doc has been generated through Jekyll, and checked through manual 
inspection.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark SPARK-18081-lsh-guide

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15795.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15795


commit 8c7971bdcff9eeedc9a97936b4c2e0aac93c4edf
Author: Yunni 
Date:   2016-11-07T07:23:36Z

[SPARK-18081] Add user guide to LSH




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org