[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100426821
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java 
---
@@ -44,25 +45,67 @@ public static void main(String[] args) {
   .getOrCreate();
 
 // $example on$
-List<Row> data = Arrays.asList(
+List<Row> dataA = Arrays.asList(
   RowFactory.create(0, Vectors.sparse(6, new int[]{0, 1, 2}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 4}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(2, Vectors.sparse(6, new int[]{0, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
 );
 
+List<Row> dataB = Arrays.asList(
+  RowFactory.create(0, Vectors.sparse(6, new int[]{1, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(2, Vectors.sparse(6, new int[]{1, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
+);
+
 StructType schema = new StructType(new StructField[]{
   new StructField("id", DataTypes.IntegerType, false, 
Metadata.empty()),
-  new StructField("keys", new VectorUDT(), false, Metadata.empty())
+  new StructField("features", new VectorUDT(), false, Metadata.empty())
 });
-Dataset<Row> dataFrame = spark.createDataFrame(data, schema);
+Dataset<Row> dfA = spark.createDataFrame(dataA, schema);
+Dataset<Row> dfB = spark.createDataFrame(dataB, schema);
+
+int[] indicies = {1, 3};
+double[] values = {1.0, 1.0};
+Vector key = Vectors.sparse(6, indicies, values);
 
 MinHashLSH mh = new MinHashLSH()
-  .setNumHashTables(1)
-  .setInputCol("keys")
-  .setOutputCol("values");
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes");
+
+MinHashLSHModel model = mh.fit(dfA);
+
+// Feature Transformation
+System.out.println("The hashed dataset where hashed values are stored 
in the column 'values':");
+model.transform(dfA).show();
+// Cache the transformed columns
+Dataset<Row> transformedA = model.transform(dfA).cache();
+Dataset<Row> transformedB = model.transform(dfB).cache();
+
+// Approximate similarity join
+System.out.println("Approximately joining dfA and dfB on distance 
smaller than 0.6:");
+model.approxSimilarityJoin(dfA, dfB, 0.6)
+  .select("datasetA.id", "datasetB.id", "distCol")
+  .show();
+System.out.println("Joining cached datasets to avoid recomputing the 
hash values:");
+model.approxSimilarityJoin(transformedA, transformedB, 0.6)
+  .select("datasetA.id", "datasetB.id", "distCol")
+  .show();
+
+// Self Join
+System.out.println("Approximately self join of dfB on distance smaller 
than 0.6:");
--- End diff --

same comments as above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100424220
  
--- Diff: 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "features"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "features"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="features", 
outputCol="hashes", bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+print("The hashed dataset where hashed values are stored in the column 
'values':")
+model.transform(dfA).show()
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+print("Approximately joining dfA and dfB on distance smaller than 
1.5:")
+model.approxSimilarityJoin(dfA, dfB, 1.5)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+print("Joining cached datasets to avoid recomputing the hash values:")
+model.approxSimilarityJoin(transformedA, transformedB, 1.5)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+
+# Self Join
+print("Approximately self join of dfB on distance smaller than 2.5:")
+model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < 
datasetB.id")\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+
+# Approximate nearest neighbor search
+print("Approximately searching dfA for 2 nearest neighbors of the 
key:")
+model.approxNearestNeighbors(dfA, key, 2).show()
+print("Searching cached dataset to avoid recomputing the hash values:")
--- End diff --

Still think this is confusing here too. Maybe we could leave comments:

```python
# compute the locality-sensitive hashes for the input rows, then perform approximate nearest neighbor search.
# we could avoid computing hashes by passing in the already-transformed dataframe, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`
model.approxNearestNeighbors(dfA, key, 2).show()
```


This seems better to me because before you'd have to read the code to see 
the difference between the two cases anyway. Now, those who read the code can 
_still_ see the difference from the comment, and those who just run the example 
from the command line will not be confused or waste time trying to understand 
why we output the exact same table twice with different explanations. 
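Concretely, the two calls would then read something like this (a sketch, assuming the example's `model`, `dfA`, `transformedA`, and `key` variables):

```python
# hashes are computed on the fly here, because dfA does not yet have the "hashes" output column
model.approxNearestNeighbors(dfA, key, 2).show()

# hashes are reused here, because transformedA = model.transform(dfA).cache() already contains them
model.approxNearestNeighbors(transformedA, key, 2).show()
```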





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100422395
  
--- Diff: 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "features"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "features"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="features", 
outputCol="hashes", bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+print("The hashed dataset where hashed values are stored in the column 
'values':")
+model.transform(dfA).show()
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+print("Approximately joining dfA and dfB on distance smaller than 
1.5:")
+model.approxSimilarityJoin(dfA, dfB, 1.5)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+print("Joining cached datasets to avoid recomputing the hash values:")
+model.approxSimilarityJoin(transformedA, transformedB, 1.5)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+
+# Self Join
+print("Approximately self join of dfB on distance smaller than 2.5:")
+model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < 
datasetB.id")\
--- End diff --

I really do not think this is necessary. If you can do an approximate similarity 
join between two datasets, it will hopefully be obvious that you can do a self 
join. This is also confusing, since the name of the input is `dfA` in both 
cases, but the filter and select logic reference a `datasetB` and a `datasetA`. 
At first I thought it was wrong; then I realized that those are the default 
names, but that's not clear unless you read the API docs.
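To make the naming point concrete, a small sketch (assuming the fitted `model` and `dfA` from this example): the joined result always exposes the two sides as struct columns named "datasetA" and "datasetB" (the method's defaults), no matter what the input variables are called:

```python
joined = model.approxSimilarityJoin(dfA, dfA, 2.5)
joined.printSchema()
# roughly:
# root
#  |-- datasetA: struct (dfA's columns)
#  |-- datasetB: struct (dfA's columns)
#  |-- distCol: double
joined.filter("datasetA.id < datasetB.id") \
    .select("datasetA.id", "datasetB.id", "distCol").show()
```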





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100427531
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,196 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash values in the same
+dimension are calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(0, Vectors.dense([-1.0, -1.0 ]),),
+...

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100427045
  
--- Diff: examples/src/main/python/ml/min_hash_lsh_example.py ---
@@ -0,0 +1,85 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh_example.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "features"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "features"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="features", outputCol="hashes", 
numHashTables=5)
+model = mh.fit(dfA)
+
+# Feature Transformation
+print("The hashed dataset where hashed values are stored in the column 
'values':")
+model.transform(dfA).show()
+
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+print("Approximately joining dfA and dfB on distance smaller than 
0.6:")
+model.approxSimilarityJoin(dfA, dfB, 0.6)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+print("Joining cached datasets to avoid recomputing the hash values:")
+model.approxSimilarityJoin(transformedA, transformedB, 0.6)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+
+# Self Join
+print("Approximately self join of dfB on distance smaller than 0.6:")
--- End diff --

reword as above.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100421903
  
--- Diff: 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "features"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "features"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="features", 
outputCol="hashes", bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+print("The hashed dataset where hashed values are stored in the column 
'values':")
+model.transform(dfA).show()
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+print("Approximately joining dfA and dfB on distance smaller than 
1.5:")
+model.approxSimilarityJoin(dfA, dfB, 1.5)\
+.select("datasetA.id", "datasetB.id", "distCol").show()
+print("Joining cached datasets to avoid recomputing the hash values:")
--- End diff --

I see what we're going for here, but this is pretty confusing. I think we should 
somehow make it clearer that if the datasets we pass into this function already 
contain the output column (the "hashes" column), then it won't have to be 
recomputed. I spent a bit of time here just trying to discern the difference 
between these two lines. 
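Concretely, something like the following contrast (a sketch, assuming the example's `model`, `dfA`, `dfB`, `transformedA`, and `transformedB`):

```python
# dfA and dfB do not have the "hashes" output column, so the join computes it internally
model.approxSimilarityJoin(dfA, dfB, 1.5) \
    .select("datasetA.id", "datasetB.id", "distCol").show()

# transformedA and transformedB already contain the cached "hashes" column, so it is reused
model.approxSimilarityJoin(transformedA, transformedB, 1.5) \
    .select("datasetA.id", "datasetB.id", "distCol").show()
```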





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100426756
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java 
---
@@ -44,25 +45,67 @@ public static void main(String[] args) {
   .getOrCreate();
 
 // $example on$
-List<Row> data = Arrays.asList(
+List<Row> dataA = Arrays.asList(
   RowFactory.create(0, Vectors.sparse(6, new int[]{0, 1, 2}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 4}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(2, Vectors.sparse(6, new int[]{0, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
 );
 
+List<Row> dataB = Arrays.asList(
+  RowFactory.create(0, Vectors.sparse(6, new int[]{1, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(2, Vectors.sparse(6, new int[]{1, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
+);
+
 StructType schema = new StructType(new StructField[]{
   new StructField("id", DataTypes.IntegerType, false, 
Metadata.empty()),
-  new StructField("keys", new VectorUDT(), false, Metadata.empty())
+  new StructField("features", new VectorUDT(), false, Metadata.empty())
 });
-Dataset<Row> dataFrame = spark.createDataFrame(data, schema);
+Dataset<Row> dfA = spark.createDataFrame(dataA, schema);
+Dataset<Row> dfB = spark.createDataFrame(dataB, schema);
+
+int[] indicies = {1, 3};
+double[] values = {1.0, 1.0};
+Vector key = Vectors.sparse(6, indicies, values);
 
 MinHashLSH mh = new MinHashLSH()
-  .setNumHashTables(1)
-  .setInputCol("keys")
-  .setOutputCol("values");
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes");
+
+MinHashLSHModel model = mh.fit(dfA);
+
+// Feature Transformation
+System.out.println("The hashed dataset where hashed values are stored 
in the column 'values':");
--- End diff --

not values anymore





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100426237
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java
 ---
@@ -71,25 +71,32 @@ public static void main(String[] args) {
 BucketedRandomProjectionLSH mh = new BucketedRandomProjectionLSH()
   .setBucketLength(2.0)
   .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values");
+  .setInputCol("features")
+  .setOutputCol("hashes");
 
 BucketedRandomProjectionLSHModel model = mh.fit(dfA);
 
 // Feature Transformation
+System.out.println("The hashed dataset where hashed values are stored 
in the column 'values':");
 model.transform(dfA).show();
 // Cache the transformed columns
 Dataset<Row> transformedA = model.transform(dfA).cache();
 Dataset<Row> transformedB = model.transform(dfB).cache();
 
 // Approximate similarity join
+System.out.println("Approximately joining dfA and dfB on distance 
smaller than 1.5:");
 model.approxSimilarityJoin(dfA, dfB, 1.5).show();
+System.out.println("Joining cached datasets to avoid recomputing the 
hash values:");
 model.approxSimilarityJoin(transformedA, transformedB, 1.5).show();
+
 // Self Join
+System.out.println("Approximately self join of dfB on distance smaller 
than 2.5:");
--- End diff --

I don't think the self join is necessary, but I'll defer to others. Also, it 
should say "Approximate self join", and it's dfA, not dfB.

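If the self join stays, it would read roughly like this (sketched in Python for brevity, assuming the example's `model` and `dfA`):

```python
print("Approximate self join of dfA on distance smaller than 2.5:")
model.approxSimilarityJoin(dfA, dfA, 2.5) \
    .filter("datasetA.id < datasetB.id") \
    .select("datasetA.id", "datasetB.id", "distCol").show()
```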




[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100427633
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +947,101 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])` 
means there are 10 elements
+in the space. This set contains elements 2, 3, and 5. Also, any input 
vector must have at
+least 1 non-zero index, and all non-zero values are treated as binary 
"1" values.
+
+.. seealso:: `Wikipedia on MinHash 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["id", "features"])
+>>> mh = MinHashLSH(inputCol="features", outputCol="hashes", 
seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
hashes=[DenseVector([-1638925...
+>>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["id", "features"])
+>>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
+>>> model.approxNearestNeighbors(df2, key, 1).collect()
+[Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), 
hashes=[DenseVector([-163892...
+>>> model.approxSimilarityJoin(df, df2, 0.6).select("datasetA.id",
--- End diff --

Same as above, except distCol could be "jaccardSimilarity" or something.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100426683
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java 
---
@@ -44,25 +45,67 @@ public static void main(String[] args) {
   .getOrCreate();
 
 // $example on$
-List<Row> data = Arrays.asList(
+List<Row> dataA = Arrays.asList(
   RowFactory.create(0, Vectors.sparse(6, new int[]{0, 1, 2}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 4}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(2, Vectors.sparse(6, new int[]{0, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
 );
 
+List<Row> dataB = Arrays.asList(
+  RowFactory.create(0, Vectors.sparse(6, new int[]{1, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(2, Vectors.sparse(6, new int[]{1, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
+);
+
 StructType schema = new StructType(new StructField[]{
   new StructField("id", DataTypes.IntegerType, false, 
Metadata.empty()),
-  new StructField("keys", new VectorUDT(), false, Metadata.empty())
+  new StructField("features", new VectorUDT(), false, Metadata.empty())
 });
-Dataset<Row> dataFrame = spark.createDataFrame(data, schema);
+Dataset<Row> dfA = spark.createDataFrame(dataA, schema);
+Dataset<Row> dfB = spark.createDataFrame(dataB, schema);
+
+int[] indicies = {1, 3};
--- End diff --

"indices"





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100428492
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java 
---
@@ -44,25 +45,67 @@ public static void main(String[] args) {
   .getOrCreate();
 
 // $example on$
-List<Row> data = Arrays.asList(
+List<Row> dataA = Arrays.asList(
   RowFactory.create(0, Vectors.sparse(6, new int[]{0, 1, 2}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 4}, new 
double[]{1.0, 1.0, 1.0})),
   RowFactory.create(2, Vectors.sparse(6, new int[]{0, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
 );
 
+List<Row> dataB = Arrays.asList(
+  RowFactory.create(0, Vectors.sparse(6, new int[]{1, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(1, Vectors.sparse(6, new int[]{2, 3, 5}, new 
double[]{1.0, 1.0, 1.0})),
+  RowFactory.create(2, Vectors.sparse(6, new int[]{1, 2, 4}, new 
double[]{1.0, 1.0, 1.0}))
+);
+
 StructType schema = new StructType(new StructField[]{
   new StructField("id", DataTypes.IntegerType, false, 
Metadata.empty()),
-  new StructField("keys", new VectorUDT(), false, Metadata.empty())
+  new StructField("features", new VectorUDT(), false, Metadata.empty())
 });
-Dataset<Row> dataFrame = spark.createDataFrame(data, schema);
+Dataset<Row> dfA = spark.createDataFrame(dataA, schema);
+Dataset<Row> dfB = spark.createDataFrame(dataB, schema);
+
+int[] indicies = {1, 3};
+double[] values = {1.0, 1.0};
+Vector key = Vectors.sparse(6, indicies, values);
 
 MinHashLSH mh = new MinHashLSH()
-  .setNumHashTables(1)
-  .setInputCol("keys")
-  .setOutputCol("values");
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes");
+
+MinHashLSHModel model = mh.fit(dfA);
+
+// Feature Transformation
+System.out.println("The hashed dataset where hashed values are stored 
in the column 'values':");
+model.transform(dfA).show();
+// Cache the transformed columns
+Dataset<Row> transformedA = model.transform(dfA).cache();
+Dataset<Row> transformedB = model.transform(dfB).cache();
+
+// Approximate similarity join
+System.out.println("Approximately joining dfA and dfB on distance 
smaller than 0.6:");
+model.approxSimilarityJoin(dfA, dfB, 0.6)
--- End diff --

Can we make it "exactDistance" or "euclideanDistance" and "jaccardSimilarity" 
here and in all the examples, for random projection and MinHash respectively? 
I think it will be much clearer to the user what distCol represents.
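One possible shape, sketched in Python against this example (assuming its `model`, `dfA`, and `dfB`; the column name below is only a suggestion, and `approxSimilarityJoin` already accepts a `distCol` argument per the API docstring):

```python
model.approxSimilarityJoin(dfA, dfB, 0.6, distCol="JaccardDistance") \
    .select("datasetA.id", "datasetB.id", "JaccardDistance").show()
```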





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100420489
  
--- Diff: 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "features"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "features"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="features", 
outputCol="hashes", bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+print("The hashed dataset where hashed values are stored in the column 
'values':")
--- End diff --

no values column anymore





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100426020
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java
 ---
@@ -71,25 +71,32 @@ public static void main(String[] args) {
 BucketedRandomProjectionLSH mh = new BucketedRandomProjectionLSH()
   .setBucketLength(2.0)
   .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values");
+  .setInputCol("features")
+  .setOutputCol("hashes");
 
 BucketedRandomProjectionLSHModel model = mh.fit(dfA);
 
 // Feature Transformation
+System.out.println("The hashed dataset where hashed values are stored 
in the column 'values':");
--- End diff --

it's not values anymore





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100199037
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... 

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100198559
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... 

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192059
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
--- End diff --

Done.
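For reference, a minimal usage sketch of the mixin's getter/setter (assuming the `MinHashLSH` estimator added in this PR and an active SparkSession):

```python
from pyspark.ml.feature import MinHashLSH
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LSHParamsSketch").getOrCreate()

mh = MinHashLSH(inputCol="features", outputCol="hashes")
mh.setNumHashTables(5)        # more hash tables: lower false negative rate, slower transform
print(mh.getNumHashTables())  # 5
```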





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193058
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
--- End diff --

Fixed
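For reference, the corrected Python form (instead of the Scala-style `Array[...]` flagged above), as a quick standalone check:

```python
from pyspark.ml.linalg import Vectors

# a vector in a 10-element space containing elements 2, 3 and 5
v = Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])
print(v)  # (10,[2,3,5],[1.0,1.0,1.0])
```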





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193020
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+Also, any input vector must have at least 1 non-zero indices, and all 
non-zero values
+are treated as binary "1" values.
+
+.. seealso:: `MinHash `_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["keys"])
+>>> mh = MinHashLSH(inputCol="keys", outputCol="values", seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(keys=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
values=[DenseVector([-1638925712.0])])
+>>> data2 = [(Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["keys"])
+>>> key = Vectors.sparse(6, [1], [1.0])
+>>> model.approxNearestNeighbors(df2, key, 
1).select("distCol").head()[0]
+0.6...
+>>> model.approxSimilarityJoin(df, df2, 
1.0).select("distCol").head()[0]
+0.5
+>>> mhPath = temp_path + "/mh"
+>>> mh.save(mhPath)
+>>> mh2 = MinHashLSH.load(mhPath)
+>>> mh2.getOutputCol() == mh.getOutputCol()
+True
+>>> modelPath = temp_path + "/mh-model"
+>>> model.save(modelPath)
+>>> model2 = MinHashLSHModel.load(modelPath)
+
+.. versionadded:: 2.2.0
+"""
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+__init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+"""
+super(MinHashLSH, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.MinHashLSH", self.uid)
+self._setDefault(numHashTables=1)
+kwargs = self.__init__._input_kwargs
+self.setParams(**kwargs)
+
+@keyword_only
+@since("2.2.0")
+def setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+Sets params for this MinHashLSH.
+"""
+kwargs = self.setParams._input_kwargs
+return self._set(**kwargs)
+
+def _create_model(self, java_model):
+return MinHashLSHModel(java_model)
+
+
+class MinHashLSHModel(JavaModel, LSHModel, JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Model produced by :py:class:`MinHashLSH`, where multiple hash functions are stored. Each
+hash function is picked from the following family of hash functions, 
where :math:`a_i` and
+:math:`b_i` are randomly chosen integers less than prime:
+:math:`h_i(x) = ((x \cdot a_i + b_i) \mod prime)` This hash family is 
approximately min-wise
+independent according to the reference.
+
+.. seealso:: Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise 
independent linear \
+permutations." Electronic Journal of Combinatorics 7 (2000): R26.
+
+.. versionadded:: 2.2.0
+"""
+
+@property
+@since("2.2.0")
+def randCoefficients(self):
--- End diff --

Removed
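Purely to illustrate the hash family described in the docstring above, :math:`h_i(x) = ((x \cdot a_i + b_i) \mod prime)`, taking the minimum over a set's elements as MinHash does; the prime and coefficients below are made up for illustration only:

```python
prime = 2038074743                  # hypothetical large prime
a_i, b_i = 123456789, 987654321     # hypothetical random coefficients, both less than prime

active_indices = [0, 1, 2]          # indices of the non-zero entries of a binary input vector
# the MinHash value for this hash function is the minimum of h_i over the set's elements
min_hash = min((x * a_i + b_i) % prime for x in active_indices)
print(min_hash)
```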





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193043
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192985
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
--- End diff --

That was a mistake. Sorry about it!





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192933
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values", 
bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done for Scala/Java/Python Examples.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192881
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192685
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
+
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+# Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+
+# Approximate nearest neighbor search
+model.approxNearestNeighbors(dfA, key, 2).show()
--- End diff --

Increased the number of hash tables (numHashTables).
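
For reference, the gist of the change in the Python example is roughly the following (a sketch only; the exact parameter value used in the merged example is an assumption here):

    # More hash tables lower the false-negative rate of the approximate
    # join/search, at the cost of some extra computation.
    mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=5)
    model = mh.fit(dfA)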





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192402
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
--- End diff --

Done in the Scala/Java docs as well.



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192347
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192333
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192298
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192314
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192074
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192026
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
--- End diff --

It's not alphabetized here because the declaration order matters for the PySpark shell.
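
For context: Python resolves base classes at the time the `class` statement executes, so a mixin has to appear above anything that extends it. A minimal illustration (not the actual feature.py code):

    class LSHParams(object):        # must be defined first
        pass

    class MinHashLSH(LSHParams):    # LSHParams is looked up when this statement runs,
        pass                        # so reordering these two breaks the module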





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99940031
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values", 
bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Other examples typically print a few explanatory statements along with the output, so you can tell what you're seeing. As it is, this example just spits out a bunch of dataframes with no explanation. I'd like us to add that here, and for the Scala examples as well.
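
Something along these lines for the Python example, say (just a sketch of the kind of messages meant, reusing the names from the example above; not necessarily the exact wording to merge):

    print("Hashed dataset, with the MinHash values in the 'values' column:")
    model.transform(dfA).show()

    print("Pairs from dfA and dfB with Jaccard distance below 0.6:")
    model.approxSimilarityJoin(dfA, dfB, 0.6).show()

    print("Two approximate nearest neighbors of the key in dfA:")
    model.approxNearestNeighbors(dfA, key, 2).show()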





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99943935
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
+
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+# Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+
+# Approximate nearest neighbor search
+model.approxNearestNeighbors(dfA, key, 2).show()
--- End diff --

These two output empty dataframes.
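
A rough back-of-the-envelope for why (my own sketch, assuming a pair survives the approximate join only if at least one of the numHashTables MinHash values collides, and that a single MinHash collision happens with probability equal to the Jaccard similarity s):

    # Chance that a candidate pair is considered at all, as a function of the
    # number of hash tables L.
    def hit_probability(s, L):
        return 1.0 - (1.0 - s) ** L

    print(hit_probability(0.33, 3))   # ~0.70 with 3 tables
    print(hit_probability(0.33, 5))   # ~0.87 with 5 tables

With only a few tables and low-similarity inputs, coming back empty is plausible.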





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99940201
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
--- End diff --

The file names usually end with "_example". Did we skip that here because the name is already so long? I slightly prefer sticking with the convention.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99943986
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Same comment about print statements here.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99912784
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... 

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99923657
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... 

[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99901401
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
--- End diff --

"Hash values in the same dimension are calculated..."



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99929663
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
--- End diff --

Add the note that's in the Scala doc?





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99909930
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
--- End diff --

Make this inherit `JavaModel`, and then have the model subclasses just inherit from it.
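
Roughly, the suggested hierarchy (a sketch using the names from this diff, not the final code; approxSimilarityJoin omitted for brevity):

    from pyspark.ml.util import JavaMLReadable, JavaMLWritable
    from pyspark.ml.wrapper import JavaModel

    class LSHModel(JavaModel):
        """Mixin for LSH models."""

        def approxNearestNeighbors(self, dataset, key, numNearestNeighbors,
                                   distCol="distCol"):
            return self._call_java("approxNearestNeighbors", dataset, key,
                                   numNearestNeighbors, distCol)

    # Concrete models then only add the persistence mixins:
    class MinHashLSHModel(LSHModel, JavaMLReadable, JavaMLWritable):
        """Model produced by MinHashLSH."""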





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99871800
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
--- End diff --

This notation is not correct. For Python it should be `Vectors.sparse(10, 
[(2, 1.0), (3, 1.0), (5, 1.0)])`.
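
For what it's worth, `pyspark.ml.linalg` accepts a few equivalent spellings (quick sanity check, runnable without a SparkSession):

    from pyspark.ml.linalg import Vectors

    # All three build the same 10-dimensional vector with 1.0 at indices 2, 3, 5.
    v1 = Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])   # list of (index, value) pairs
    v2 = Vectors.sparse(10, [2, 3, 5], [1.0, 1.0, 1.0])       # index list + value list
    v3 = Vectors.sparse(10, {2: 1.0, 3: 1.0, 5: 1.0})         # dict of index -> value
    assert v1 == v2 == v3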





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99871507
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
--- End diff --

"two datasets"





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99928739
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+Also, any input vector must have at least 1 non-zero indices, and all 
non-zero values
+are treated as binary "1" values.
+
+.. seealso:: `MinHash `_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["keys"])
+>>> mh = MinHashLSH(inputCol="keys", outputCol="values", seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(keys=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
values=[DenseVector([-1638925712.0])])
+>>> data2 = [(Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["keys"])
+>>> key = Vectors.sparse(6, [1], [1.0])
+>>> model.approxNearestNeighbors(df2, key, 
1).select("distCol").head()[0]
+0.6...
+>>> model.approxSimilarityJoin(df, df2, 
1.0).select("distCol").head()[0]
+0.5
+>>> mhPath = temp_path + "/mh"
+>>> mh.save(mhPath)
+>>> mh2 = MinHashLSH.load(mhPath)
+>>> mh2.getOutputCol() == mh.getOutputCol()
+True
+>>> modelPath = temp_path + "/mh-model"
+>>> model.save(modelPath)
+>>> model2 = MinHashLSHModel.load(modelPath)
+
+.. versionadded:: 2.2.0
+"""
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+__init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+"""
+super(MinHashLSH, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.MinHashLSH", self.uid)
+self._setDefault(numHashTables=1)
+kwargs = self.__init__._input_kwargs
+self.setParams(**kwargs)
+
+@keyword_only
+@since("2.2.0")
+def setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+Sets params for this MinHashLSH.
+"""
+kwargs = self.setParams._input_kwargs
+return self._set(**kwargs)
+
+def _create_model(self, java_model):
+return MinHashLSHModel(java_model)
+
+
+class MinHashLSHModel(JavaModel, LSHModel, JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Model produced by :py:class:`MinHashLSH`, where multiple hash functions are stored. Each
+hash function is picked from the following family of hash functions, 
where :math:`a_i` and
+:math:`b_i` are randomly chosen integers less than prime:
+:math:`h_i(x) = ((x \cdot a_i + b_i) \mod prime)` This hash family is 
approximately min-wise
+independent according to the reference.
+
+.. seealso:: Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise 
independent linear \
+permutations." Electronic Journal of Combinatorics 7 (2000): R26.
+
+.. versionadded:: 2.2.0
+"""
+
+@property
+@since("2.2.0")
+def randCoefficients(self):
--- End diff --

This should not be exposed, since it's private in Scala. Also, `Array[(Int, 
Int)]` does not serialize to Python.
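
As background for the comment above: the coefficients in question parameterize the hash family quoted in the docstring. A toy sketch of that family follows; the prime and the two coefficients below are made up for illustration and are not what Spark draws, which is part of why keeping them private (and unexposed in Python) is reasonable:

    # one MinHash function h_i(x) = ((x * a_i + b_i) % prime)
    prime = 2038074743            # an arbitrary large prime, chosen here only for the example
    a_i, b_i = 12345, 67890       # stand-ins for the randomly drawn coefficients

    def min_hash(active_indices):
        # the signature entry for a set is the minimum of h_i over the set's elements
        return min((x * a_i + b_i) % prime for x in active_indices)

    print(min_hash([0, 1, 2]))    # hash of the set {0, 1, 2}
    print(min_hash([0, 2, 4]))    # hash of the set {0, 2, 4}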



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99929977
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
--- End diff --

The classes in this file are alphabetized for the most part. Let's keep the 
convention here.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99870133
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
--- End diff --

space after "Hashing"





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99870243
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
--- End diff --

space here too
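
Aside from the wording nit, the numHashTables parameter quoted in this hunk is the one knob shared by both LSH estimators. A small sketch of how the mixin's setter and getter would surface on the concrete class, assuming the MinHashLSH estimator proposed in this PR (column names are illustrative):

    from pyspark.ml.feature import MinHashLSH

    mh = MinHashLSH(inputCol="keys", outputCol="hashes")
    mh.setNumHashTables(5)         # more tables -> lower false negative rate, more work per row
    print(mh.getNumHashTables())   # 5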





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99872069
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
--- End diff --

Can you match the Scala doc here and below?
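
For what it's worth, the Scala-style literal in the quoted doc would read as follows in Python, which may be worth spelling out when the docs are aligned (a small sketch; the universe size 10 and the set {2, 3, 5} are just the example from the docstring):

    from pyspark.ml.linalg import Vectors

    # the set {2, 3, 5} drawn from a universe of 10 elements, encoded as a binary sparse vector
    v = Vectors.sparse(10, [2, 3, 5], [1.0, 1.0, 1.0])
    print(v)   # (10,[2,3,5],[1.0,1.0,1.0])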





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r99870914
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
--- End diff --

Can we just leave single probing out since it has no effect and we aren't including it in the doc?
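
A sketch of what the trimmed wrapper could look like if the flag is dropped, assuming the Scala side exposes an overload of approxNearestNeighbors without the probing argument (this is only an illustration of the suggestion, not the final signature):

    @since("2.2.0")
    def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, distCol="distCol"):
        """
        Given a large dataset and an item, approximately find at most k items
        which have the closest distance to the item.
        """
        return self._call_java("approxNearestNeighbors", dataset, key,
                               numNearestNeighbors, distCol)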





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-01-26 Thread Yunni
GitHub user Yunni opened a pull request:

https://github.com/apache/spark/pull/16715

[Spark-18080][ML] Python API & Examples for Locality Sensitive Hashing

## What changes were proposed in this pull request?
This pull request adds the Python API and examples for LSH. The API changes are based on @yanboliang's PR #15768, with conflicts resolved and the wrappers updated to match the current Scala API. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`
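
For anyone who wants to poke at the new API without running the bundled example scripts, a compressed sketch of what the bucketed random projection example exercises (the data, bucketLength, and column names here are illustrative, not copied from the example file):

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("brp_lsh_sketch").getOrCreate()

    df = spark.createDataFrame([(0, Vectors.dense([1.0, 1.0])),
                                (1, Vectors.dense([1.0, -1.0])),
                                (2, Vectors.dense([-1.0, -1.0]))],
                               ["id", "keys"])

    brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="hashes",
                                      bucketLength=2.0, numHashTables=3)
    model = brp.fit(df)

    model.transform(df).show()                                              # hashed features
    model.approxNearestNeighbors(df, Vectors.dense([1.0, 0.0]), 2).show()   # 2 nearest rows

    spark.stop()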

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Yunni/spark spark-18080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16715.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16715


commit 85d22c37d3fe0b907f2eaf892729d087f9efb76c
Author: Yanbo Liang 
Date:   2016-11-04T14:22:23Z

Locality Sensitive Hashing (LSH) Python API.

commit cdeca1cdd8ed61274137c3012ba49ff57d459190
Author: Yanbo Liang 
Date:   2016-11-04T14:44:52Z

Fix typos.

commit 66d308bb6d5d254057b0de9217f87391f269aaed
Author: Yun Ni 
Date:   2017-01-25T21:11:57Z

Merge branch 'spark-18080' of https://github.com/yanboliang/spark into 
spark-18080

commit d62a2d0d6cdd1e4cb0626bacfe389274db42a11c
Author: Yun Ni 
Date:   2017-01-26T00:59:15Z

Merge branch 'master' of https://github.com/apache/spark into spark-18080

commit dafc4d120c0606ccd2be892fb2618a1df676ccd3
Author: Yun Ni 
Date:   2017-01-26T01:23:53Z

Changes to fix LSH Python API

commit ac1f4f7190192a3ee6fd8a311a0036e1546e4592
Author: Yunni 
Date:   2017-01-26T05:08:47Z

Merge branch 'spark-18080' of https://github.com/Yunni/spark into 
spark-18080

commit 3a21f2666c907d6d520771b4343af7d877d689bb
Author: Yunni 
Date:   2017-01-26T07:20:12Z

Fix examples and class definition

commit 65dab3ec32f423936f2cb310bbfbc312ece8ac54
Author: Yun Ni 
Date:   2017-01-26T20:19:22Z

Add python examples and updated the user guide



