[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-06-08 Thread huaxingao
Github user huaxingao closed the pull request at:

https://github.com/apache/spark/pull/21119


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184874390
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,205 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:attr:`idCol`: vertex ID
+     - :py:attr:`neighborsCol`: neighbors of vertex in :py:attr:`idCol`
+     - :py:attr:`similaritiesCol`: non-negative weights (similarities) of edges between the
+       vertex in :py:attr:`idCol` and each neighbor in :py:attr:`neighborsCol`
+
+    PIC returns a cluster assignment for each input vertex.  It appends a new column
+    :py:attr:`predictionCol` containing the cluster assignment in :py:attr:`[0,k)` for
+    each row (vertex).
+
+    .. note::
+
+     - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
+       Transform runs the iterative PIC algorithm to cluster the whole input dataset.
+     - Input validation: This validates that similarities are non-negative but does NOT validate
+       that the input matrix is symmetric.
+
+    .. seealso:: <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
+        Spectral clustering (Wikipedia)</a>
--- End diff --

You can check how other places use `seealso`:

```python
.. seealso:: `Spectral clustering \
    <http://en.wikipedia.org/wiki/Spectral_clustering>`_
```


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184839152
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,204 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:class:`idCol`: vertex ID
+     - :py:class:`neighborsCol`: neighbors of vertex in :py:class:`idCol`
+     - :py:class:`similaritiesCol`: non-negative weights (similarities) of edges between the
+       vertex in :py:class:`idCol` and each neighbor in :py:class:`neighborsCol`
+
+    PIC returns a cluster assignment for each input vertex.  It appends a new column
+    :py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
+    each row (vertex).
+
+    Notes:
+
+     - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
+       Transform runs the iterative PIC algorithm to cluster the whole input dataset.
+     - Input validation: This validates that similarities are non-negative but does NOT validate
+       that the input matrix is symmetric.
+
+    @see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
--- End diff --

Use `.. seealso::`?
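
For example (a sketch of the suggested change, reusing the URL from the quoted line):

```python
.. seealso:: `Spectral clustering \
    <http://en.wikipedia.org/wiki/Spectral_clustering>`_
```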


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184839158
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,204 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:class:`idCol`: vertex ID
+     - :py:class:`neighborsCol`: neighbors of vertex in :py:class:`idCol`
+     - :py:class:`similaritiesCol`: non-negative weights (similarities) of edges between the
+       vertex in :py:class:`idCol` and each neighbor in :py:class:`neighborsCol`
+
+    PIC returns a cluster assignment for each input vertex.  It appends a new column
+    :py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
+    each row (vertex).
+
+    Notes:
--- End diff --

Use `.. note::`?
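
For example (a sketch of the suggested change; the bullet content stays the same):

```python
.. note::

 - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
   Transform runs the iterative PIC algorithm to cluster the whole input dataset.
 - Input validation: This validates that similarities are non-negative but does NOT validate
   that the input matrix is symmetric.
```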


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184839128
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,204 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:class:`idCol`: vertex ID
--- End diff --

```:py:attr:`idCol` ```? And likewise for the ones below: ```:py:class:`neighborsCol` ```, etc.
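
For example, the column list would become (a sketch of the suggested change):

```python
 - :py:attr:`idCol`: vertex ID
 - :py:attr:`neighborsCol`: neighbors of vertex in :py:attr:`idCol`
 - :py:attr:`similaritiesCol`: non-negative weights (similarities) of edges between the
   vertex in :py:attr:`idCol` and each neighbor in :py:attr:`neighborsCol`
```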


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184838981
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,204 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:class:`idCol`: vertex ID
+     - :py:class:`neighborsCol`: neighbors of vertex in :py:class:`idCol`
+     - :py:class:`similaritiesCol`: non-negative weights (similarities) of edges between the
+       vertex in :py:class:`idCol` and each neighbor in :py:class:`neighborsCol`
+
+    PIC returns a cluster assignment for each input vertex.  It appends a new column
+    :py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
+    each row (vertex).
+
+    Notes:
+
+     - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
+       Transform runs the iterative PIC algorithm to cluster the whole input dataset.
+     - Input validation: This validates that similarities are non-negative but does NOT validate
+       that the input matrix is symmetric.
+
+    @see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
+    Spectral clustering (Wikipedia)</a>
+
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> similarities = [((long)(1), [0], [0.5]), ((long)(2), [0, 1], [0.7,0.5]), \
+                        ((long)(3), [0, 1, 2], [0.9, 0.7, 0.5]), \
+                        ((long)(4), [0, 1, 2, 3], [1.1, 0.9, 0.7,0.5]), \
+                        ((long)(5), [0, 1, 2, 3, 4], [1.3, 1.1, 0.9, 0.7,0.5])]
+    >>> rdd = sc.parallelize(similarities, 2)
+    >>> schema = StructType([StructField("id", LongType(), False), \
+                 StructField("neighbors", ArrayType(LongType(), False), True), \
+                 StructField("similarities", ArrayType(DoubleType(), False), True)])
+    >>> df = spark.createDataFrame(rdd, schema)
+    >>> pic = PowerIterationClustering()
+    >>> result = pic.setK(2).setMaxIter(10).transform(df)
+    >>> predictions = sorted(set([(i[0], i[1]) for i in result.select(result.id, result.prediction)
+    ... .collect()]), key=lambda x: x[0])
+    >>> predictions[0]
+    (1, 1)
+    >>> predictions[1]
+    (2, 1)
+    >>> predictions[2]
+    (3, 0)
+    >>> predictions[3]
+    (4, 0)
+    >>> predictions[4]
+    (5, 0)
+    >>> pic_path = temp_path + "/pic"
+    >>> pic.save(pic_path)
+    >>> pic2 = PowerIterationClustering.load(pic_path)
+    >>> pic2.getK()
+    2
+    >>> pic2.getMaxIter()
+    10
+    >>> pic3 = PowerIterationClustering(k=4, initMode="degree")
+    >>> pic3.getIdCol()
+    'id'
+    >>> pic3.getK()
+    4
+    >>> pic3.getMaxIter()
+    20
+    >>> pic3.getInitMode()
+    'degree'
+
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +

[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184838934
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,204 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+@inherit_doc
+class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
+                               JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
+    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
+    PIC finds a very low-dimensional embedding of a dataset using truncated power
+    iteration on a normalized pair-wise similarity matrix of the data.
+
+    PIC takes an affinity matrix between items (or vertices) as input.  An affinity matrix
+    is a symmetric matrix whose entries are non-negative similarities between items.
+    PIC takes this matrix (or graph) as an adjacency matrix.  Specifically, each input row
+    includes:
+
+     - :py:class:`idCol`: vertex ID
+     - :py:class:`neighborsCol`: neighbors of vertex in :py:class:`idCol`
+     - :py:class:`similaritiesCol`: non-negative weights (similarities) of edges between the
+       vertex in :py:class:`idCol` and each neighbor in :py:class:`neighborsCol`
+
+    PIC returns a cluster assignment for each input vertex.  It appends a new column
+    :py:class:`predictionCol` containing the cluster assignment in :py:class:`[0,k)` for
+    each row (vertex).
+
+    Notes:
+
+     - [[PowerIterationClustering]] is a transformer with an expensive [[transform]] operation.
+       Transform runs the iterative PIC algorithm to cluster the whole input dataset.
+     - Input validation: This validates that similarities are non-negative but does NOT validate
+       that the input matrix is symmetric.
+
+    @see <a href=http://en.wikipedia.org/wiki/Spectral_clustering>
+    Spectral clustering (Wikipedia)</a>
+
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> similarities = [((long)(1), [0], [0.5]), ((long)(2), [0, 1], [0.7,0.5]), \
+                        ((long)(3), [0, 1, 2], [0.9, 0.7, 0.5]), \
+                        ((long)(4), [0, 1, 2, 3], [1.1, 0.9, 0.7,0.5]), \
+                        ((long)(5), [0, 1, 2, 3, 4], [1.3, 1.1, 0.9, 0.7,0.5])]
+    >>> rdd = sc.parallelize(similarities, 2)
+    >>> schema = StructType([StructField("id", LongType(), False), \
+                 StructField("neighbors", ArrayType(LongType(), False), True), \
+                 StructField("similarities", ArrayType(DoubleType(), False), True)])
+    >>> df = spark.createDataFrame(rdd, schema)
+    >>> pic = PowerIterationClustering()
+    >>> result = pic.setK(2).setMaxIter(10).transform(df)
+    >>> predictions = sorted(set([(i[0], i[1]) for i in result.select(result.id, result.prediction)
+    ... .collect()]), key=lambda x: x[0])
+    >>> predictions[0]
+    (1, 1)
+    >>> predictions[1]
+    (2, 1)
+    >>> predictions[2]
+    (3, 0)
+    >>> predictions[3]
+    (4, 0)
+    >>> predictions[4]
+    (5, 0)
+    >>> pic_path = temp_path + "/pic"
+    >>> pic.save(pic_path)
+    >>> pic2 = PowerIterationClustering.load(pic_path)
+    >>> pic2.getK()
+    2
+    >>> pic2.getMaxIter()
+    10
+    >>> pic3 = PowerIterationClustering(k=4, initMode="degree")
+    >>> pic3.getIdCol()
+    'id'
+    >>> pic3.getK()
+    4
+    >>> pic3.getMaxIter()
+    20
+    >>> pic3.getInitMode()
+    'degree'
+
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +

[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184838848
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala ---
@@ -97,13 +97,15 @@ private[clustering] trait PowerIterationClusteringParams extends Params with Has
   def getNeighborsCol: String = $(neighborsCol)
 
   /**
-   * Param for the name of the input column for neighbors in the adjacency list representation.
+   * Param for the name of the input column for non-negative weights (similarities) of edges
+   * between the vertex in `idCol` and each neighbor in `neighborsCol`.
--- End diff --

Good catch!


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184809072
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
+        """
+        return self.getOrDefault(self.k)
+
+    @since("2.4.0")
+    def getInitMode(self):
+        """
+        Gets the value of `initMode`
+        """
+        return self.getOrDefault(self.initMode)
+
+    @since("2.4.0")
+    def getIdCol(self):
+        """
+        Gets the value of `idCol`
+        """
+        return self.getOrDefault(self.idCol)
+
+    @since("2.4.0")
+    def getNeighborsCol(self):
+        """
+        Gets the value of `neighborsCol`
+        """
+        return self.getOrDefault(self.neighborsCol)
+
+    @since("2.4.0")
+    def getSimilaritiesCol(self):
+        """
+        Gets the value of `similaritiesCol`
+        """
+        return self.getOrDefault(self.similaritiesCol)
+
+
+@inherit_doc
+class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    Model produced by [[PowerIterationClustering]].
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> import math
+    >>> def genCircle(r, n):
+    ...     points = []
+    ...     for i in range(0, n):
+    ...         theta = 2.0 * math.pi * i / n
+    ...         points.append((r * math.cos(theta), r * math.sin(theta)))
+    ...     return points
+    >>> def sim(x, y):
+    ...     dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1])
+    ...     return math.exp(-dist / 2.0)
+    >>> r1 = 1.0
+    >>> n1 = 10
+    >>> r2 = 4.0
+    >>> n2 = 40
+    >>> n = n1 + n2
+    >>> points = genCircle(r1, n1) + genCircle(r2, n2)
+    >>> similarities = []
+    >>> for i in range (1, n):
+    ...     neighbor = []
+    ...     weight = []
+    ...     for j in range (i):
+    ...         neighbor.append((long)(j))
+    ...         weight.append(sim(points[i], points[j]))
+    ...     similarities.append([(long)(i), neighbor, weight])
--- End diff --

@WeichenXu123 I will move this to tests, and add a simple example in the 
doctest. 


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-27 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184808830
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
--- End diff --

@WeichenXu123 Thanks for your review. The params can live either inside class 
PowerIterationClustering or in a separate params class. I will move them back inside class 
PowerIterationClustering, to be consistent with the params in the other clustering classes.
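
A sketch of what that would look like (the same `Param` declarations, just moved into the transformer class; the remaining params follow the same pattern):

```python
@inherit_doc
class PowerIterationClustering(HasMaxIter, HasPredictionCol, JavaTransformer, JavaParams,
                               JavaMLReadable, JavaMLWritable):

    k = Param(Params._dummy(), "k",
              "The number of clusters to create. Must be > 1.",
              typeConverter=TypeConverters.toInt)
    idCol = Param(Params._dummy(), "idCol",
                  "Name of the input column for vertex IDs.",
                  typeConverter=TypeConverters.toString)
    # initMode, neighborsCol, and similaritiesCol are declared the same way
```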


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184343934
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
+        """
+        return self.getOrDefault(self.k)
+
+    @since("2.4.0")
+    def getInitMode(self):
+        """
+        Gets the value of `initMode`
+        """
+        return self.getOrDefault(self.initMode)
+
+    @since("2.4.0")
+    def getIdCol(self):
+        """
+        Gets the value of `idCol`
+        """
+        return self.getOrDefault(self.idCol)
+
+    @since("2.4.0")
+    def getNeighborsCol(self):
+        """
+        Gets the value of `neighborsCol`
+        """
+        return self.getOrDefault(self.neighborsCol)
+
+    @since("2.4.0")
+    def getSimilaritiesCol(self):
+        """
+        Gets the value of `similaritiesCol`
+        """
+        return self.getOrDefault(self.similaritiesCol)
+
+
+@inherit_doc
+class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    Model produced by [[PowerIterationClustering]].
--- End diff --

The doc is wrong. Copy the doc from the Scala side.
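
That is, the docstring should describe the transformer itself rather than a model, roughly along these lines (a sketch adapted from the doc quoted elsewhere in this thread):

```python
"""
.. note:: Experimental

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
Lin and Cohen. PIC finds a very low-dimensional embedding of a dataset using truncated
power iteration on a normalized pair-wise similarity matrix of the data.
"""
```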


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184342231
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
--- End diff --

Why not add the params directly into class `PowerIterationClustering`?


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184346287
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
+        """
+        return self.getOrDefault(self.k)
+
+    @since("2.4.0")
+    def getInitMode(self):
+        """
+        Gets the value of `initMode`
+        """
+        return self.getOrDefault(self.initMode)
+
+    @since("2.4.0")
+    def getIdCol(self):
+        """
+        Gets the value of `idCol`
+        """
+        return self.getOrDefault(self.idCol)
+
+    @since("2.4.0")
+    def getNeighborsCol(self):
+        """
+        Gets the value of `neighborsCol`
+        """
+        return self.getOrDefault(self.neighborsCol)
+
+    @since("2.4.0")
+    def getSimilaritiesCol(self):
+        """
+        Gets the value of `similaritiesCol`
+        """
+        return self.getOrDefault(self.similaritiesCol)
+
+
+@inherit_doc
+class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    Model produced by [[PowerIterationClustering]].
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> import math
+    >>> def genCircle(r, n):
+    ...     points = []
+    ...     for i in range(0, n):
+    ...         theta = 2.0 * math.pi * i / n
+    ...         points.append((r * math.cos(theta), r * math.sin(theta)))
+    ...     return points
+    >>> def sim(x, y):
+    ...     dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1])
+    ...     return math.exp(-dist / 2.0)
+    >>> r1 = 1.0
+    >>> n1 = 10
+    >>> r2 = 4.0
+    >>> n2 = 40
+    >>> n = n1 + n2
+    >>> points = genCircle(r1, n1) + genCircle(r2, n2)
+    >>> similarities = []
+    >>> for i in range (1, n):
+    ...     neighbor = []
+    ...     weight = []
+    ...     for j in range (i):
+    ...         neighbor.append((long)(j))
+    ...         weight.append(sim(points[i], points[j]))
+    ...     similarities.append([(long)(i), neighbor, weight])
--- End diff --

The doctest code looks too long; it may be more appropriate to put it in examples.
Could you replace the data generation code here with a simple hardcoded dataset?
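
For example, a hardcoded dataset could be as small as this (a sketch; it is essentially the dataset the doctest in the updated revision of this PR ends up using):

```python
>>> similarities = [((long)(1), [0], [0.5]), ((long)(2), [0, 1], [0.7, 0.5]),
...                 ((long)(3), [0, 1, 2], [0.9, 0.7, 0.5]),
...                 ((long)(4), [0, 1, 2, 3], [1.1, 0.9, 0.7, 0.5]),
...                 ((long)(5), [0, 1, 2, 3, 4], [1.3, 1.1, 0.9, 0.7, 0.5])]
```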


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184344777
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
+        """
+        return self.getOrDefault(self.k)
+
+    @since("2.4.0")
+    def getInitMode(self):
+        """
+        Gets the value of `initMode`
+        """
+        return self.getOrDefault(self.initMode)
+
+    @since("2.4.0")
+    def getIdCol(self):
+        """
+        Gets the value of `idCol`
+        """
+        return self.getOrDefault(self.idCol)
+
+    @since("2.4.0")
+    def getNeighborsCol(self):
+        """
+        Gets the value of `neighborsCol`
+        """
+        return self.getOrDefault(self.neighborsCol)
+
+    @since("2.4.0")
+    def getSimilaritiesCol(self):
+        """
+        Gets the value of `similaritiesCol`
+        """
+        return self.getOrDefault(self.similaritiesCol)
+
+
+@inherit_doc
+class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    Model produced by [[PowerIterationClustering]].
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> import math
+    >>> def genCircle(r, n):
+    ...     points = []
+    ...     for i in range(0, n):
+    ...         theta = 2.0 * math.pi * i / n
+    ...         points.append((r * math.cos(theta), r * math.sin(theta)))
+    ...     return points
+    >>> def sim(x, y):
+    ...     dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1])
+    ...     return math.exp(-dist / 2.0)
+    >>> r1 = 1.0
+    >>> n1 = 10
+    >>> r2 = 4.0
+    >>> n2 = 40
+    >>> n = n1 + n2
+    >>> points = genCircle(r1, n1) + genCircle(r2, n2)
+    >>> similarities = []
+    >>> for i in range (1, n):
+    ...     neighbor = []
+    ...     weight = []
+    ...     for j in range (i):
+    ...         neighbor.append((long)(j))
+    ...         weight.append(sim(points[i], points[j]))
+    ...     similarities.append([(long)(i), neighbor, weight])
+    >>> rdd = sc.parallelize(similarities, 2)
+    >>> schema = StructType([StructField("id", LongType(), False), \
+                 StructField("neighbors", ArrayType(LongType(), False), True), \
+                 StructField("similarities", ArrayType(DoubleType(), False), True)])
+    >>> df = spark.createDataFrame(rdd, schema)
+    >>> pic = PowerIterationClustering()
+    >>> result = pic.setK(2).setMaxIter(40).transform(df)
+    >>> predictions = sorted(set([(i[0], i[1]) for i in result.select(result.id, result.prediction)
+    ... .collect()]), key=lambda x: x[0])
+    >>> predictions[0]
+    (1, 1)
+    >>> predictions[8]
+    (9, 1)

[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184344901
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
+        """
+        return self.getOrDefault(self.k)
+
+    @since("2.4.0")
+    def getInitMode(self):
+        """
+        Gets the value of `initMode`
+        """
+        return self.getOrDefault(self.initMode)
+
+    @since("2.4.0")
+    def getIdCol(self):
+        """
+        Gets the value of `idCol`
+        """
+        return self.getOrDefault(self.idCol)
+
+    @since("2.4.0")
+    def getNeighborsCol(self):
+        """
+        Gets the value of `neighborsCol`
+        """
+        return self.getOrDefault(self.neighborsCol)
+
+    @since("2.4.0")
+    def getSimilaritiesCol(self):
+        """
+        Gets the value of `similaritiesCol`
+        """
+        return self.getOrDefault(self.similaritiesCol)
+
+
+@inherit_doc
+class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable,
+                               JavaMLWritable):
+    """
+    Model produced by [[PowerIterationClustering]].
+    >>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType
+    >>> import math
+    >>> def genCircle(r, n):
+    ...     points = []
+    ...     for i in range(0, n):
+    ...         theta = 2.0 * math.pi * i / n
+    ...         points.append((r * math.cos(theta), r * math.sin(theta)))
+    ...     return points
+    >>> def sim(x, y):
+    ...     dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1])
+    ...     return math.exp(-dist / 2.0)
+    >>> r1 = 1.0
+    >>> n1 = 10
+    >>> r2 = 4.0
+    >>> n2 = 40
+    >>> n = n1 + n2
+    >>> points = genCircle(r1, n1) + genCircle(r2, n2)
+    >>> similarities = []
+    >>> for i in range (1, n):
+    ...     neighbor = []
+    ...     weight = []
+    ...     for j in range (i):
+    ...         neighbor.append((long)(j))
+    ...         weight.append(sim(points[i], points[j]))
+    ...     similarities.append([(long)(i), neighbor, weight])
+    >>> rdd = sc.parallelize(similarities, 2)
+    >>> schema = StructType([StructField("id", LongType(), False), \
+                 StructField("neighbors", ArrayType(LongType(), False), True), \
+                 StructField("similarities", ArrayType(DoubleType(), False), True)])
+    >>> df = spark.createDataFrame(rdd, schema)
+    >>> pic = PowerIterationClustering()
+    >>> result = pic.setK(2).setMaxIter(40).transform(df)
+    >>> predictions = sorted(set([(i[0], i[1]) for i in result.select(result.id, result.prediction)
+    ... .collect()]), key=lambda x: x[0])
+    >>> predictions[0]
+    (1, 1)
+    >>> predictions[8]
+    (9, 1)

[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-26 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21119#discussion_r184345688
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self):
 return self.getOrDefault(self.keepLastCheckpoint)
 
 
+class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol):
+    """
+    Params for :py:attr:`PowerIterationClustering`.
+    .. versionadded:: 2.4.0
+    """
+
+    k = Param(Params._dummy(), "k",
+              "The number of clusters to create. Must be > 1.",
+              typeConverter=TypeConverters.toInt)
+    initMode = Param(Params._dummy(), "initMode",
+                     "The initialization algorithm. This can be either " +
+                     "'random' to use a random vector as vertex properties, or 'degree' to use " +
+                     "a normalized sum of similarities with other vertices.  Supported options: " +
+                     "'random' and 'degree'.",
+                     typeConverter=TypeConverters.toString)
+    idCol = Param(Params._dummy(), "idCol",
+                  "Name of the input column for vertex IDs.",
+                  typeConverter=TypeConverters.toString)
+    neighborsCol = Param(Params._dummy(), "neighborsCol",
+                         "Name of the input column for neighbors in the adjacency list " +
+                         "representation.",
+                         typeConverter=TypeConverters.toString)
+    similaritiesCol = Param(Params._dummy(), "similaritiesCol",
+                            "Name of the input column for non-negative weights (similarities) " +
+                            "of edges between the vertex in `idCol` and each neighbor in " +
+                            "`neighborsCol`",
+                            typeConverter=TypeConverters.toString)
+
+    @since("2.4.0")
+    def getK(self):
+        """
+        Gets the value of `k`
--- End diff --

Should use :py:attr:`k` here, and update everywhere else accordingly.
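
For example (a sketch of the suggested change):

```python
@since("2.4.0")
def getK(self):
    """
    Gets the value of :py:attr:`k`
    """
    return self.getOrDefault(self.k)
```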


---




[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...

2018-04-20 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/21119

[SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC

## What changes were proposed in this pull request?

Add a spark.ml Python API for PIC (Power Iteration Clustering).
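
A minimal usage sketch of the new API, adapted from the doctest (assuming a DataFrame `df` with `id`, `neighbors`, and `similarities` columns):

```python
>>> pic = PowerIterationClustering()
>>> result = pic.setK(2).setMaxIter(10).transform(df)
>>> result.select(result.id, result.prediction).show()
```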

## How was this patch tested?

Added a doctest.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark_19826

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21119.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21119


commit 53d7763b58d05a6baf9fcf1cef2ae327a5d42e04
Author: Huaxin Gao 
Date:   2018-04-21T04:15:37Z

[SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC




---
