[GitHub] spark issue #22873: [SPARK-25866][ML] Update KMeans formatVersion
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22873 LGTM. Thanks!
[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22764 I think this can target 3.0, since 2.4 will be released soon and this PR looks a little complex and needs some time to be checked carefully.
[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22764#discussion_r228527916 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala --- @@ -225,13 +227,14 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { assert(formatVersion == thisFormatVersion) val rootId = (metadata \ "rootId").extract[Int] val distanceMeasure = (metadata \ "distanceMeasure").extract[String] + val trainingCost = (metadata \ "trainingCost").extract[Double] --- End diff -- OK. After https://github.com/apache/spark/pull/22790 is merged, I think this PR can work.
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22790#discussion_r228215468 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala --- @@ -109,7 +109,7 @@ class BisectingKMeansModel private[clustering] ( @Since("2.0.0") override def save(sc: SparkContext, path: String): Unit = { -BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path) +BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path) } override protected def formatVersion: String = "1.0" --- End diff -- Oh, good catch! This needs to change.
[GitHub] spark issue #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22790 LGTM.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r227193764 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,113 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). --- End diff -- Yes. This keeps it consistent with the other ML algorithm pages.
[GitHub] spark issue #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for classNameV2_...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22793 LGTM pending Jenkins pass.
[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22764#discussion_r226826701 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala --- @@ -225,13 +227,14 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { assert(formatVersion == thisFormatVersion) val rootId = (metadata \ "rootId").extract[Int] val distanceMeasure = (metadata \ "distanceMeasure").extract[String] + val trainingCost = (metadata \ "trainingCost").extract[Double] --- End diff -- @mgaido91 (I haven't tested this, so correct me if I am wrong.) > I don't see (and think) this change breaks backwards compatibility for mllib. I am suspicious of this line: ``` val trainingCost = (metadata \ "trainingCost").extract[Double] ``` When loading a `BisectingKMeansModel` saved by an old Spark version, the metadata does not contain the "trainingCost" field, so I guess this line will throw an error. (Otherwise, what would it return?)
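For context, a minimal sketch (not from the PR itself) of why the `extract` call above is a compatibility concern, assuming the metadata is parsed with json4s as the mllib loaders do; the JSON snippet is illustrative and runnable in spark-shell, where json4s is already on the classpath:

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// Metadata as an older Spark version would have written it: no "trainingCost" field.
val metadata = parse("""{"rootId": 1, "distanceMeasure": "euclidean"}""")

// The pattern under discussion: extract[Double] on a missing field throws a
// MappingException instead of returning a default value.
// val trainingCost = (metadata \ "trainingCost").extract[Double]

// A defensive alternative: extractOpt yields None when the field is absent,
// letting the loader fall back to a sentinel or skip the field entirely.
val trainingCost: Option[Double] = (metadata \ "trainingCost").extractOpt[Double]
```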
[GitHub] spark pull request #22780: [DOC][MINOR] Fix minor error in the code of graph...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22780 [DOC][MINOR] Fix minor error in the code of graphx guide ## What changes were proposed in this pull request? Fix minor error in the code "sketch of pregel implementation" of GraphX guide ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark minor_doc_update1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22780.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22780 commit 35c77a5ef7b493ac0973ce0af6a66cbdbbbc749b Author: WeichenXu Date: 2018-10-20T02:52:38Z init pr
[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22764#discussion_r226812121 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala --- @@ -225,13 +227,14 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { assert(formatVersion == thisFormatVersion) val rootId = (metadata \ "rootId").extract[Int] val distanceMeasure = (metadata \ "distanceMeasure").extract[String] + val trainingCost = (metadata \ "trainingCost").extract[Double] --- End diff -- @mgaido91 What about this approach? In the `ml` reader, we can load `trainingCost`, construct the `mllib.clustering.BisectingKMeansModel` passing the `trainingCost` argument, and then construct the `ml.clustering.BisectingKMeansModel` that wraps that `mllib` model. ** Then in the `ml` reader we can check the Spark major/minor version to keep backwards compatibility; this is the more important point. ** ** In the `mllib` loader we can skip loading `trainingCost`: since `mllib` is deprecated we can forgo adding new features there, but we cannot break backwards compatibility (the current change does break it). **
[GitHub] spark pull request #22764: [SPARK-25765][ML] Add training cost to BisectingK...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22764#discussion_r226520698 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala --- @@ -225,13 +227,14 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] { assert(formatVersion == thisFormatVersion) val rootId = (metadata \ "rootId").extract[Int] val distanceMeasure = (metadata \ "distanceMeasure").extract[String] + val trainingCost = (metadata \ "trainingCost").extract[Double] --- End diff -- - Could you avoid modifying the model loading code in the "mllib" package and instead modify the code in the "ml" package, i.e., the class `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`? You can reference the `KMeans` code: `ml.clustering.KMeansModel.KMeansModelReader`. - And, +1 to what @viirya mentioned: we should keep model loading compatibility, so add a version check (when >= 2.4) before loading "training cost". Note that these should go in `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader` (see the sketch after this comment). - And, could you also add a version check (when >= 2.4) before loading "training cost" in `ml.clustering.KMeansModel.KMeansModelReader`?
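Purely as an illustration of the suggested version check, not the code actually used in Spark: the helper `loadTrainingCost`, its arguments, and the metadata handling below are all assumptions made up for this sketch.

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

// Hypothetical helper: only look for "trainingCost" when the model was written
// by Spark 2.4 or later, since older metadata does not carry the field.
def loadTrainingCost(metadataJson: String, sparkVersion: String): Option[Double] = {
  implicit val formats: Formats = DefaultFormats
  val metadata = parse(metadataJson)

  // Split "major.minor.patch" into the leading major/minor numbers.
  val Array(major, minor) = sparkVersion.split("\\.").take(2).map(_.toInt)

  if (major > 2 || (major == 2 && minor >= 4)) {
    (metadata \ "trainingCost").extractOpt[Double]
  } else {
    None // models saved before 2.4 have no training cost to load
  }
}
```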
[GitHub] spark issue #22763: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans examp...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22763 LGTM. thanks!
[GitHub] spark issue #22756: [SPARK-25758][ML] Deprecate computeCost on BisectingKMea...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22756 LGTM. thanks!
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226379993 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,90 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources --- End diff -- Data sources.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226161636 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- added. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226161623 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. --- End diff -- added. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22675#discussion_r226161557 --- Diff: docs/ml-datasource.md --- @@ -0,0 +1,49 @@ +--- +layout: global +title: Data sources +displayTitle: Data sources +--- + +In this section, we introduce how to use data source in ML to load data. +Beside some general data sources like Parquet, CSV, JSON, JDBC, we also provide some specific data source for ML. + +**Table of Contents** + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Image data source + +This image data source is used to load image files from a directory. +The loaded DataFrame has one StructType column: "image". containing image data stored as image schema. + + + +[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource) +implements a Spark SQL data source API for loading image data as a DataFrame. + +{% highlight scala %} +scala> spark.read.format("image").load("data/mllib/images/origin") +res1: org.apache.spark.sql.DataFrame = [image: struct] +{% endhighlight %} + + + +[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html) +implements Spark SQL data source API for loading image data as DataFrame. +
{% highlight java %} +Dataset imagesDF = spark.read().format("image").load("data/mllib/images/origin"); +{% endhighlight %} + + + --- End diff -- This looks like a SQL feature and fits all datasources. Putting it in the Spark SQL doc would be better.
[GitHub] spark issue #22675: [SPARK-25347][ML][DOC] Spark datasource for image/libsvm...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22675 This does not block the 2.4 release, but merging before 2.4 would be better.
[GitHub] spark pull request #22675: [SPARK-25347][ML][DOC] Spark datasource for image...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22675 [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: ![8ba0b2c3-7a44-4154-8ecf-d001f94d1ca0](https://user-images.githubusercontent.com/19235986/46644290-42cabc00-cbb2-11e8-978e-996f12ef9405.png) Java: ![4c6ffe71-1268-4890-886f-263f61f25519](https://user-images.githubusercontent.com/19235986/46644298-49f1ca00-cbb2-11e8-9af9-17409a8d29e2.png) Python: ![2ec95d0f-478b-4bd6-8815-16640bbea5ab](https://user-images.githubusercontent.com/19235986/46644301-4fe7ab00-cbb2-11e8-989f-1280582f5358.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark add_image_source_doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22675.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22675 commit 887f5282fba8a8a0bcbb9242eb87b27bf94d0210 Author: WeichenXu Date: 2018-10-09T02:54:09Z init --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22618: [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaki...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22618 LGTM. Thanks!
[GitHub] spark pull request #22492: [SPARK-25321][ML] Revert SPARK-14681 to avoid API...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/22492
[GitHub] spark pull request #22510: [SPARK-25321][ML] Fix local LDA model constructor
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22510 [SPARK-25321][ML] Fix local LDA model constructor ## What changes were proposed in this pull request? Change the constructor back to: ``` class LocalLDAModel private[ml] ( uid: String, vocabSize: Int, private[clustering] val oldLocalModel : OldLocalLDAModel, sparkSession: SparkSession) ``` Although it is marked `private[ml]`, it is used in `mleap`, and the change on master breaks the `mleap` build. ## How was this patch tested? Manual. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark LDA_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22510.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22510 commit 2b2fdaf3f7598fe31161fdd4401728d6b314bbfe Author: WeichenXu Date: 2018-09-21T03:03:30Z init pr
[GitHub] spark issue #22492: [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaki...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22492 @mengxr Should this be put into master?
[GitHub] spark pull request #22492: [SPARK-25321][ML] Revert SPARK-14681 to avoid API...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22492 [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaking change ## What changes were proposed in this pull request? Revert SPARK-14681 to avoid an API-breaking change; PR [SPARK-14681] will break mleap. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark revert_tree_change Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22492 commit d976a7e8646a61bfb76f417f11557f25cc9e Author: WeichenXu Date: 2018-09-20T11:56:04Z Revert "[SPARK-14681][ML] Provide label/impurity stats for spark.ml decision tree nodes" This reverts commit 252468a744b95082400ba9e8b2e3b3d9d50ab7fa.
[GitHub] spark pull request #22449: [SPARK-22666][ML][FOLLOW-UP] Return a correctly f...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22449 [SPARK-22666][ML][FOLLOW-UP] Return a correctly formatted URI for invalid images ## What changes were proposed in this pull request? Change the URI returned in ImageFileFormat for an invalid image row. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark image_url Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22449.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22449 commit a91f0a8cdb98fa6bdb1438a425df859250c53356 Author: WeichenXu Date: 2018-09-18T10:18:32Z init pr
[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22360 Do we need to set `distanceMeasure` again for the parent model? When the parent model is created, it will use the same `distanceMeasure` as the one used in training.
[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22349#discussion_r216117396 --- Diff: python/pyspark/ml/image.py --- @@ -207,6 +207,9 @@ def readImages(self, path, recursive=False, numPartitions=-1, .. note:: If sample ratio is less than 1, sampling uses a PathFilter that is efficient but potentially non-deterministic. +.. note:: Deprecated in 2.4.0. Use `spark.read.format("image").load(path)` instead and --- End diff -- added.
[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/22360 @srowen The delegated `mllib.BisectingKMeansModel` is: ``` class BisectingKMeansModel private[clustering] ( private[clustering] val root: ClusteringTreeNode, @Since("2.4.0") val distanceMeasure: String ) extends... ``` Its constructor argument `distanceMeasure` is immutable, so the `setDistanceMeasure` here only changes the wrapper model's param and has no effect.
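A minimal sketch of the situation described above, using simplified stand-in classes (the real Spark classes have more members); it only illustrates why setting the wrapper's param cannot affect the delegated model:

```scala
// Simplified stand-ins for the real classes, for illustration only.
class OldBisectingKMeansModelSketch(val distanceMeasure: String) {
  // Predictions use the immutable constructor value.
  def predict(): String = s"predicting with $distanceMeasure"
}

class WrapperModelSketch(parent: OldBisectingKMeansModelSketch) {
  private var distanceMeasureParam: String = parent.distanceMeasure

  // The setter only mutates the wrapper's own param copy...
  def setDistanceMeasure(value: String): this.type = { distanceMeasureParam = value; this }
  def getDistanceMeasure: String = distanceMeasureParam

  // ...while the actual computation is delegated to the old model,
  // which still uses the value it was constructed with.
  def predict(): String = parent.predict()
}

val wrapper = new WrapperModelSketch(new OldBisectingKMeansModelSketch("euclidean"))
wrapper.setDistanceMeasure("cosine")
println(wrapper.predict()) // still "predicting with euclidean"
```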
[GitHub] spark pull request #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDista...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22360 [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeasure` method ## What changes were proposed in this pull request? Remove the `BisectingKMeansModel.setDistanceMeasure` method. Setting this param on `BisectingKMeansModel` is meaningless. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark bkmeans_update Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22360.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22360 commit 64a1c278cb9fc53b4908b1194a79a82de8805c75 Author: WeichenXu Date: 2018-09-07T09:54:36Z init pr
[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22349 [SPARK-25345][ML] Deprecate public APIs from ImageSchema ## What changes were proposed in this pull request? Deprecate public APIs from ImageSchema. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark image_api_deprecate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22349.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22349 commit 23973218e73cd9ccafe07109cd066eee5aaef3cd Author: WeichenXu Date: 2018-09-06T09:58:14Z init pr
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215200249 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala --- @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +private[image] class ImageOptions( +@transient private val parameters: CaseInsensitiveMap[String]) extends Serializable { + + def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters)) + + val dropImageFailures = parameters.getOrElse("dropImageFailures", "false").toBoolean --- End diff -- because `parameters` is `Map[String, String]` type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
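A tiny sketch of the point above (not taken from the PR): datasource options reach the reader as a string-to-string map, so boolean options have to be parsed from their string form:

```scala
// Options are passed around as Map[String, String], so a boolean flag such as
// "dropImageFailures" is parsed with toBoolean and given a string default.
val parameters: Map[String, String] = Map("dropImageFailures" -> "true")
val dropImageFailures: Boolean = parameters.getOrElse("dropImageFailures", "false").toBoolean
```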
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138998 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +/** + * `image` package implements Spark SQL data source API for loading IMAGE data as `DataFrame`. + * The loaded `DataFrame` has one `StructType` column: `image`. + * The schema of the `image` column is: + * - origin: String (represent the origin of image. If loaded from file, then it is file path) + * - height: Int (height of image) + * - width: Int (width of image) + * - nChannels: Int (number of image channels) + * - mode: Int (OpenCV-compatible type) + * - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + * + * To use IMAGE data source, you need to set "image" as the format in `DataFrameReader` and + * optionally specify the datasource options, for example: + * {{{ + * // Scala + * val df = spark.read.format("image") + * .option("dropImageFailures", "true") + * .load("data/mllib/images/imagesWithPartitions") + * + * // Java + * Dataset df = spark.read().format("image") + * .option("dropImageFailures", "true") + * .load("data/mllib/images/imagesWithPartitions"); + * }}} + * + * IMAGE data source supports the following options: + * - "dropImageFailures": Whether to drop the files that are not valid images from the result. + * + * @note This IMAGE data source does not support "write". + * + * @note This class is public for documentation purpose. Please don't use this class directly. + * Rather, use the data source API as illustrated above. + */ +class ImageDataSource private() {} --- End diff -- for doc. similar to `LibSVMDataSource` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138889 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -567,6 +567,7 @@ object DataSource extends Logging { val parquet = classOf[ParquetFileFormat].getCanonicalName val csv = classOf[CSVFileFormat].getCanonicalName val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat" +val image = "org.apache.spark.ml.source.image.ImageFileFormat" --- End diff -- similar to `libsvm`.
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138862 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import java.nio.file.Paths + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.image.ImageSchema._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.Row +import org.apache.spark.sql.functions.{col, substring_index} + +class ImageFileFormatSuite extends SparkFunSuite with MLlibTestSparkContext { + + // Single column of images named "image" + private lazy val imagePath = "../data/mllib/images/imagesWithPartitions" + + test("image datasource count test") { +val df1 = spark.read.format("image").load(imagePath) +assert(df1.count === 9) + +val df2 = spark.read.format("image").option("dropImageFailures", "true").load(imagePath) +assert(df2.count === 8) + } + + test("image datasource test: read jpg image") { +val df = spark.read.format("image").load(imagePath + "/cls=kittens/date=2018-02/DP153539.jpg") +assert(df.count() === 1) + } + + test("image datasource test: read png image") { +val df = spark.read.format("image").load(imagePath + "/cls=multichannel/date=2018-01/BGRA.png") +assert(df.count() === 1) + } + + test("image datasource test: read non image") { +val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt" +val df = spark.read.format("image").option("dropImageFailures", "true") + .load(filePath) +assert(df.count() === 0) + +val df2 = spark.read.format("image").option("dropImageFailures", "false") + .load(filePath) +assert(df2.count() === 1) +val result = df2.head() +assert(result === invalidImageRow( + Paths.get(filePath).toAbsolutePath().normalize().toUri().toString)) + } + + test("image datasource partition test") { +val result = spark.read.format("image") + .option("dropImageFailures", "true").load(imagePath) + .select(substring_index(col("image.origin"), "/", -1).as("origin"), col("cls"), col("date")) + .collect() + +assert(Set(result: _*) === Set( + Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"), + Row("54893.jpg", "kittens", "2018-02"), + Row("DP153539.jpg", "kittens", "2018-02"), + Row("DP802813.jpg", "kittens", "2018-02"), + Row("BGRA.png", "multichannel", "2018-01"), + Row("BGRA_alpha_60.png", "multichannel", "2018-01"), + Row("chr30.4.184.jpg", "multichannel", "2018-02"), + Row("grayscale.jpg", "multichannel", "2018-02") +)) + } + + // Images with the different number of channels + test("readImages pixel values test") { + +val images = spark.read.format("image").option("dropImageFailures", "true") + .load(imagePath + 
"/cls=multichannel/").collect() + +val firstBytes20Map = images.map { rrow => + val row = rrow.getAs[Row]("image") + val filename = Paths.get(getOrigin(row)).getFileName().toString() + val mode = getMode(row) + val bytes20 = getData(row).slice(0, 20).toList + filename -> Tuple2(mode, bytes20) --- End diff -- yea, `(mode, bytes20)` doesn't work, `filename -> (mode, bytes20)` will be compiled as `filename.->(mode, bytes20)` and `->` receive 2 arguments and compile error occurs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138728 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import java.nio.file.Paths + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.image.ImageSchema._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.Row +import org.apache.spark.sql.functions.{col, substring_index} + +class ImageFileFormatSuite extends SparkFunSuite with MLlibTestSparkContext { + + // Single column of images named "image" + private lazy val imagePath = "../data/mllib/images/imagesWithPartitions" + + test("image datasource count test") { +val df1 = spark.read.format("image").load(imagePath) +assert(df1.count === 9) + +val df2 = spark.read.format("image").option("dropImageFailures", "true").load(imagePath) --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138711 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +/** + * `image` package implements Spark SQL data source API for loading IMAGE data as `DataFrame`. + * The loaded `DataFrame` has one `StructType` column: `image`. + * The schema of the `image` column is: + * - origin: String (represent the origin of image. If loaded from file, then it is file path) + * - height: Int (height of image) + * - width: Int (width of image) + * - nChannels: Int (number of image channels) + * - mode: Int (OpenCV-compatible type) + * - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + * + * To use IMAGE data source, you need to set "image" as the format in `DataFrameReader` and + * optionally specify options, for example: + * {{{ + * // Scala + * val df = spark.read.format("image") + * .option("dropImageFailures", "true") --- End diff -- option API require (k: String, v:String) parameters. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215138174 --- Diff: data/mllib/images/images/license.txt --- @@ -0,0 +1,13 @@ +The images in the folder "kittens" are under the creative commons CC0 license, or no rights reserved: --- End diff -- No worries. There are only a few images, and keeping the old test cases unchanged will help this PR get merged ASAP.
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/22328#discussion_r215135665 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +/** + * `image` package implements Spark SQL data source API for loading IMAGE data as `DataFrame`. + * The loaded `DataFrame` has one `StructType` column: `image`. + * The schema of the `image` column is: + * - origin: String (represent the origin of image. If loaded from file, then it is file path) + * - height: Int (height of image) + * - width: Int (width of image) + * - nChannels: Int (number of image channels) + * - mode: Int (OpenCV-compatible type) + * - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + * + * To use IMAGE data source, you need to set "image" as the format in `DataFrameReader` and + * optionally specify options, for example: --- End diff -- The latter "options" is "datasource options", it is the widely used term. So I prefer to change to "optionally specify the datasource options" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22328: [SPARK-22666][ML][SQL] Spark datasource for image...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/22328 [SPARK-22666][ML][SQL] Spark datasource for image format ## What changes were proposed in this pull request? Implement an image schema datasource. This image datasource supports: - partition discovery (loading partitioned images) - dropImageFailures (the same behavior as `ImageSchema.readImage`) - path wildcard matching (the same behavior as `ImageSchema.readImage`) - loading recursively from a directory (different from `ImageSchema.readImage`; use a path such as `/path/to/dir/**`) This datasource does **NOT** support: - specifying `numPartitions` (it will be determined by the datasource automatically) - sampling (you can use `df.sample` later, but the sampling operator won't be pushed down to the datasource) ## How was this patch tested? Unit tests. ## Benchmark I benchmarked and compared the time cost between the old `ImageSchema.read` API and my image datasource. **cluster**: 4 nodes, each with 64GB memory, 8 cores CPU **test dataset**: Flickr8k_Dataset (about 8091 images) **time cost**: My image datasource time (automatically generated 258 partitions): 38.04s `ImageSchema.read` time (set 16 partitions): 68.4s `ImageSchema.read` time (set 258 partitions): 90.6s **time cost when the image number is doubled (clone Flickr8k_Dataset and load twice as many images)**: My image datasource time (automatically generated 515 partitions): 95.4s `ImageSchema.read` (set 32 partitions): 109s `ImageSchema.read` (set 515 partitions): 105s So we can see that my image datasource implementation (this PR) brings some performance improvement compared against the old `ImageSchema.read` API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark image_datasource Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22328.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22328 commit 5b5aee66b2ea819341b624164298f0700ee07ddf Author: WeichenXu Date: 2018-09-04T09:48:50Z init pr
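A short usage sketch of the datasource described above, assuming a Spark deployment where this PR is available; the path is a placeholder and only the `dropImageFailures` option from this PR is used:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("image-datasource-sketch").getOrCreate()

// Recursive load via the ** wildcard mentioned above; the path is illustrative.
val images = spark.read.format("image")
  .option("dropImageFailures", "true")
  .load("/path/to/dir/**")

images.printSchema() // a single "image" struct column (origin, height, width, nChannels, mode, data)
images.select("image.origin", "image.width", "image.height").show(5, truncate = false)
```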
[GitHub] spark pull request #19666: [SPARK-22451][ML] Reduce decision tree aggregate ...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/19666
[GitHub] spark issue #20446: [SPARK-23254][ML] Add user guide entry and example for D...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20446 @srowen I have already explained the reason I do not use `.show` in my reply here: https://github.com/apache/spark/pull/20446#discussion_r165565121 Thanks!
[GitHub] spark issue #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21513 LGTM. Thanks! @mengxr Would you mind taking a look?
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194214431 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the +PowerIterationClustering algorithm. + +.. seealso:: `Wikipedia on Spectral clustering \ +<http://en.wikipedia.org/wiki/Spectral_clustering>`_ + +>>> from pyspark.sql.types import DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)] +>>> rdd = sc.parallelize(data, 2) +>>> schema = StructType([StructField("src", LongType(), False), \ + StructField("dst", LongType(), True), \ + StructField("weight", DoubleType(), True)]) +>>> df = spark.createDataFrame(rdd, schema) +>>> pic = PowerIterationClustering() +>>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df) +>>> result = sorted(assignments.collect(), key=lambda x: x.id) +>>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster +True +>>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster +True +>>> pic_path = temp_path + "/pic" +>>> pic.save(pic_path) +>>> pic2 = PowerIterationClustering.load(pic_path) +>>> pic2.getK() +2 +>>> pic2.getMaxIter() +40 +>>> assignments2 = pic2.assignClusters(df) +>>> result2 = sorted(assignments2.collect(), key=lambda x: x.id) +>>> result2[0].cluster == result2[1].cluster == result2[2].cluster == result2[3].cluster +True +>>> result2[4].cluster == result2[5].cluster == result2[6].cluster == result2[7].cluster +True +>>> pic3 = PowerIterationClustering(k=4, initMode="degree", srcCol="source", dstCol="dest") +>>> pic3.getSrcCol() +'source' +>>> pic3.getDstCol() +'dest' +>>> pic3.getK() +4 +>>> pic3.getMaxIter() +20 +>>> pic3.getInitMode() +'degree' + +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +srcCol = Param(Params._dummy(), "srcCol", + "Name of the input column for sou
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194214516 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the +PowerIterationClustering algorithm. + +.. seealso:: `Wikipedia on Spectral clustering \ +<http://en.wikipedia.org/wiki/Spectral_clustering>`_ + +>>> from pyspark.sql.types import DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)] +>>> rdd = sc.parallelize(data, 2) +>>> schema = StructType([StructField("src", LongType(), False), \ + StructField("dst", LongType(), True), \ + StructField("weight", DoubleType(), True)]) +>>> df = spark.createDataFrame(rdd, schema) +>>> pic = PowerIterationClustering() +>>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df) +>>> result = sorted(assignments.collect(), key=lambda x: x.id) +>>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster +True +>>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster +True +>>> pic_path = temp_path + "/pic" +>>> pic.save(pic_path) +>>> pic2 = PowerIterationClustering.load(pic_path) +>>> pic2.getK() +2 +>>> pic2.getMaxIter() +40 +>>> assignments2 = pic2.assignClusters(df) +>>> result2 = sorted(assignments2.collect(), key=lambda x: x.id) +>>> result2[0].cluster == result2[1].cluster == result2[2].cluster == result2[3].cluster +True +>>> result2[4].cluster == result2[5].cluster == result2[6].cluster == result2[7].cluster +True +>>> pic3 = PowerIterationClustering(k=4, initMode="degree", srcCol="source", dstCol="dest") +>>> pic3.getSrcCol() +'source' +>>> pic3.getDstCol() +'dest' +>>> pic3.getK() +4 +>>> pic3.getMaxIter() +20 +>>> pic3.getInitMode() +'degree' + +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +srcCol = Param(Params._dummy(), "srcCol", + "Name of the input column for sou
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194214535 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the --- End diff -- ``` ... use :py:func:`assignClusters` method ... ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194214831 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the +PowerIterationClustering algorithm. + +.. seealso:: `Wikipedia on Spectral clustering \ +<http://en.wikipedia.org/wiki/Spectral_clustering>`_ + +>>> from pyspark.sql.types import DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)] +>>> rdd = sc.parallelize(data, 2) +>>> schema = StructType([StructField("src", LongType(), False), \ + StructField("dst", LongType(), True), \ + StructField("weight", DoubleType(), True)]) +>>> df = spark.createDataFrame(rdd, schema) --- End diff -- The test code here is too complex to be in doctest. could you change it to code like: `` df = sc.parallelize(...).toDF(...) `` generate a small, hardcoded dataset. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194215008 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the +PowerIterationClustering algorithm. + +.. seealso:: `Wikipedia on Spectral clustering \ +<http://en.wikipedia.org/wiki/Spectral_clustering>`_ + +>>> from pyspark.sql.types import DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)] +>>> rdd = sc.parallelize(data, 2) +>>> schema = StructType([StructField("src", LongType(), False), \ + StructField("dst", LongType(), True), \ + StructField("weight", DoubleType(), True)]) +>>> df = spark.createDataFrame(rdd, schema) +>>> pic = PowerIterationClustering() +>>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df) +>>> result = sorted(assignments.collect(), key=lambda x: x.id) +>>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster +True +>>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster +True +>>> pic_path = temp_path + "/pic" +>>> pic.save(pic_path) +>>> pic2 = PowerIterationClustering.load(pic_path) +>>> pic2.getK() +2 +>>> pic2.getMaxIter() +40 +>>> assignments2 = pic2.assignClusters(df) +>>> result2 = sorted(assignments2.collect(), key=lambda x: x.id) +>>> result2[0].cluster == result2[1].cluster == result2[2].cluster == result2[3].cluster +True +>>> result2[4].cluster == result2[5].cluster == result2[6].cluster == result2[7].cluster +True --- End diff -- Let's use a simpler way to check result, like: ``` >>> assignments.sort(assignments.id).show(truncate=False) ... ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21513: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21513#discussion_r194167552 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1159,216 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +@inherit_doc +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable, + JavaMLWritable): +""" +.. note:: Experimental + +Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by +http://www.icml2010.org/papers/387.pdf>Lin and Cohen. From the abstract: +PIC finds a very low-dimensional embedding of a dataset using truncated power +iteration on a normalized pair-wise similarity matrix of the data. + +This class is not yet an Estimator/Transformer, use `assignClusters` method to run the +PowerIterationClustering algorithm. + +.. seealso:: `Wikipedia on Spectral clustering \ +<http://en.wikipedia.org/wiki/Spectral_clustering>`_ + +>>> from pyspark.sql.types import StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)] +>>> rdd = sc.parallelize(data, 2) +>>> schema = StructType([StructField("src", LongType(), False), \ + StructField("dst", LongType(), True), \ + StructField("weight", DoubleType(), True)]) +>>> df = spark.createDataFrame(rdd, schema) +>>> pic = PowerIterationClustering() +>>> assignments = pic.setK(2).setMaxIter(40).setWeightCol("weight").assignClusters(df) +>>> result = sorted(assignments.collect(), key=lambda x: x.id) +>>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster +True +>>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster +True +>>> pic_path = temp_path + "/pic" +>>> pic.save(pic_path) +>>> pic2 = PowerIterationClustering.load(pic_path) +>>> pic2.getK() +2 +>>> pic2.getMaxIter() +40 +>>> assignments2 = pic2.assignClusters(df) +>>> result2 = sorted(assignments2.collect(), key=lambda x: x.id) +>>> result2[0].cluster == result2[1].cluster == result2[2].cluster == result2[3].cluster +True +>>> result2[4].cluster == result2[5].cluster == result2[6].cluster == result2[7].cluster +True +>>> pic3 = PowerIterationClustering(k=4, initMode="degree", srcCol="source", dstCol="dest") +>>> pic3.getSrcCol() +'source' +>>> pic3.getDstCol() +'dest' +>>> pic3.getK() +4 +>>> pic3.getMaxIter() +20 +>>> pic3.getInitMode() +'degree' + +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +srcCol = Param(Params._dummy(), "srcCol", + "Name of the input column for source vertex IDs.", +
[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21119 @huaxingao I think creating a new PR is better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21493: [SPARK-15784] Add Power Iteration Clustering to s...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21493 [SPARK-15784] Add Power Iteration Clustering to spark.ml ## What changes were proposed in this pull request? Following the discussion on JIRA, I rewrote the Power Iteration Clustering API in `spark.ml`. ## How was this patch tested? Unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark pic_api Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21493.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21493 commit a605a2dba4243e5f1526bc239fd6dbe88dd13ce9 Author: WeichenXu Date: 2018-06-04T23:12:41Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21265 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r192000596 --- Diff: python/pyspark/ml/fpm.py --- @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", def _create_model(self, java_model): return FPGrowthModel(java_model) + + +class PrefixSpan(JavaParams): +""" +.. note:: Experimental + +A parallel PrefixSpan algorithm to mine frequent sequential patterns. +The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns +Efficiently by Prefix-Projected Pattern Growth +(see http://doi.org/10.1109/ICDE.2001.914830";>here). +This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns` +method to run the PrefixSpan algorithm. + +@see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining";>Sequential Pattern Mining +(Wikipedia) +.. versionadded:: 2.4.0 + +""" + +minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " + + "sequential pattern. Sequential pattern that appears more than " + + "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.", + typeConverter=TypeConverters.toFloat) + +maxPatternLength = Param(Params._dummy(), "maxPatternLength", + "The maximal length of the sequential pattern. Must be > 0.", + typeConverter=TypeConverters.toInt) + +maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize", + "The maximum number of items (including delimiters used in the " + + "internal storage format) allowed in a projected database before " + + "local processing. If a projected database exceeds this size, " + + "another iteration of distributed prefix growth is run. " + + "Must be > 0.", + typeConverter=TypeConverters.toInt) --- End diff -- Just test that python 'int' type range is the same with java 'long' type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191996249 --- Diff: python/pyspark/ml/fpm.py --- @@ -243,3 +244,105 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", def _create_model(self, java_model): return FPGrowthModel(java_model) + + +class PrefixSpan(JavaParams): +""" +.. note:: Experimental + +A parallel PrefixSpan algorithm to mine frequent sequential patterns. +The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns +Efficiently by Prefix-Projected Pattern Growth +(see http://doi.org/10.1109/ICDE.2001.914830";>here). +This class is not yet an Estimator/Transformer, use :py:func:`findFrequentSequentialPatterns` +method to run the PrefixSpan algorithm. + +@see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining";>Sequential Pattern Mining +(Wikipedia) +.. versionadded:: 2.4.0 + +""" + +minSupport = Param(Params._dummy(), "minSupport", "The minimal support level of the " + + "sequential pattern. Sequential pattern that appears more than " + + "(minSupport * size-of-the-dataset) times will be output. Must be >= 0.", + typeConverter=TypeConverters.toFloat) + +maxPatternLength = Param(Params._dummy(), "maxPatternLength", + "The maximal length of the sequential pattern. Must be > 0.", + typeConverter=TypeConverters.toInt) + +maxLocalProjDBSize = Param(Params._dummy(), "maxLocalProjDBSize", + "The maximum number of items (including delimiters used in the " + + "internal storage format) allowed in a projected database before " + + "local processing. If a projected database exceeds this size, " + + "another iteration of distributed prefix growth is run. " + + "Must be > 0.", + typeConverter=TypeConverters.toInt) --- End diff -- There isn't `TypeConverters.toLong`, do I need to add it ? My idea is that `TypeConverters.toInt` also fit to Long type in python side so I do not add it for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
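The claim that `toInt` is enough can be sanity-checked directly; this is an illustrative check against the existing converter, not code from the PR:
```
>>> from pyspark.ml.param import TypeConverters
>>> v = 2 ** 31  # too large for a 32-bit int, but well within a JVM long
>>> TypeConverters.toInt(v) == v
True
```
Because Python ints are arbitrary precision, a value in the JVM long range passes through `toInt` unchanged, which is the behaviour the comment relies on when arguing that a separate `toLong` converter is not strictly needed.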
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r191995667 --- Diff: python/pyspark/ml/fpm.py --- @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", def _create_model(self, java_model): return FPGrowthModel(java_model) + + +class PrefixSpan(object): +""" +.. note:: Experimental + +A parallel PrefixSpan algorithm to mine frequent sequential patterns. +The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns +Efficiently by Prefix-Projected Pattern Growth +(see http://doi.org/10.1109/ICDE.2001.914830";>here). + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def findFrequentSequentialPatterns(dataset, + sequenceCol, + minSupport, + maxPatternLength, + maxLocalProjDBSize): +""" +.. note:: Experimental +Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + +:param dataset: A dataset or a dataframe containing a sequence column which is +`Seq[Seq[_]]` type. +:param sequenceCol: The name of the sequence column in dataset, rows with nulls in this +column are ignored. +:param minSupport: The minimal support level of the sequential pattern, any pattern that + appears more than (minSupport * size-of-the-dataset) times will be + output (recommended value: `0.1`). +:param maxPatternLength: The maximal length of the sequential pattern + (recommended value: `10`). +:param maxLocalProjDBSize: The maximum number of items (including delimiters used in the + internal storage format) allowed in a projected database before + local processing. If a projected database exceeds this size, + another iteration of distributed prefix growth is run + (recommended value: `3200`). +:return: A `DataFrame` that contains columns of sequence and corresponding frequency. + The schema of it will be: + - `sequence: Seq[Seq[T]]` (T is the item type) + - `freq: Long` + +>>> from pyspark.ml.fpm import PrefixSpan +>>> from pyspark.sql import Row +>>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]), --- End diff -- I think it is better to be put in a example. @mengxr What do you think ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
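For context, the quoted doctest boils down to usage like the sketch below, written against the static API as it appears in this diff (the API was later reworked into a class with setters; the second input row and the argument values are illustrative only):
```
>>> from pyspark.sql import Row
>>> from pyspark.ml.fpm import PrefixSpan
>>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
...                      Row(sequence=[[1], [3, 2], [1, 2]])]).toDF()
>>> PrefixSpan.findFrequentSequentialPatterns(df, sequenceCol="sequence",
...     minSupport=0.5, maxPatternLength=5, maxLocalProjDBSize=32000000).show()
```
Whether this lives in the docstring or in an example script, the input itself can stay this small.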
[GitHub] spark issue #21393: [SPARK-20114][ML][FOLLOW-UP] spark.ml parity for sequent...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21393 @mengxr @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21393: [SPARK-20114][ML][FOLLOW-UP] spark.ml parity for ...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21393 [SPARK-20114][ML][FOLLOW-UP] spark.ml parity for sequential pattern mining - PrefixSpan ## What changes were proposed in this pull request? Change `PrefixSpan` into a class with param setters/getters. This addresses the issues mentioned here: https://github.com/apache/spark/pull/20973#discussion_r186931806 ## How was this patch tested? UT. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark fix_prefix_span Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21393 commit 07327869fd0ea1a9dc4f32da8d10bde2ff231770 Author: WeichenXu Date: 2018-05-22T11:52:26Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21163 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188853310 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830";>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining";>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- Sure. Will update soon! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188491670 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830";>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining";>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- this way `final class PrefixSpan(override val uid: String) extends Params` seemingly breaks binary compatibility if later we change it into an estimator ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21129 Jenkins test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use weight...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17086 LGTM. @jkbradley @mengxr Would you mind taking a look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21163 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21274: [SPARK-24213][ML] Fix for Int id type for PowerIteration...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21274 LGTM! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r186994754 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830";>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining";>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- I agree with using setters. @jkbradley What do you think of it ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21274: [SPARK-24213][ML] Fix for Int id type for PowerIt...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21274#discussion_r186986006 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -232,7 +232,7 @@ class PowerIterationClustering private[clustering] ( case _: LongType => uncastPredictions case otherType => --- End diff -- Why not directly use `case _: IntegerType =>` to make a stronger restriction? Or do we need to support other types besides int/long? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21270 @shahidki31 What you described above seems to be a separate issue. You can create another JIRA for it. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide explana...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21272 LGTM! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21129 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21265 [SPARK-24146][PySpark][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## What changes were proposed in this pull request? spark.ml parity for sequential pattern mining - PrefixSpan: Python API ## How was this patch tested? doctests You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark prefix_span_py Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21265.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21265 commit 83eeea1c539d59a4d8496437dcf06d82b43b0ca2 Author: WeichenXu Date: 2018-05-08T05:29:24Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13493 LGTM! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20095: [SPARK-22126][ML] Added fitMultiple method with d...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20095#discussion_r186381507 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -79,7 +82,52 @@ abstract class Estimator[M <: Model[M]] extends PipelineStage { */ @Since("2.0.0") def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { --- End diff -- It seems we should use the following note to deprecate it: """ .. note:: Deprecated in 2.3.0. Use :func:`Estimator.fitMultiple` instead. """ and, like other places, add a warning ``` warnings.warn("Deprecated in 2.3.0. Use Estimator.fitMultiple instead.", DeprecationWarning) ``` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186037589 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { assert(mostImportantFeature !== mostIF) } + test("model evaluateEachIteration") { +for (lossType <- Seq("logistic")) { + val gbt = new GBTClassifier() +.setMaxDepth(2) +.setMaxIter(2) +.setLossType(lossType) + val model = gbt.fit(trainData.toDF) + val eval1 = model.evaluateEachIteration(validationData.toDF) + val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData, --- End diff -- I searched the scikit-learn docs and there is no method similar to `evaluateEachIteration`; we could only use `staged_predict` in `sklearn.ensemble.GradientBoostingRegressor` and then reimplement almost the whole logic. I also did not find such a method in the R package. I have now updated the unit test to compare against hardcoded results. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r185970925 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala --- @@ -423,6 +423,8 @@ class GaussianMixture @Since("2.0.0") ( val summary = new GaussianMixtureSummary(model.transform(dataset), $(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood) model.setSummary(Some(summary)) +instr.logNamedValue("logLikelihood", logLikelihood) +instr.logNamedValue("clusterSizes", summary.clusterSizes.toString) --- End diff -- That also seems OK to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20261 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r185756220 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -378,6 +378,7 @@ class KMeans @Since("1.5.0") ( model.transform(dataset), $(predictionCol), $(featuresCol), $(k)) model.setSummary(Some(summary)) +instr.logNamedValue("clusterSizes", summary.clusterSizes.toString) --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r185756193 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala --- @@ -423,6 +423,8 @@ class GaussianMixture @Since("2.0.0") ( val summary = new GaussianMixtureSummary(model.transform(dataset), $(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood) model.setSummary(Some(summary)) +instr.logNamedValue("logLikelihood", logLikelihood) +instr.logNamedValue("clusterSizes", summary.clusterSizes.toString) --- End diff -- `clusterSizes.toString` will produce an unreadable object-address string; we need to print the contents of the array. I suggest adding a method like: `def logNamedArray[T](array: Array[T]): Unit` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20973 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20261 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20973 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r185149879 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -44,26 +43,37 @@ object PrefixSpan { * * @param dataset A dataset or a dataframe containing a sequence column which is *{{{Seq[Seq[_]]}}} type - * @param sequenceCol the name of the sequence column in dataset + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored * @param minSupport the minimal support level of the sequential pattern, any pattern that * appears more than (minSupport * size-of-the-dataset) times will be output - * (default: `0.1`). - * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears - * less than maxPatternLength will be output (default: `10`). + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the * internal storage format) allowed in a projected database before * local processing. If a projected database exceeds this size, another - * iteration of distributed prefix growth is run (default: `3200`). - * @return A dataframe that contains columns of sequence and corresponding frequency. + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `frequency: Long` --- End diff -- sure! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r184626842 --- Diff: python/pyspark/ml/util.py --- @@ -523,11 +534,29 @@ def getAndSetParams(instance, metadata): """ Extract Params from metadata, and set them in the instance. """ +# User-supplied param values for paramName in metadata['paramMap']: param = instance.getParam(paramName) paramValue = metadata['paramMap'][paramName] instance.set(param, paramValue) +# Default param values +majorAndMinorVersions = majorMinorVersion(metadata['sparkVersion']) +assert majorAndMinorVersions is not None, "Error loading metadata: Expected " + \ +"Spark version string but found {}".format(metadata['sparkVersion']) + +major = majorAndMinorVersions[0] +minor = majorAndMinorVersions[1] +# For metadata file prior to Spark 2.4, there is no default section. +if major > 2 or (major == 2 and minor >= 4): +assert 'defaultParamMap' in metadata, "Error loading metadata: Expected " + \ +"`defaultParamMap` section not found" + +for paramName in metadata['defaultParamMap']: +param = instance.getParam(paramName) +paramValue = metadata['defaultParamMap'][paramName] +instance._setDefault(**{param.name: paramValue}) --- End diff -- remove line `param = instance.getParam(paramName)` and change this line to `instance._setDefault(**{paramName: paramValue})` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
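Put together, the suggested simplification of the default-param loop reads roughly like this (a sketch based on the quoted diff, not the merged code):
```
# Default param values
for paramName in metadata['defaultParamMap']:
    paramValue = metadata['defaultParamMap'][paramName]
    instance._setDefault(**{paramName: paramValue})
```
This works because `_setDefault` resolves each keyword by attribute name on the instance, so the explicit `getParam` lookup is redundant.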
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r184620777 --- Diff: python/pyspark/ml/util.py --- @@ -417,15 +419,24 @@ def _get_metadata_to_save(instance, sc, extraMetadata=None, paramMap=None): """ uid = instance.uid cls = instance.__module__ + '.' + instance.__class__.__name__ -params = instance.extractParamMap() + +# User-supplied param values +params = instance._paramMap jsonParams = {} if paramMap is not None: jsonParams = paramMap else: for p in params: jsonParams[p.name] = params[p] --- End diff -- I think using `_paramMap.copy()` would be simpler. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r184620855 --- Diff: python/pyspark/ml/util.py --- @@ -417,15 +419,24 @@ def _get_metadata_to_save(instance, sc, extraMetadata=None, paramMap=None): """ uid = instance.uid cls = instance.__module__ + '.' + instance.__class__.__name__ -params = instance.extractParamMap() + +# User-supplied param values +params = instance._paramMap jsonParams = {} if paramMap is not None: jsonParams = paramMap else: for p in params: jsonParams[p.name] = params[p] + +# Default param values +jsonDefaultParams = {} +for p in instance._defaultParamMap: +jsonDefaultParams[p.name] = instance._defaultParamMap[p] --- End diff -- Similarly, use `_defaultParamMap.copy()`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use weight...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17086 Overall this looks good. @jkbradley Would you mind taking a look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r184584878 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -55,44 +60,128 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) - assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) -assert(math.abs(metrics.truePositiveRate(0.0) - tpRate0) < delta) -assert(math.abs(metrics.truePositiveRate(1.0) - tpRate1) < delta) -assert(math.abs(metrics.truePositiveRate(2.0) - tpRate2) < delta) -assert(math.abs(metrics.falsePositiveRate(0.0) - fpRate0) < delta) -assert(math.abs(metrics.falsePositiveRate(1.0) - fpRate1) < delta) -assert(math.abs(metrics.falsePositiveRate(2.0) - fpRate2) < delta) -assert(math.abs(metrics.precision(0.0) - precision0) < delta) -assert(math.abs(metrics.precision(1.0) - precision1) < delta) -assert(math.abs(metrics.precision(2.0) - precision2) < delta) -assert(math.abs(metrics.recall(0.0) - recall0) < delta) -assert(math.abs(metrics.recall(1.0) - recall1) < delta) -assert(math.abs(metrics.recall(2.0) - recall2) < delta) -assert(math.abs(metrics.fMeasure(0.0) - f1measure0) < delta) -assert(math.abs(metrics.fMeasure(1.0) - f1measure1) < delta) -assert(math.abs(metrics.fMeasure(2.0) - f1measure2) < delta) -assert(math.abs(metrics.fMeasure(0.0, 2.0) - f2measure0) < delta) -assert(math.abs(metrics.fMeasure(1.0, 2.0) - f2measure1) < delta) -assert(math.abs(metrics.fMeasure(2.0, 2.0) - f2measure2) < delta) +assert(metrics.confusionMatrix.asML ~== confusionMatrix.asML relTol delta) +assert(metrics.truePositiveRate(0.0) ~== tpRate0 absTol delta) +assert(metrics.truePositiveRate(1.0) ~== tpRate1 absTol delta) +assert(metrics.truePositiveRate(2.0) ~== tpRate2 absTol delta) +assert(metrics.falsePositiveRate(0.0) ~== fpRate0 absTol delta) +assert(metrics.falsePositiveRate(1.0) ~== fpRate1 absTol delta) +assert(metrics.falsePositiveRate(2.0) ~== fpRate2 absTol delta) +assert(metrics.precision(0.0) ~== precision0 absTol delta) +assert(metrics.precision(1.0) ~== precision1 absTol delta) +assert(metrics.precision(2.0) ~== precision2 absTol delta) +assert(metrics.recall(0.0) ~== recall0 absTol delta) +assert(metrics.recall(1.0) ~== recall1 absTol delta) +assert(metrics.recall(2.0) ~== recall2 absTol delta) +assert(metrics.fMeasure(0.0) ~== f1measure0 absTol delta) +assert(metrics.fMeasure(1.0) ~== f1measure1 absTol delta) +assert(metrics.fMeasure(2.0) ~== f1measure2 absTol delta) +assert(metrics.fMeasure(0.0, 2.0) ~== f2measure0 absTol delta) +assert(metrics.fMeasure(1.0, 2.0) ~== f2measure1 absTol delta) +assert(metrics.fMeasure(2.0, 2.0) ~== f2measure2 absTol delta) + +assert(metrics.accuracy ~== + (2.0 + 3.0 + 1.0) / ((2 + 3 + 1) + (1 + 1 + 1)) absTol delta) +assert(metrics.accuracy ~== metrics.precision absTol delta) +assert(metrics.accuracy ~== metrics.recall absTol delta) +assert(metrics.accuracy ~== metrics.fMeasure absTol delta) +assert(metrics.accuracy ~== metrics.weightedRecall absTol delta) +val weight0 = 4.0 / 9 +val weight1 = 4.0 / 9 +val weight2 = 1.0 / 9 +assert(metrics.weightedTruePositiveRate ~== + (weight0 * tpRate0 + weight1 * tpRate1 + weight2 * tpRate2) absTol delta) +assert(metrics.weightedFalsePositiveRate ~== + (weight0 * fpRate0 + weight1 * fpRate1 + weight2 * fpRate2) absTol 
delta) +assert(metrics.weightedPrecision ~== + (weight0 * precision0 + weight1 * precision1 + weight2 * precision2) absTol delta) +assert(metrics.weightedRecall ~== + (weight0 * recall0 + weight1 * recall1 + weight2 * recall2) absTol delta) +assert(metrics.weightedFMeasure ~== + (weight0 * f1measure0 + weight1 * f1measure1 + weight2 * f1measure2) absTol delta) +assert(metrics.weightedFMeasure(2.0) ~== + (weight0 * f2measure0 + weight1 * f2measure1 + weight2 * f2measure2) absTol delta) +assert(metrics.labels === labels) + } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 in
[GitHub] spark pull request #17086: [SPARK-24101][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r184566012 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 +val tpRate0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val tpRate1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val tpRate2 = (1.0 * w2) / (1.0 * w2 + 0) +val fpRate0 = (1.0 * w2) / (tw - (2.0 * w1 + 1.0 * w2 + 1.0 * w1)) +val fpRate1 = (1.0 * w2) / (tw - (1.0 * w2 + 2.0 * w1 + 1.0 * w2)) +val fpRate2 = (1.0 * w1) / (tw - (1.0 * w2)) +val precision0 = (2.0 * w1) / (2 * w1 + 1 * w2) +val precision1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val precision2 = (1.0 * w2) / (1 * w1 + 1 * w2) +val recall0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val recall1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val recall2 = (1.0 * w2) / (1.0 * w2 + 0) +val f1measure0 = 2 * precision0 * recall0 / (precision0 + recall0) +val f1measure1 = 2 * precision1 * recall1 / (precision1 + recall1) +val f1measure2 = 2 * precision2 * recall2 / (precision2 + recall2) +val f2measure0 = (1 + 2 * 2) * precision0 * recall0 / (2 * 2 * precision0 + recall0) +val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) +val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) + + assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) --- End diff -- Oh, that's because you use `Matrices` in `mllib`, change it to `Matrices` in `ml`, i.e., `import org.apache.spark.ml.linalg.Matrices` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184342231 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): --- End diff -- Why not add the params directly to the `PowerIterationClustering` class? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184346287 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): +""" +Params for :py:attr:`PowerIterationClustering`. +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +idCol = Param(Params._dummy(), "idCol", + "Name of the input column for vertex IDs.", + typeConverter=TypeConverters.toString) +neighborsCol = Param(Params._dummy(), "neighborsCol", + "Name of the input column for neighbors in the adjacency list " + + "representation.", + typeConverter=TypeConverters.toString) +similaritiesCol = Param(Params._dummy(), "similaritiesCol", +"Name of the input column for non-negative weights (similarities) " + +"of edges between the vertex in `idCol` and each neighbor in " + +"`neighborsCol`", +typeConverter=TypeConverters.toString) + +@since("2.4.0") +def getK(self): +""" +Gets the value of `k` +""" +return self.getOrDefault(self.k) + +@since("2.4.0") +def getInitMode(self): +""" +Gets the value of `initMode` +""" +return self.getOrDefault(self.initMode) + +@since("2.4.0") +def getIdCol(self): +""" +Gets the value of `idCol` +""" +return self.getOrDefault(self.idCol) + +@since("2.4.0") +def getNeighborsCol(self): +""" +Gets the value of `neighborsCol` +""" +return self.getOrDefault(self.neighborsCol) + +@since("2.4.0") +def getSimilaritiesCol(self): +""" +Gets the value of `similaritiesCol` +""" +return self.getOrDefault(self.binary) + + +@inherit_doc +class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable, + JavaMLWritable): +""" +Model produced by [[PowerIterationClustering]]. +>>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> similarities = [] +>>> for i in range (1, n): +...neighbor = [] +...weight = [] +...for j in range (i): +...neighbor.append((long)(j)) +...weight.append(sim(points[i], points[j])) +...similarities.append([(long)(i), neighbor, weight]) --- End diff -- The doctest code looks like too long, maybe more proper to put it in examples. Could you replace the data generation code here by using a simple hardcoded dataset ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
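For example, a hardcoded input in the `id`/`neighbors`/`similarities` schema used by this version of the doctest could be as small as the sketch below (the values are illustrative placeholders, not taken from the PR):
```
>>> data = [(1, [0], [0.5]),
...         (2, [0, 1], [0.7, 0.5]),
...         (3, [0, 1, 2], [0.9, 0.7, 0.5]),
...         (4, [0, 1, 2, 3], [0.1, 0.1, 0.1, 0.1])]
>>> df = spark.createDataFrame(data, ["id", "neighbors", "similarities"])
```
`createDataFrame` infers the `bigint`, `array<bigint>` and `array<double>` column types here, so both the generator functions and the explicit schema can be dropped from the doctest.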
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184343934 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): +""" +Params for :py:attr:`PowerIterationClustering`. +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +idCol = Param(Params._dummy(), "idCol", + "Name of the input column for vertex IDs.", + typeConverter=TypeConverters.toString) +neighborsCol = Param(Params._dummy(), "neighborsCol", + "Name of the input column for neighbors in the adjacency list " + + "representation.", + typeConverter=TypeConverters.toString) +similaritiesCol = Param(Params._dummy(), "similaritiesCol", +"Name of the input column for non-negative weights (similarities) " + +"of edges between the vertex in `idCol` and each neighbor in " + +"`neighborsCol`", +typeConverter=TypeConverters.toString) + +@since("2.4.0") +def getK(self): +""" +Gets the value of `k` +""" +return self.getOrDefault(self.k) + +@since("2.4.0") +def getInitMode(self): +""" +Gets the value of `initMode` +""" +return self.getOrDefault(self.initMode) + +@since("2.4.0") +def getIdCol(self): +""" +Gets the value of `idCol` +""" +return self.getOrDefault(self.idCol) + +@since("2.4.0") +def getNeighborsCol(self): +""" +Gets the value of `neighborsCol` +""" +return self.getOrDefault(self.neighborsCol) + +@since("2.4.0") +def getSimilaritiesCol(self): +""" +Gets the value of `similaritiesCol` +""" +return self.getOrDefault(self.binary) + + +@inherit_doc +class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable, + JavaMLWritable): +""" +Model produced by [[PowerIterationClustering]]. --- End diff -- The doc is wrong. Copy doc from scala side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184344777 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): +""" +Params for :py:attr:`PowerIterationClustering`. +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +idCol = Param(Params._dummy(), "idCol", + "Name of the input column for vertex IDs.", + typeConverter=TypeConverters.toString) +neighborsCol = Param(Params._dummy(), "neighborsCol", + "Name of the input column for neighbors in the adjacency list " + + "representation.", + typeConverter=TypeConverters.toString) +similaritiesCol = Param(Params._dummy(), "similaritiesCol", +"Name of the input column for non-negative weights (similarities) " + +"of edges between the vertex in `idCol` and each neighbor in " + +"`neighborsCol`", +typeConverter=TypeConverters.toString) + +@since("2.4.0") +def getK(self): +""" +Gets the value of `k` +""" +return self.getOrDefault(self.k) + +@since("2.4.0") +def getInitMode(self): +""" +Gets the value of `initMode` +""" +return self.getOrDefault(self.initMode) + +@since("2.4.0") +def getIdCol(self): +""" +Gets the value of `idCol` +""" +return self.getOrDefault(self.idCol) + +@since("2.4.0") +def getNeighborsCol(self): +""" +Gets the value of `neighborsCol` +""" +return self.getOrDefault(self.neighborsCol) + +@since("2.4.0") +def getSimilaritiesCol(self): +""" +Gets the value of `similaritiesCol` +""" +return self.getOrDefault(self.binary) + + +@inherit_doc +class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable, + JavaMLWritable): +""" +Model produced by [[PowerIterationClustering]]. +>>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> similarities = [] +>>> for i in range (1, n): +...neighbor = [] +...weight = [] +...for j in range (i): +...neighbor.append((long)(j)) +...weight.append(sim(points[i], points[j])) +...similarities.append([(long)(i), neighbor, weight]) +>>> rdd = sc.parallelize(similarities, 2) +>>> schema = StructType([StructField("id", LongType(), F
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184344901 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): +""" +Params for :py:attr:`PowerIterationClustering`. +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +idCol = Param(Params._dummy(), "idCol", + "Name of the input column for vertex IDs.", + typeConverter=TypeConverters.toString) +neighborsCol = Param(Params._dummy(), "neighborsCol", + "Name of the input column for neighbors in the adjacency list " + + "representation.", + typeConverter=TypeConverters.toString) +similaritiesCol = Param(Params._dummy(), "similaritiesCol", +"Name of the input column for non-negative weights (similarities) " + +"of edges between the vertex in `idCol` and each neighbor in " + +"`neighborsCol`", +typeConverter=TypeConverters.toString) + +@since("2.4.0") +def getK(self): +""" +Gets the value of `k` +""" +return self.getOrDefault(self.k) + +@since("2.4.0") +def getInitMode(self): +""" +Gets the value of `initMode` +""" +return self.getOrDefault(self.initMode) + +@since("2.4.0") +def getIdCol(self): +""" +Gets the value of `idCol` +""" +return self.getOrDefault(self.idCol) + +@since("2.4.0") +def getNeighborsCol(self): +""" +Gets the value of `neighborsCol` +""" +return self.getOrDefault(self.neighborsCol) + +@since("2.4.0") +def getSimilaritiesCol(self): +""" +Gets the value of `similaritiesCol` +""" +return self.getOrDefault(self.binary) + + +@inherit_doc +class PowerIterationClustering(JavaTransformer, _PowerIterationClusteringParams, JavaMLReadable, + JavaMLWritable): +""" +Model produced by [[PowerIterationClustering]]. +>>> from pyspark.sql.types import ArrayType, DoubleType, LongType, StructField, StructType +>>> import math +>>> def genCircle(r, n): +... points = [] +... for i in range(0, n): +... theta = 2.0 * math.pi * i / n +... points.append((r * math.cos(theta), r * math.sin(theta))) +... return points +>>> def sim(x, y): +... dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1]) +... return math.exp(-dist / 2.0) +>>> r1 = 1.0 +>>> n1 = 10 +>>> r2 = 4.0 +>>> n2 = 40 +>>> n = n1 + n2 +>>> points = genCircle(r1, n1) + genCircle(r2, n2) +>>> similarities = [] +>>> for i in range (1, n): +...neighbor = [] +...weight = [] +...for j in range (i): +...neighbor.append((long)(j)) +...weight.append(sim(points[i], points[j])) +...similarities.append([(long)(i), neighbor, weight]) +>>> rdd = sc.parallelize(similarities, 2) +>>> schema = StructType([StructField("id", LongType(), F
[GitHub] spark pull request #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21119#discussion_r184345688 --- Diff: python/pyspark/ml/clustering.py --- @@ -1156,6 +1156,201 @@ def getKeepLastCheckpoint(self): return self.getOrDefault(self.keepLastCheckpoint) +class _PowerIterationClusteringParams(JavaParams, HasMaxIter, HasPredictionCol): +""" +Params for :py:attr:`PowerIterationClustering`. +.. versionadded:: 2.4.0 +""" + +k = Param(Params._dummy(), "k", + "The number of clusters to create. Must be > 1.", + typeConverter=TypeConverters.toInt) +initMode = Param(Params._dummy(), "initMode", + "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use " + + "a normalized sum of similarities with other vertices. Supported options: " + + "'random' and 'degree'.", + typeConverter=TypeConverters.toString) +idCol = Param(Params._dummy(), "idCol", + "Name of the input column for vertex IDs.", + typeConverter=TypeConverters.toString) +neighborsCol = Param(Params._dummy(), "neighborsCol", + "Name of the input column for neighbors in the adjacency list " + + "representation.", + typeConverter=TypeConverters.toString) +similaritiesCol = Param(Params._dummy(), "similaritiesCol", +"Name of the input column for non-negative weights (similarities) " + +"of edges between the vertex in `idCol` and each neighbor in " + +"`neighborsCol`", +typeConverter=TypeConverters.toString) + +@since("2.4.0") +def getK(self): +""" +Gets the value of `k` --- End diff -- Should use: :py:attr:`k` and update everywhere else. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21163: [SPARK-24097][ML] Instruments improvements - Rand...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21163 [SPARK-24097][ML] Instruments improvements - RandomForest and GradientBoostedTree ## What changes were proposed in this pull request? Instrumentation improvements for `RandomForest` and `GradientBoostedTree` in the `ml.tree.impl` package. **Note:** I make `Instrumentation` and `OptionalInstrumentation` extend `Serializable`, because the tree algorithms need to output some logs on the executor side and the instrumentation object needs to be broadcast. ## How was this patch tested? Manual. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark instr_rf_gbt Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21163.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21163 commit cd5ea88687b72335647d74c0aeef375de01724d9 Author: WeichenXu Date: 2018-04-26T04:59:18Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
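For context on the `Serializable` note above, here is a minimal, hypothetical sketch of the underlying pattern — the class and object names are invented and this is not the actual `Instrumentation` / `OptionalInstrumentation` code — showing why a logging helper that is broadcast and used inside task closures has to be serializable:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an instrumentation helper: a small logging object
// that extends Serializable so it can be broadcast and used on executors.
class ExecutorSideLogger(val prefix: String) extends Serializable {
  def logInfo(msg: => String): Unit = println(s"[$prefix] $msg")
}

object BroadcastLoggingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    val instr = new ExecutorSideLogger("rf") // created on the driver
    val bcInstr = sc.broadcast(instr)        // broadcasting requires Serializable

    // The closure references the broadcast value, so the helper is
    // deserialized and invoked on the executor side.
    sc.parallelize(1 to 4, 2).foreachPartition { part =>
      bcInstr.value.logInfo(s"processing ${part.size} rows in this partition")
    }

    sc.stop()
  }
}
```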
[GitHub] spark issue #21120: [SPARK-22448][ML] Added sum function to Summerizer and M...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21120 I suspect this will slow down the summarizer performance, because it adds sum statistics internally (and the sum value could possibly overflow). We can directly use `count * mean` to get the sum if we need it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
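A minimal sketch of the `count * mean` suggestion, using the metrics the `ml.stat.Summarizer` already exposes. The data and names here are illustrative, and it assumes unweighted rows (with instance weights the relationship would need the weighted count instead):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

object SumFromMeanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sum-from-mean").getOrCreate()
    import spark.implicits._

    // Illustrative data: three unweighted feature vectors.
    val df = Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0),
      Vectors.dense(5.0, 6.0)
    ).map(Tuple1.apply).toDF("features")

    // Ask the summarizer only for mean and count.
    val row = df
      .select(Summarizer.metrics("mean", "count").summary($"features").as("s"))
      .select("s.mean", "s.count")
      .first()

    val mean = row.getAs[Vector](0)
    val count = row.getLong(1)

    // Column-wise sum recovered as count * mean, without a dedicated sum statistic.
    val sum = Vectors.dense(mean.toArray.map(_ * count))
    println(sum) // [9.0,12.0]

    spark.stop()
  }
}
```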
[GitHub] spark pull request #17086: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r183645675 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 +val tpRate0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val tpRate1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val tpRate2 = (1.0 * w2) / (1.0 * w2 + 0) +val fpRate0 = (1.0 * w2) / (tw - (2.0 * w1 + 1.0 * w2 + 1.0 * w1)) +val fpRate1 = (1.0 * w2) / (tw - (1.0 * w2 + 2.0 * w1 + 1.0 * w2)) +val fpRate2 = (1.0 * w1) / (tw - (1.0 * w2)) +val precision0 = (2.0 * w1) / (2 * w1 + 1 * w2) +val precision1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val precision2 = (1.0 * w2) / (1 * w1 + 1 * w2) +val recall0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val recall1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val recall2 = (1.0 * w2) / (1.0 * w2 + 0) +val f1measure0 = 2 * precision0 * recall0 / (precision0 + recall0) +val f1measure1 = 2 * precision1 * recall1 / (precision1 + recall1) +val f1measure2 = 2 * precision2 * recall2 / (precision2 + recall2) +val f2measure0 = (1 + 2 * 2) * precision0 * recall0 / (2 * 2 * precision0 + recall0) +val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) +val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) + + assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) +assert(math.abs(metrics.truePositiveRate(0.0) - tpRate0) < delta) --- End diff -- Use operator `A ~== B absTol delta` like other tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
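A sketch of what the suggested form could look like for a few of the quoted assertions, using the approximate-equality operators from `org.apache.spark.mllib.util.TestingUtils`. It assumes the `metrics`, `delta`, and expected values defined in the quoted test, and is only an illustration of the review comment, not the final patch:

```scala
import org.apache.spark.mllib.util.TestingUtils._

// Absolute-tolerance comparison instead of math.abs(a - b) < delta
assert(metrics.truePositiveRate(0.0) ~== tpRate0 absTol delta)
assert(metrics.falsePositiveRate(0.0) ~== fpRate0 absTol delta)
assert(metrics.precision(0.0) ~== precision0 absTol delta)
assert(metrics.fMeasure(0.0, 2.0) ~== f2measure0 absTol delta)
```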
[GitHub] spark pull request #17086: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r183647005 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 +val tpRate0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val tpRate1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val tpRate2 = (1.0 * w2) / (1.0 * w2 + 0) +val fpRate0 = (1.0 * w2) / (tw - (2.0 * w1 + 1.0 * w2 + 1.0 * w1)) +val fpRate1 = (1.0 * w2) / (tw - (1.0 * w2 + 2.0 * w1 + 1.0 * w2)) +val fpRate2 = (1.0 * w1) / (tw - (1.0 * w2)) +val precision0 = (2.0 * w1) / (2 * w1 + 1 * w2) +val precision1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val precision2 = (1.0 * w2) / (1 * w1 + 1 * w2) +val recall0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val recall1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val recall2 = (1.0 * w2) / (1.0 * w2 + 0) +val f1measure0 = 2 * precision0 * recall0 / (precision0 + recall0) +val f1measure1 = 2 * precision1 * recall1 / (precision1 + recall1) +val f1measure2 = 2 * precision2 * recall2 / (precision2 + recall2) +val f2measure0 = (1 + 2 * 2) * precision0 * recall0 / (2 * 2 * precision0 + recall0) +val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) +val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) + + assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) +assert(math.abs(metrics.truePositiveRate(0.0) - tpRate0) < delta) +assert(math.abs(metrics.truePositiveRate(1.0) - tpRate1) < delta) +assert(math.abs(metrics.truePositiveRate(2.0) - tpRate2) < delta) +assert(math.abs(metrics.falsePositiveRate(0.0) - fpRate0) < delta) +assert(math.abs(metrics.falsePositiveRate(1.0) - fpRate1) < delta) +assert(math.abs(metrics.falsePositiveRate(2.0) - fpRate2) < delta) +assert(math.abs(metrics.precision(0.0) - precision0) < delta) +assert(math.abs(metrics.precision(1.0) - precision1) < delta) +assert(math.abs(metrics.precision(2.0) - precision2) < delta) +assert(math.abs(metrics.recall(0.0) - recall0) < delta) +assert(math.abs(metrics.recall(1.0) - recall1) < delta) +assert(math.abs(metrics.recall(2.0) - recall2) < delta) +assert(math.abs(metrics.fMeasure(0.0) - f1measure0) < delta) +assert(math.abs(metrics.fMeasure(1.0) - f1measure1) < delta) +assert(math.abs(metrics.fMeasure(2.0) - f1measure2) < delta) 
+assert(math.abs(metrics.fMeasure(0.0, 2.0) - f2measure0) < delta) +assert(math.abs(metrics.fMeasure(1.0, 2.0) - f2measure1) < delta) +assert(math.abs(metrics.fMeasure(2.0, 2.0) - f2measure2) < delta) + +assert(math.abs(metrics.accuracy - + (2.0 * w1 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2) / tw) < delta) +assert(math.abs(metrics.accuracy - metrics.precision) < delta) +assert(math.abs(metrics.accuracy - metrics.recall) < delta) +assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta) +assert(math.abs(metrics.accuracy - metrics.weightedRecall) < delta) +assert(math.abs(metrics.weightedTruePositiveRate - + (((2 * w1 + 1 * w2 + 1 * w1) / tw) * tpRate0 + +((1 * w2 + 2 * w1 + 1 * w2) / tw) * tpRate1 + +(1 * w2 / tw) * tpRate2)) <
[GitHub] spark pull request #17086: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r183645265 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 --- End diff -- use `1E-7` ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17086: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r183647533 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 +val tpRate0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val tpRate1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val tpRate2 = (1.0 * w2) / (1.0 * w2 + 0) +val fpRate0 = (1.0 * w2) / (tw - (2.0 * w1 + 1.0 * w2 + 1.0 * w1)) +val fpRate1 = (1.0 * w2) / (tw - (1.0 * w2 + 2.0 * w1 + 1.0 * w2)) +val fpRate2 = (1.0 * w1) / (tw - (1.0 * w2)) +val precision0 = (2.0 * w1) / (2 * w1 + 1 * w2) +val precision1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val precision2 = (1.0 * w2) / (1 * w1 + 1 * w2) +val recall0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val recall1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val recall2 = (1.0 * w2) / (1.0 * w2 + 0) +val f1measure0 = 2 * precision0 * recall0 / (precision0 + recall0) +val f1measure1 = 2 * precision1 * recall1 / (precision1 + recall1) +val f1measure2 = 2 * precision2 * recall2 / (precision2 + recall2) +val f2measure0 = (1 + 2 * 2) * precision0 * recall0 / (2 * 2 * precision0 + recall0) +val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) +val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) + + assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) --- End diff -- don't `toArray`, use `assert(metrics.confusionMatrix ~== confusionMatrix relTol e)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
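Along the same lines, a sketch of the suggested matrix comparison, again using `TestingUtils` and the `metrics` / `confusionMatrix` values from the quoted test; the `1e-7` tolerance follows the earlier review comment and is illustrative:

```scala
import org.apache.spark.mllib.util.TestingUtils._

// Compare the matrices directly instead of converting to arrays first
assert(metrics.confusionMatrix ~== confusionMatrix relTol 1e-7)
```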
[GitHub] spark pull request #17086: [SPARK-18693][ML][MLLIB] ML Evaluators should use...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17086#discussion_r183646411 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/evaluation/MulticlassMetricsSuite.scala --- @@ -95,4 +95,95 @@ class MulticlassMetricsSuite extends SparkFunSuite with MLlibTestSparkContext { ((4.0 / 9) * f2measure0 + (4.0 / 9) * f2measure1 + (1.0 / 9) * f2measure2)) < delta) assert(metrics.labels.sameElements(labels)) } + + test("Multiclass evaluation metrics with weights") { +/* + * Confusion matrix for 3-class classification with total 9 instances with 2 weights: + * |2 * w1|1 * w2 |1 * w1| true class0 (4 instances) + * |1 * w2|2 * w1 + 1 * w2|0 | true class1 (4 instances) + * |0 |0 |1 * w2| true class2 (1 instance) + */ +val w1 = 2.2 +val w2 = 1.5 +val tw = 2.0 * w1 + 1.0 * w2 + 1.0 * w1 + 1.0 * w2 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2 +val confusionMatrix = Matrices.dense(3, 3, + Array(2 * w1, 1 * w2, 0, 1 * w2, 2 * w1 + 1 * w2, 0, 1 * w1, 0, 1 * w2)) +val labels = Array(0.0, 1.0, 2.0) +val predictionAndLabelsWithWeights = sc.parallelize( + Seq((0.0, 0.0, w1), (0.0, 1.0, w2), (0.0, 0.0, w1), (1.0, 0.0, w2), +(1.0, 1.0, w1), (1.0, 1.0, w2), (1.0, 1.0, w1), (2.0, 2.0, w2), +(2.0, 0.0, w1)), 2) +val metrics = new MulticlassMetrics(predictionAndLabelsWithWeights) +val delta = 0.001 +val tpRate0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val tpRate1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val tpRate2 = (1.0 * w2) / (1.0 * w2 + 0) +val fpRate0 = (1.0 * w2) / (tw - (2.0 * w1 + 1.0 * w2 + 1.0 * w1)) +val fpRate1 = (1.0 * w2) / (tw - (1.0 * w2 + 2.0 * w1 + 1.0 * w2)) +val fpRate2 = (1.0 * w1) / (tw - (1.0 * w2)) +val precision0 = (2.0 * w1) / (2 * w1 + 1 * w2) +val precision1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val precision2 = (1.0 * w2) / (1 * w1 + 1 * w2) +val recall0 = (2.0 * w1) / (2.0 * w1 + 1.0 * w2 + 1.0 * w1) +val recall1 = (2.0 * w1 + 1.0 * w2) / (2.0 * w1 + 1.0 * w2 + 1.0 * w2) +val recall2 = (1.0 * w2) / (1.0 * w2 + 0) +val f1measure0 = 2 * precision0 * recall0 / (precision0 + recall0) +val f1measure1 = 2 * precision1 * recall1 / (precision1 + recall1) +val f1measure2 = 2 * precision2 * recall2 / (precision2 + recall2) +val f2measure0 = (1 + 2 * 2) * precision0 * recall0 / (2 * 2 * precision0 + recall0) +val f2measure1 = (1 + 2 * 2) * precision1 * recall1 / (2 * 2 * precision1 + recall1) +val f2measure2 = (1 + 2 * 2) * precision2 * recall2 / (2 * 2 * precision2 + recall2) + + assert(metrics.confusionMatrix.toArray.sameElements(confusionMatrix.toArray)) +assert(math.abs(metrics.truePositiveRate(0.0) - tpRate0) < delta) +assert(math.abs(metrics.truePositiveRate(1.0) - tpRate1) < delta) +assert(math.abs(metrics.truePositiveRate(2.0) - tpRate2) < delta) +assert(math.abs(metrics.falsePositiveRate(0.0) - fpRate0) < delta) +assert(math.abs(metrics.falsePositiveRate(1.0) - fpRate1) < delta) +assert(math.abs(metrics.falsePositiveRate(2.0) - fpRate2) < delta) +assert(math.abs(metrics.precision(0.0) - precision0) < delta) +assert(math.abs(metrics.precision(1.0) - precision1) < delta) +assert(math.abs(metrics.precision(2.0) - precision2) < delta) +assert(math.abs(metrics.recall(0.0) - recall0) < delta) +assert(math.abs(metrics.recall(1.0) - recall1) < delta) +assert(math.abs(metrics.recall(2.0) - recall2) < delta) +assert(math.abs(metrics.fMeasure(0.0) - f1measure0) < delta) +assert(math.abs(metrics.fMeasure(1.0) - f1measure1) < delta) +assert(math.abs(metrics.fMeasure(2.0) - f1measure2) < delta) 
+assert(math.abs(metrics.fMeasure(0.0, 2.0) - f2measure0) < delta) +assert(math.abs(metrics.fMeasure(1.0, 2.0) - f2measure1) < delta) +assert(math.abs(metrics.fMeasure(2.0, 2.0) - f2measure2) < delta) + +assert(math.abs(metrics.accuracy - + (2.0 * w1 + 2.0 * w1 + 1.0 * w2 + 1.0 * w2) / tw) < delta) +assert(math.abs(metrics.accuracy - metrics.precision) < delta) +assert(math.abs(metrics.accuracy - metrics.recall) < delta) +assert(math.abs(metrics.accuracy - metrics.fMeasure) < delta) +assert(math.abs(metrics.accuracy - metrics.weightedRecall) < delta) +assert(math.abs(metrics.weightedTruePositiveRate - + (((2 * w1 + 1 * w2 + 1 * w1) / tw) * tpRate0 + +((1 * w2 + 2 * w1 + 1 * w2) / tw) * tpRate1 + +(1 * w2 / tw) * tpRate2)) <
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/21129 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org