GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/21509

    Check for invalid input type of weight data in ml.PowerIterationClustering

    ## What changes were proposed in this pull request?
    The following test case currently fails. In ml.PowerIterationClustering 
    (PIC), there is no check on the data type of the weight column, so a 
    non-numeric weight only surfaces at runtime as a scala.MatchError.
    ```
    test("invalid input types for weight") {
      val invalidWeightData = spark.createDataFrame(Seq(
        (0L, 1L, "a"),
        (2L, 3L, "b")
      )).toDF("src", "dst", "weight")

      val pic = new PowerIterationClustering()
        .setWeightCol("weight")

      val result = pic.assignClusters(invalidWeightData)
    }
    ```
    ```
    Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor driver): scala.MatchError: [0,1,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
        at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
        at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
    ```
    This PR adds a type check for the weight column.
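
    As a rough illustration of this kind of check (a minimal sketch, not 
    necessarily the code in this patch), the schema can be validated before 
    any rows are processed. The object name `WeightColCheck`, the helper 
    `requireNumericWeight`, and the error message wording are hypothetical; 
    only the public `StructType` and `NumericType` APIs are assumed.

    ```scala
    import org.apache.spark.sql.types.{DataType, NumericType, StructType}

    object WeightColCheck {
      // Hypothetical helper (an assumption, not the PR's actual code):
      // fail fast if the weight column is not numeric, instead of hitting
      // a scala.MatchError later while rows are being pattern-matched.
      def requireNumericWeight(schema: StructType, weightCol: String): Unit = {
        // StructType.apply(name) throws IllegalArgumentException when the
        // column does not exist, so a missing column is also rejected here.
        val actual: DataType = schema(weightCol).dataType
        require(
          actual.isInstanceOf[NumericType],
          s"Column $weightCol must be of a numeric type " +
            s"but was actually of type ${actual.simpleString}.")
      }
    }
    ```

    With the test data above, calling 
    `WeightColCheck.requireNumericWeight(invalidWeightData.schema, "weight")` 
    would throw an IllegalArgumentException on the driver, because the 
    weight column is a string column.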
    ## How was this patch tested?
    A unit test was added.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark testCasePic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21509
    
----
commit 0d6d7be494b6b331a09d91b15a29bac98eac4c74
Author: Shahid <shahidki31@...>
Date:   2018-06-07T20:58:24Z

    Example code for Power Iteration Clustering

----


---
