Repository: spark
Updated Branches:
  refs/heads/master d38cf217e -> 28dbde387


[SPARK-7983] [MLLIB] Add require for one-based indices in loadLibSVMFile

jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. No warnings 
or errors from Spark will be reported during their computation afterwards, and 
usually it will lead to wired result for many algorithms (like GBDT).

add a quick check.

Author: Yuhao Yang <hhb...@gmail.com>

Closes #6538 from hhbyyh/loadSVM and squashes the following commits:

79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28dbde38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28dbde38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28dbde38

Branch: refs/heads/master
Commit: 28dbde3874ccdd44b73675938719b69336d23dac
Parents: d38cf21
Author: Yuhao Yang <hhb...@gmail.com>
Authored: Wed Jun 3 13:15:57 2015 +0200
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Jun 3 13:15:57 2015 +0200

----------------------------------------------------------------------
 .../org/apache/spark/mllib/util/MLUtils.scala   | 12 +++++++
 .../apache/spark/mllib/util/MLUtilsSuite.scala  | 35 ++++++++++++++++++++
 2 files changed, 47 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/28dbde38/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
index 541f328..52d6468 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
@@ -82,6 +82,18 @@ object MLUtils {
           val value = indexAndValue(1).toDouble
           (index, value)
         }.unzip
+
+        // check if indices are one-based and in ascending order
+        var previous = -1
+        var i = 0
+        val indicesLength = indices.length
+        while (i < indicesLength) {
+          val current = indices(i)
+          require(current > previous, "indices should be one-based and in 
ascending order" )
+          previous = current
+          i += 1
+        }
+
         (label, indices.toArray, values.toArray)
       }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/28dbde38/mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala 
b/mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala
index 734b7ba..70219e9 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala
@@ -25,6 +25,7 @@ import breeze.linalg.{squaredDistance => 
breezeSquaredDistance}
 import com.google.common.base.Charsets
 import com.google.common.io.Files
 
+import org.apache.spark.SparkException
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
 import org.apache.spark.mllib.regression.LabeledPoint
@@ -108,6 +109,40 @@ class MLUtilsSuite extends SparkFunSuite with 
MLlibTestSparkContext {
     Utils.deleteRecursively(tempDir)
   }
 
+  test("loadLibSVMFile throws IllegalArgumentException when indices is 
zero-based") {
+    val lines =
+      """
+        |0
+        |0 0:4.0 4:5.0 6:6.0
+      """.stripMargin
+    val tempDir = Utils.createTempDir()
+    val file = new File(tempDir.getPath, "part-00000")
+    Files.write(lines, file, Charsets.US_ASCII)
+    val path = tempDir.toURI.toString
+
+    intercept[SparkException] {
+      loadLibSVMFile(sc, path).collect()
+    }
+    Utils.deleteRecursively(tempDir)
+  }
+
+  test("loadLibSVMFile throws IllegalArgumentException when indices is not in 
ascending order") {
+    val lines =
+      """
+        |0
+        |0 3:4.0 2:5.0 6:6.0
+      """.stripMargin
+    val tempDir = Utils.createTempDir()
+    val file = new File(tempDir.getPath, "part-00000")
+    Files.write(lines, file, Charsets.US_ASCII)
+    val path = tempDir.toURI.toString
+
+    intercept[SparkException] {
+      loadLibSVMFile(sc, path).collect()
+    }
+    Utils.deleteRecursively(tempDir)
+  }
+
   test("saveAsLibSVMFile") {
     val examples = sc.parallelize(Seq(
       LabeledPoint(1.1, Vectors.sparse(3, Seq((0, 1.23), (2, 4.56)))),


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to