[GitHub] spark pull request #18872: [SPARK-21723][ML] Fix writing LibSVM (key not fou...
Github user ProtD commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18872#discussion_r133150156

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
    @@ -109,14 +112,15 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       test("write libsvm data and read it again") {
         val df = spark.read.format("libsvm").load(path)
         val tempDir2 = new File(tempDir, "read_write_test")
    --- End diff --

    `Utils.createTempDir` seems to be a nicer way. The directory is automatically deleted when the VM shuts down, so I believe no manual cleanup (cf. the comment lower down) is needed.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
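The cleanup-on-shutdown behaviour mentioned in the review comment can be illustrated with plain JDK calls. This is only a sketch of the general idea, not Spark's `Utils.createTempDir`; the object name `TempDirSketch` and both helper names are hypothetical and were chosen for this example.

```scala
import java.io.File
import java.nio.file.Files

object TempDirSketch {
  // Hypothetical helper (not Spark's implementation): creates a temporary
  // directory and registers a JVM shutdown hook that deletes it recursively.
  def createTempDirWithCleanup(prefix: String): File = {
    val dir = Files.createTempDirectory(prefix).toFile
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = deleteRecursively(dir)
    }))
    dir
  }

  // Deletes children before the directory itself; listFiles() may return null.
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
    ()
  }
}
```

With a helper of this shape, a test can write into the returned directory and skip the explicit cleanup step, because the hook removes the directory when the JVM exits.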
[GitHub] spark issue #18872: [SPARK-21723][ML] Fix writing LibSVM (key not found: num...
Github user ProtD commented on the issue:

    https://github.com/apache/spark/pull/18872

    @srowen Ok, I created and linked a JIRA.
[GitHub] spark pull request #18872: [MLlib] Fix writing LibSVM (key not found: numFea...
Github user ProtD commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18872#discussion_r132671477

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
    @@ -126,6 +130,29 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
         }
       }

    +  test("write libsvm data from scratch and read it again") {
    +    val rawData = new java.util.ArrayList[Row]()
    +    rawData.add(Row(1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))))
    +    rawData.add(Row(4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
    +
    --- End diff --

    Fixed.
[GitHub] spark issue #18872: [MLlib] Fix writing LibSVM (key not found: numFeatures)
Github user ProtD commented on the issue:

    https://github.com/apache/spark/pull/18872

    I added the unit test, please review.
[GitHub] spark issue #18872: [MLlib] Fix writing LibSVM
Github user ProtD commented on the issue:

    https://github.com/apache/spark/pull/18872

    To reproduce the bug on v2.2 and v2.3:

    ```scala
    import org.apache.spark.ml.linalg.Vectors

    // In spark-shell, spark.implicits._ is already imported (needed for toDF).
    // Two labeled rows, each with a sparse feature vector of size 3.
    val rawData = Seq(
      (1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))),
      (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
    val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
    dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
    ```

    This causes `java.util.NoSuchElementException: key not found: numFeatures`.
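For orientation, the text that a successful write of the two rows above would be expected to contain can be sketched without Spark. This assumes the usual LibSVM convention of `label index:value ...` with one-based indices, and that the writer prints values with their default `Double` formatting; `LibSvmLineSketch` and `toLibSvmLine` are hypothetical names for this illustration only.

```scala
object LibSvmLineSketch {
  // Formats one (label, sparse features) pair as a LibSVM text line.
  // `features` holds (zeroBasedIndex, value) pairs; the output uses
  // one-based indices in ascending order.
  def toLibSvmLine(label: Double, features: Seq[(Int, Double)]): String = {
    val pairs = features.sortBy(_._1).map { case (i, v) => s"${i + 1}:$v" }
    (label.toString +: pairs).mkString(" ")
  }
}

// The two rows from the reproduction snippet:
// LibSvmLineSketch.toLibSvmLine(1.0, Seq((0, 2.0), (1, 3.0)))  // "1.0 1:2.0 2:3.0"
// LibSvmLineSketch.toLibSvmLine(4.0, Seq((0, 5.0), (2, 6.0)))  // "4.0 1:5.0 3:6.0"
```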
[GitHub] spark issue #18872: [MLlib] Fix writing LibSVM
Github user ProtD commented on the issue:

    https://github.com/apache/spark/pull/18872

    @srowen It worked in v2.0, but was probably broken in v2.2.0 by commit b3d39620c563e5f6a32a4082aa3908e1009c17d2. The current unit tests check writing only for DataFrames that were previously read from the LibSVM format, not for general ones. (And I guess people don't write LibSVM files very often, which may be why nobody has reported it.)

    @WeichenXu123 Yes, good idea, will do it!
[GitHub] spark pull request #18872: [MLlib] Fix writing LibSVM
GitHub user ProtD opened a pull request:

    https://github.com/apache/spark/pull/18872

    [MLlib] Fix writing LibSVM

    ## What changes were proposed in this pull request?

    Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. @liancheng @HyukjinKwon

    (Maybe using the option when writing should be forbidden in a major version change?)

    ## How was this patch tested?

    Tested manually that loading and writing LibSVM files works, both with and without the numFeatures option.

    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ProtD/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18872.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18872

commit 3b43de07ea43b341aa782d629dff1e5da970916f
Author: Jan Vrsovsky <jan.vrsov...@firma.seznam.cz>
Date:   2017-08-07T16:24:11Z

    check numFeatures only when reading LibSVM -- not when writing
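The change described in this pull request can be modeled in a few lines of plain Scala. This is a simplified sketch, not the actual patch: a `Map[String, String]` stands in for wherever Spark keeps the `numFeatures` value, and `NumFeaturesCheckSketch`, `verifyAlways`, and `verify` are hypothetical names. It shows why an unconditional lookup fails with `key not found: numFeatures` and how restricting the check to the read path avoids that.

```scala
object NumFeaturesCheckSketch {
  // Hypothetical stand-in for the settings attached to the data being
  // read or written; the real storage location in Spark is not modeled.
  type Settings = Map[String, String]

  // Modeled behaviour before the fix: the key is required unconditionally,
  // so a missing key makes Map.apply throw NoSuchElementException.
  def verifyAlways(settings: Settings): Unit = {
    require(settings("numFeatures").toInt > 0, "numFeatures must be positive")
  }

  // Modeled behaviour after the fix: the key is consulted only when reading.
  def verify(settings: Settings, forWriting: Boolean): Unit = {
    if (!forWriting) {
      require(settings("numFeatures").toInt > 0, "numFeatures must be positive")
    }
  }
}
```

Calling `verifyAlways(Map.empty)` throws `java.util.NoSuchElementException: key not found: numFeatures`, whereas `verify(Map.empty, forWriting = true)` returns normally.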