[GitHub] spark pull request #19347: Branch 2.2: Spark MLlib's output of many algorithms ...
GitHub user ithjz opened a pull request:

    https://github.com/apache/spark/pull/19347

    Branch 2.2: Spark MLlib's output of many algorithms is not clear

    What's the use of these **results** from JavaGradientBoostingRegressionExample? (A sketch of how this output is produced follows the commit list below.)

        Test Mean Squared Error: 0.12503
        Learned regression GBT model:
        TreeEnsembleModel regressor with 3 trees

          Tree 0:
            If (feature 351 <= 15.0)
             Predict: 0.0
            Else (feature 351 > 15.0)
             Predict: 1.0
          Tree 1:
            Predict: 0.0
          Tree 2:
            Predict: 0.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19347.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19347

commit e936a96badfeeb2051ee35dc4b0fbecefa9bf4cb
Author: Peng
Date:   2017-05-24T11:54:17Z

    [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version

    ## What changes were proposed in this pull request?
    Add test cases for PR-18062.

    ## How was this patch tested?
    The existing UT.

    Author: Peng
    Closes #18068 from mpjlu/moreTest.

    (cherry picked from commit 9afcf127d31b5477a539dde6e5f01861532a1c4c)
    Signed-off-by: Yanbo Liang

commit 1d107242f8ec842c009e0b427f6e4a8313d99aa2
Author: zero323
Date:   2017-05-24T11:57:44Z

    [SPARK-20631][FOLLOW-UP] Fix incorrect tests.

    ## What changes were proposed in this pull request?
    - Fix incorrect tests for `_check_thresholds`.
    - Move test to `ParamTests`.

    ## How was this patch tested?
    Unit tests.

    Author: zero323
    Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.

    (cherry picked from commit 1816eb3bef930407dc9e083de08f5105725c55d1)
    Signed-off-by: Yanbo Liang

commit 83aeac9e0590e99010d0af8e067822d0ed0971fe
Author: Bago Amirbekian
Date:   2017-05-24T14:55:38Z

    [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel

    ## What changes were proposed in this pull request?
    Fixed a TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.

    ## How was this patch tested?
    Existing tests run using python3 and numpy 1.12.

    Author: Bago Amirbekian
    Closes #18081 from MrBago/BF-py3floatbug.

    (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3)
    Signed-off-by: Yanbo Liang

commit c59ad420b5fda29567f4a06b5f71df76e70e269a
Author: Liang-Chi Hsieh
Date:   2017-05-24T16:35:40Z

    [SPARK-20848][SQL] Shutdown the pool after reading parquet files

    ## What changes were proposed in this pull request?
    From JIRA: on each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and never stopped, which leads to unbounded growth in the number of threads. We should shut down the pool after reading parquet files (a generic sketch of this shutdown pattern follows the commit list).

    ## How was this patch tested?
    Added a test to ParquetFileFormatSuite.

    Please review http://spark.apache.org/contributing.html before opening a pull request.

    Author: Liang-Chi Hsieh
    Closes #18073 from viirya/SPARK-20848.

    (cherry picked from commit f72ad303f05a6d99513ea3b121375726b177199c)
    Signed-off-by: Wenchen Fan

commit b7a2a16b1e01375292938fc48b0a333ec4e7cd30
Author: Reynold Xin
Date:   2017-05-24T20:57:19Z

    [SPARK-20867][SQL] Move hints from Statistics into HintInfo class

    ## What changes were proposed in this pull request?
    This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future.

    ## How was this patch tested?
    Updated test cases to reflect the change.

    Author: Reynold Xin
    Closes #18087 from rxin/SPARK-20867.

    (cherry picked from commit a64746677bf09ef67e3fd538355a6ee9b5ce8cf4)
    Signed-off-by: Xiao Li

commit 2405afce4e87c0486f2aef1d068f17aea2480b17
Author: Kris Mok
Date:   2017-05-25T00:19:35Z

    [SPARK-20872][SQL] ShuffleExchange.nodeName should handle null coordinator

    ## What changes were proposed in this pull request?
    A one-liner change in `ShuffleExchange.nodeName` to cover the case when `coordinator` is `null`, so that the match expression is exhaustive. Please refer to [SPARK-20872](https://issues.apache.org/ji
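For context on where the quoted output comes from: the tree dump is the `toDebugString` rendering of a `GradientBoostedTreesModel`, and the MSE line is computed from predictions on a held-out split. Below is a minimal Scala sketch in the spirit of the JavaGradientBoostingRegressionExample the user cites; the data path, split ratios, and iteration count are illustrative, not taken from the PR.

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

// Load a LibSVM-format dataset and hold out a test split (path illustrative).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Three boosting iterations, matching the three trees in the quoted dump.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3
val model = GradientBoostedTrees.train(training, boostingStrategy)

// "Test Mean Squared Error" is the mean squared residual on the test set.
val testMSE = test.map { point =>
  val prediction = model.predict(point.features)
  math.pow(point.label - prediction, 2)
}.mean()
println(s"Test Mean Squared Error: $testMSE")

// toDebugString prints the per-tree structure quoted in the PR description.
println(s"Learned regression GBT model:\n${model.toDebugString}")
```

As for the question itself: each `Tree k:` block is one weak learner in the ensemble, and a tree consisting of a single `Predict: 0.0` leaf (trees 1 and 2 in the quoted dump) found no residual worth splitting on, so on that toy dataset the ensemble is effectively just tree 0.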
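One note on the SPARK-20848 commit above: the leak it fixes follows a generic pattern, since a pool created per call and never shut down strands a waiting worker thread on every call. Below is an illustrative (non-Spark) Scala sketch of that pattern; `readFootersInParallel` is a hypothetical helper standing in for the parquet footer reads, not Spark's actual code.

```scala
import java.util.concurrent.{Callable, ForkJoinPool}

// Hypothetical stand-in for the per-call parallel work; the shape of the
// fix is the try/finally bounding the pool's lifetime.
def readFootersInParallel(paths: Seq[String]): Seq[String] = {
  val pool = new ForkJoinPool(8) // a fresh pool per call, as in the bug report
  try {
    val tasks = paths.map { path =>
      pool.submit(new Callable[String] {
        override def call(): String = s"footer-of-$path" // placeholder work
      })
    }
    tasks.map(_.get())
  } finally {
    pool.shutdown() // without this, each call leaks a WAITING worker thread
  }
}
```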
[GitHub] spark issue #18416: [SPARK-21204][SQL] Add support for Scala Set collection ...
Github user ithjz commented on the issue:

    https://github.com/apache/spark/pull/18416

    I ran the examples provided on the official website and got errors about missing packages; I hope someone can help me.

    [hadoop@hadoop01 bin]$ sh spark-shell --master local[9]
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    [2017-07-06 11:59:23,252] WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (org.apache.hadoop.util.NativeCodeLoader:62)
    [2017-07-06 11:59:23,356] WARN SPARK_CLASSPATH was detected (set to '/data/spark/jars/mysql-connector-java-5.1.40-bin.jar:').
    This is deprecated in Spark 1.0+. Please instead use:
     - ./spark-submit with --driver-class-path to augment the driver classpath
     - spark.executor.extraClassPath to augment the executor classpath
    (org.apache.spark.SparkConf:66)
    [2017-07-06 11:59:23,357] WARN Setting 'spark.executor.extraClassPath' to '/data/spark/jars/mysql-connector-java-5.1.40-bin.jar:' as a work-around. (org.apache.spark.SparkConf:66)
    [2017-07-06 11:59:23,357] WARN Setting 'spark.driver.extraClassPath' to '/data/spark/jars/mysql-connector-java-5.1.40-bin.jar:' as a work-around. (org.apache.spark.SparkConf:66)
    Spark context Web UI available at http://192.168.8.29:4040
    Spark context available as 'sc' (master = local[9], app id = local-1499313564077).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
          /_/

    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host1:port1,host2:port2").option("subscribe", "topic1").load()
    java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
      at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
      at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:197)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
      at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
      at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
      at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
      ... 48 elided
    Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
      at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
      at scala.util.Try$.apply(Try.scala:192)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
      at scala.util.Try.orElse(Try.scala:84)
      at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554)
      ... 55 more
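The `ClassNotFoundException: Failed to find data source: kafka` above is not a bug in the example: in Spark 2.1 the structured-streaming Kafka source ships as a separate artifact, `spark-sql-kafka-0-10`, which is not on a stock `spark-shell` classpath. One way to pull it in is the `--packages` flag; the artifact's Scala and Spark versions should match the running build (here 2.11 and 2.1.0):

    $ spark-shell --master local[9] \
        --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0

    scala> val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host1:port1,host2:port2").option("subscribe", "topic1").load()

With the package resolved, the same `readStream` call should return a streaming DataFrame instead of throwing.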