[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13558 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13538: [MINOR] fix typo in documents
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13538

[MINOR] fix typo in documents

## What changes were proposed in this pull request?

I used a spell-check tool to find typos in the Spark documentation and fixed them.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix_doc_typo

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13538.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13538

commit 8af1bf8f0d1a59aea35633780ca439f6c459bb78
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-06T14:23:10Z

    fix typo in documents
[GitHub] spark pull request #13578: [SPARK-15837][ML][PySpark]Word2vec python add max...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13578

[SPARK-15837][ML][PySpark] Word2vec python add maxsentence parameter

## What changes were proposed in this pull request?

Add the max-sentence-length parameter to the Python Word2Vec API.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark word2vec_python_add_maxsentence

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13578.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13578

commit 57384a0a9ffbfe9befe44cb7a9ae226eff603c94
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-08T21:28:41Z

    word2vec_python_add_maxsentence_param
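For context on what the parameter controls: Spark's Word2Vec splits any sentence longer than the maximum sentence length into chunks of at most that many words before training. A minimal pure-Python sketch of that chunking behavior (the helper name `chunk_sentence` is ours, for illustration only; it is not Spark source):

```python
def chunk_sentence(words, max_sentence_length=1000):
    """Split a token list into chunks of at most max_sentence_length tokens,
    mirroring how Spark's Word2Vec caps sentence length during training."""
    return [words[i:i + max_sentence_length]
            for i in range(0, len(words), max_sentence_length)]

# Assuming the PR is merged, the PySpark parameter would be set like:
#   from pyspark.ml.feature import Word2Vec
#   w2v = Word2Vec(vectorSize=5, maxSentenceLength=1000,
#                  inputCol="text", outputCol="vectors")
```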
[GitHub] spark issue #13578: [SPARK-15837][ML][PySpark]Word2vec python add maxsentenc...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13578 @srowen Hi srowen, I have another similar PR, #13558, which passes the tests on my machine, but the official test fails. It seems to be a problem with the test server; could you help check it?
[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13544 @rxin A small question: `HiveContext` has a `refreshTable` method for refreshing the metadata of a Hive table. In the new `SparkSession` API with Hive support, the method is removed. Does that mean the new API no longer requires users to refresh Hive table metadata?
[GitHub] spark pull request #13558: [SPARK-15820][pyspark][SQL] update python sql int...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13558

[SPARK-15820][pyspark][SQL] update python sql interface refreshTable

## What changes were proposed in this pull request?

Add the `Catalog.refreshTable` API to the Python interface for Spark SQL.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark update_python_sql_interface_refreshTable

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13558.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13558

commit eedb961ebb141f2bafd1c9798bb09d5de939ea5c
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-07T20:04:52Z

    update python sql interface refreshTable
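For readers unfamiliar with the API being exposed: `refreshTable` invalidates Spark's cached metadata (and cached data) for a table, so the next query re-reads it after external changes. A minimal conceptual model of that cache-invalidation behavior, in plain Python (the class and method names are ours, for illustration; this is not Spark source):

```python
class TableMetadataCache:
    """Toy model of a catalog's per-table metadata cache."""

    def __init__(self):
        self._cache = {}

    def lookup(self, table, load):
        # Load and cache metadata on first access; later lookups hit the cache.
        if table not in self._cache:
            self._cache[table] = load()
        return self._cache[table]

    def refresh_table(self, table):
        # Drop the cached entry; the next lookup reloads fresh metadata.
        self._cache.pop(table, None)

# The real PySpark call this PR adds would look like:
#   spark.catalog.refreshTable("my_table")
```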
[GitHub] spark pull request #13538: [MINOR] fix typo in documents
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13538#discussion_r66059465

--- Diff: docs/streaming-programming-guide.md ---

@@ -2037,7 +2037,7 @@ and configuring them to receive different partitions of the data stream from the
 For example, a single Kafka input DStream receiving two topics of data can be split into two
 Kafka input streams, each receiving only one topic. This would run two receivers,
 allowing data to be received in parallel, thus increasing overall throughput. These multiple
-DStreams can be unioned together to create a single DStream. Then the transformations that were
+DStreams can be united together to create a single DStream. Then the transformations that were

--- End diff --

@srowen Restored 'unioned', thanks!
[GitHub] spark issue #13538: [MINOR] fix typo in documents
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13538 @srowen Yes, I checked each .md file, and I think it is done.
[GitHub] spark issue #13381: [SPARK-15608][ml][doc] add_isotonic_regression_doc
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13381 @yanboliang Done.
[GitHub] spark pull request #13525: [MINOR]fix typo a -> an
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13525

[MINOR] fix typo a -> an

## What changes were proposed in this pull request?

a -> an, similar to #13515. I used commands like

    find . -name '*.R' | xargs -i sh -c "grep -in ' a [aeiou]' {} && echo {}"

to generate candidates, and reviewed them one by one.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix_typo_a2an

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13525

commit 59918455fee4a7b88c5c27b60c29a508da3c4500
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-06T04:00:09Z

    fix typo a -> an
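A Python equivalent of the grep pipeline above, for anyone who wants to reproduce the candidate search: it lists every " a <vowel>..." occurrence so it can be reviewed by hand. Manual review matters because many hits (e.g. "a user", "a one-time") are correct English, which is also why a purely mechanical replacement fails. The function name `find_candidates` is ours, for illustration:

```python
import re

# Word-boundary 'a' followed by a space and a vowel-initial word.
CANDIDATE = re.compile(r"\ba [aeiou]\w*", re.IGNORECASE)

def find_candidates(text):
    """Return (line_number, match) pairs for each ' a <vowel>...' occurrence."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in CANDIDATE.finditer(line):
            hits.append((lineno, m.group(0)))
    return hits
```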
[GitHub] spark pull request #13544: [SPARK-15805][SQL][Documents] update sql programm...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13544

[SPARK-15805][SQL][Documents] update sql programming guide

## What changes were proposed in this pull request?

Update the whole SQL programming guide, including:

- use `SparkSession` instead of `SQLContext`
- use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`
- use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`
- use `sparkSession.catalog.cacheTable/uncacheTable` instead of `SQLContext.cacheTable/uncacheTable`
- and so on...

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark update_sql_prog_doc

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13544.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13544

commit 770abc52e2f3d34e8638b2126587daa4af3490c2
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-07T04:03:15Z

    update sql programming guide
[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13544 @rxin
[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13544 @liancheng OK, no problem!
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66699400

--- Diff: docs/sql-programming-guide.md ---

@@ -1607,13 +1600,13 @@ a regular multi-line JSON file will most often fail.

 {% highlight r %}
 # sc is an existing SparkContext.
-sqlContext <- sparkRSQL.init(sc)
+spark <- sparkRSQL.init(sc)

--- End diff --

Currently `sparkRSQL.init` calls `org.apache.spark.sql.api.r.SQLUtils.createSQLContext`, which returns a `SQLContext` object, not a `SparkSession` object. So it seems the R API needs to be updated here as well?
[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13381#discussion_r66700697

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala ---

@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.regression.IsotonicRegression
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+/**
+ * An example demonstrating Isotonic Regression.
+ * Run with
+ * {{{
+ * bin/run-example ml.IsotonicRegressionExample
+ * }}}
+ */
+object IsotonicRegressionExample {
+
+  def main(args: Array[String]): Unit = {
+
+    // Creates a SparkSession.
+    val spark = SparkSession
+      .builder
+      .appName(s"${this.getClass.getSimpleName}")
+      .getOrCreate()
+
+    // $example on$
+    // Loads data.
+    val dataset = spark.read.format("libsvm")
+      .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
+
+    // Trains an isotonic regression model.
+    val ir = new IsotonicRegression()
+    val model = ir.fit(dataset)
+
+    println(s"Boundaries in increasing order: ${model.boundaries}")
+    println(s"Predictions associated with the boundaries: ${model.predictions}")
+
+    // Makes predictions.
+    model.transform(dataset).show

--- End diff --

@jkbradley Done.
[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13592#discussion_r66700281

--- Diff: docs/sql-programming-guide.md ---

@@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
 registered as a table. Tables can be used in subsequent SQL statements.

 {% highlight scala %}
-// sc is an existing SparkContext.
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+val spark: SparkSession // An existing SparkSession

 // this is used to implicitly convert an RDD to a DataFrame.
-import sqlContext.implicits._
+import spark.implicits._

 // Define the schema using a case class.
 // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
 // you can use custom classes that implement the Product interface.
 case class Person(name: String, age: Int)

-// Create an RDD of Person objects and register it as a table.
-val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
+// Create an RDD of Person objects and register it as a temporary view.
+val people = sc
+  .textFile("examples/src/main/resources/people.txt")
+  .map(_.split(","))
+  .map(p => Person(p(0), p(1).trim.toInt))
+  .toDF()
 people.createOrReplaceTempView("people")

--- End diff --

It seems better to change the input data file to JSON format; then we could use `SparkSession.read.json('path/to/data.json')`, avoid `SparkContext` entirely, and get a `DataFrame` directly, which would simplify the example code.
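To make the suggestion concrete: the "name,age" lines of `people.txt` would be converted to JSON Lines, which `spark.read.json` can load directly into a DataFrame with no SparkContext/RDD step. A sketch of that data-file conversion (the function name `csv_to_json_lines` is ours, for illustration):

```python
import json

def csv_to_json_lines(csv_text):
    """Convert 'name,age' lines (people.txt format) into JSON Lines."""
    out = []
    for line in csv_text.strip().splitlines():
        name, age = line.split(",")
        out.append(json.dumps({"name": name, "age": int(age.strip())}))
    return "\n".join(out)

# With a people.json produced this way, the Scala example would shrink to:
#   val people = spark.read.json("examples/src/main/resources/people.json")
#   people.createOrReplaceTempView("people")
```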
[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13381#discussion_r66715565

--- Diff: docs/ml-classification-regression.md ---

@@ -685,6 +685,76 @@ The implementation matches the result from R's survival function
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally isotonic regression is a problem where
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
+finding a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to complete order subject to
+`$x_1\le x_2\le ...\le x_n$` where `$w_i$` are positive weights.
+The resulting function is called isotonic regression and it is unique.
+It can be viewed as least squares problem under order restriction.
+Essentially isotonic regression is a
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+best fitting the original data points.
+
+In `spark.ml`, we implement a

--- End diff --

@jkbradley Done.
[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13381#discussion_r66961848

--- Diff: examples/src/main/python/mllib/isotonic_regression_example.py ---

@@ -23,18 +23,22 @@
 from pyspark import SparkContext
 # $example on$
 import math
-from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel
+from pyspark.mllib.regression import LabeledPoint, IsotonicRegression, IsotonicRegressionModel
 # $example off$

 if __name__ == "__main__":

     sc = SparkContext(appName="PythonIsotonicRegressionExample")

     # $example on$
-    data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")
+    # Load and parse the data
+    def parsePoint(line):
+        values = [float(x) for x in line.replace(',', ' ').replace(':', ' ').split(' ')]
+        return (values[0], values[2], 1.0)
+
+    data = sc.textFile("data/mllib/sample_isotonic_regression_libsvm_data.txt")

--- End diff --

@yanboliang Done.
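A standalone version of the `parsePoint` helper from the diff above, so its behavior can be checked without Spark: it turns one libsvm-format line ("label index:value") into the (label, feature, weight) tuple the MLlib isotonic regression example expects (the snake_case name is ours):

```python
def parse_point(line):
    """Parse a libsvm-format line 'label index:value' into (label, feature, weight)."""
    # Normalize separators, then split into numeric fields.
    values = [float(x) for x in line.replace(',', ' ').replace(':', ' ').split(' ')]
    # values[0] is the label, values[2] the single feature value; weight is fixed at 1.0.
    return (values[0], values[2], 1.0)
```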
[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13558 @andrewor14 This looks strange; I tested on my own machine and everything is OK. Could it be a problem with the test server?
[GitHub] spark issue #13525: [MINOR]fix typo a -> an
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13525 @srowen OK, so the command doesn't work correctly.
[GitHub] spark pull request #13525: [MINOR]fix typo a -> an
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13525
[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13381

[SPARK-15608][ml][doc] add_isotonic_regression_doc

## What changes were proposed in this pull request?

- add ML doc for isotonic regression
- add a Scala example for ML isotonic regression

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark add_isotonic_regression_doc

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13381.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13381

commit 1c974e31dba06491495f432b1d24e16e67baa13c
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-28T14:32:24Z

    add_isotonic_regression_doc
[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13381#issuecomment-222353670 @holdenk Java & Python examples added. Thanks!
[GitHub] spark pull request: [SPARK-15533][SQL]Deprecate Dataset.explode
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13313

[SPARK-15533][SQL] Deprecate Dataset.explode

## What changes were proposed in this pull request?

Deprecate `Dataset.explode`.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix-SPARK-15533

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13313.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13313

commit aaac61a015133ac09a92a8edaeea423254d94d8b
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-25T23:15:40Z

    Deprecate Dataset.explode
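For context on the deprecation: users are steered toward the `explode()` SQL function instead of the `Dataset.explode` method. Its row-level semantics, sketched in plain Python terms (the helper `explode_rows` is ours, not Spark API): each element of an array column becomes its own row, and rows with empty arrays disappear.

```python
def explode_rows(rows, array_key):
    """Flatten the list under `array_key` so each element gets its own row."""
    out = []
    for row in rows:
        for item in row[array_key]:
            new_row = dict(row)
            new_row[array_key] = item  # one element per output row
            out.append(new_row)
    return out

# The DataFrame equivalent in PySpark:
#   from pyspark.sql.functions import explode, col
#   df.select("name", explode(col("items")).alias("item"))
```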
[GitHub] spark pull request: [SPARK-15533][SQL]Deprecate Dataset.explode
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13313
[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13275#issuecomment-222069163 @jkbradley `--modules='pyspark-ml'` runs a batch of tests in the pyspark subdirectory in parallel, not a single Python file. My purpose is to add a way to debug a single Python file, because the Python test code has to be launched with bin/pyspark, which makes it inconvenient to debug directly in an IDE.
[GitHub] spark pull request #13441: [SPARK-15702][Documentation]Update document progr...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13441 [SPARK-15702][Documentation]Update document programming-guide accumulator section

## What changes were proposed in this pull request?
Update the accumulator section of the programming-guide document (Scala language only). The Java and Python APIs are not finished yet, so I did not modify those versions.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark update_doc_accumulatorV2_clean
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13441.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13441

commit 598b28dbe1e3ad3f0afd885e72c4a368c3cd9a09
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-06-01T04:01:08Z
Update document programming-guide accumulator section
[GitHub] spark pull request #12987: [spark-15212][SQL]CSV file reader when read file ...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/12987
[GitHub] spark pull request: [SPARK-15670][Java API][Spark Core]label_accumulator_dep...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13412 [SPARK-15670][Java API][Spark Core]label_accumulator_deprecate_in_java_spark_context

## What changes were proposed in this pull request?
Add deprecated annotations for the accumulator V1 interface in the JavaSparkContext class.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark label_accumulator_deprecate_in_java_spark_context
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13412.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13412

commit 8761c52a8ba39de7ef8b0f99976b103ec7f1cd2e
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-31T03:01:34Z
label_accumulator_deprecate_in_java_spark_context
[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/13558 @andrewor14 The PR is OK now.
[GitHub] spark pull request #13544: [SPARK-15805][SQL][Documents] update sql programm...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13544
[GitHub] spark pull request #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTab...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13558#discussion_r67496513

--- Diff: python/pyspark/sql/catalog.py ---
@@ -232,6 +232,11 @@ def clearCache(self):
     """Removes all cached tables from the in-memory cache."""
     self._jcatalog.clearCache()
+
+@since(2.0)
+def refreshTable(self, tableName):
+    """Invalidate and refresh all the cached the metadata of the given table."""
--- End diff --

@MLnick Done. I also corrected the comment with the extra "the" in the Scala code. Thanks!
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13172#issuecomment-220260517 @srowen Modified as you expected.
[GitHub] spark pull request: [SPARK-15461][tests][pyspark]modify python tes...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13240 [SPARK-15461][tests][pyspark]modify python test script using version default 2.7

## What changes were proposed in this pull request?
Update the default python version used in python/run_tests.py to python 2.7.

## How was this patch tested?
Existing test.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark modify_python_test_use_version_to_27
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13240.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13240

commit 04bd30c55134400f01dad845431374a73ffc0406
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-21T05:09:22Z
modify python test script using version default 2.7
[GitHub] spark pull request: [SPARK-15464][ML][MLlib][SQL][Tests] Replace S...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/13242#discussion_r64147805

--- Diff: python/pyspark/ml/clustering.py ---
@@ -933,21 +933,20 @@ def getKeepLastCheckpoint(self):
 if __name__ == "__main__":
     import doctest
     import pyspark.ml.clustering
-    from pyspark.context import SparkContext
-    from pyspark.sql import SQLContext
+    from pyspark.sql import SparkSession
     globs = pyspark.ml.clustering.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
-    sc = SparkContext("local[2]", "ml.clustering tests")
-    sqlContext = SQLContext(sc)
+    spark = SparkSession.builder.master("local[2]").appName("ml.clustering tests").getOrCreate()
--- End diff --

@techaddict Done.
[GitHub] spark pull request: [SPARK-15461][tests][pyspark]modify python tes...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13240
[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13007
[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13275#issuecomment-221203315 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13275 [SPARK-15499][PySpark][Tests] Add python testsuite with remote debug and single test parameter to help developers debug code more easily

## What changes were proposed in this pull request?
To the python/run-tests.py script, I added the following parameters:
--single-test=SINGLE_TEST specify a python module to run as a single python test.
--debug-server=DEBUG_SERVER debug server host, only used in single test.
--debug-port=DEBUG_PORT debug server port, only used in single test.
Now, for example, if I want to debug only pyspark.tests, I can use the following command (first start the debug server in a python IDE such as PyCharm or PyDev on the machine with host MY_DEV_MACHINE, using debug port 5678, and make sure the machine running the test can connect to MY_DEV_MACHINE):
python/run-tests --python-executables=python2.7 --single-test=pyspark.tests --debug-server=MY_DEV_MACHINE --debug-port=5678
The parameter --single-test specifies the single python testsuite you want to run; currently you can use the following values:
`
pyspark.tests
pyspark.ml.tests
pyspark.mllib.tests
pyspark.sql.tests
pyspark.streaming.tests
`

## How was this patch tested?
Existing test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark add_python_remote_debug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13275.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13275

commit 872ebdedc80a76b9bf064bda52af6074d16e7efc
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-24T05:32:35Z
add python test with single test and remote debug param

commit e694d10f27d2da9607c6f52bb7b2a2e2fd50e0fd
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-24T05:41:53Z
remove useless code which used for debug by myself
[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13275#issuecomment-221210606 @davies What do you think about it?
[GitHub] spark pull request: [SPARK-15464][ML][MLlib][SQL][Tests] Replace S...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13242 [SPARK-15464][ML][MLlib][SQL][Tests] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code

## What changes were proposed in this pull request?
Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.

## How was this patch tested?
Existing test.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark python_doctest_update_sparksession
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13242.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13242

commit e6ff76a5309b4b5d0b23e4eccba413cddab5daa0
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-21T16:09:18Z
replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
[GitHub] spark pull request: [SPARK-15446][build][sql] modify catalyst usin...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/13224
[GitHub] spark pull request: [SPARK-15446][build][sql] modify catalyst usin...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13224 [SPARK-15446][build][sql] modify catalyst using longValueExact not supporting java 7

## What changes were proposed in this pull request?
Change to BigInteger.longValue, which is supported on Java 7 (longValueExact is Java 8 only).

## How was this patch tested?

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark modify_compile_err_catalyst_longValueExact_error
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13224.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13224

commit 991e34932854194598d5aa05402d53440b750a1d
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-20T15:30:23Z
modify catalyst using longValueExact not supporting java 7
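For context, the difference between the two java.math.BigInteger calls can be simulated in Python (this is an illustrative simulation of Java's 64-bit long semantics, not Spark code): `longValueExact` (added in Java 8) raises on overflow, while the Java-7-compatible `longValue` silently wraps.

```python
# Simulation (not Spark code) of the two BigInteger-to-long conversions:
# longValue (Java 7) wraps to signed 64 bits; longValueExact (Java 8+) raises.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def long_value(big):
    # Java-7-style longValue: keep the low 64 bits, interpreted as signed.
    return ((big - LONG_MIN) % 2**64) + LONG_MIN

def long_value_exact(big):
    # Java-8 longValueExact: raise if the value does not fit in a long.
    if not (LONG_MIN <= big <= LONG_MAX):
        raise ArithmeticError("BigInteger out of long range")
    return big
```

So switching to longValue keeps Java 7 compatibility at the cost of silent truncation on out-of-range values.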
[GitHub] spark pull request: [SPARK-15350][mllib]add unit test function for...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13136 [SPARK-15350][mllib]add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

## What changes were proposed in this pull request?
Add a unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite, and update the test function names in the class to distinguish LR with SGD from LR with LBFGS.

## How was this patch tested?
The testsuite I modified.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark add_testsuite_LR_LBFGS
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13136.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13136

commit a8593ab94861a2332380a23dd9c849a1b95296fd
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-16T15:15:53Z
add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/12978#issuecomment-220021066 In order to check this potential problem more carefully, we can add test code like the following:

```
echo "$newpid" > "$pid"
for i in {1..30}; do
  echo "check daemon stage $i"
  ps -p "$newpid" -o comm=
  sleep 0.1
done
sleep 2
# Check if the process has died; in that case we'll tail the log so the user can see
if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
  echo "failed to launch $command:"
  tail -2 "$log" | sed 's/^/ /'
  echo "full log in $log"
fi
```

Then running the start-daemon script prints something like:

sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /diskext/mySpark/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-n131.out
check daemon stage 1
bash
check daemon stage 2
bash
check daemon stage 3
bash
check daemon stage 4
bash
check daemon stage 5
java
check daemon stage 6
java
check daemon stage 7
java
check daemon stage 8
java

The result shows that the daemon stays in the bash stage for some time. Especially when the machine has just started up and the OS cache is cold, that time may exceed 2s.
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/12978#issuecomment-220016724 @srowen Following your suggestion, I added a loop to check whether it has passed STAGE-1 and launched the java daemon. And I restored the check statement `! $(ps -p "$newpid" -o comm=) =~ "java"`
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/12978#discussion_r63700865

--- Diff: sbin/spark-daemon.sh ---
@@ -162,6 +162,15 @@ run_command() {
   esac
   echo "$newpid" > "$pid"
+
+  for i in {1..10}
+  do
+    if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
+      break
+    fi
+    sleep 0.5
--- End diff --

Oh, the `sleep 2` statement can't move. The loop may take at most 5 seconds to check whether the daemon has turned into a java daemon; when it becomes a java daemon, it breaks out of the loop. **Then it must sleep 2 seconds** and only then check whether the java daemon has died, because the java daemon may run for some time, then throw an exception and die. So we need to wait here for some time.
[GitHub] spark pull request: [SPARK-15322][mllib][core][sql]update deprecat...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13112#issuecomment-219980697 @srowen updated. Seems no problem.
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/12978#discussion_r63708450

--- Diff: sbin/spark-daemon.sh ---
@@ -162,6 +162,15 @@ run_command() {
   esac
   echo "$newpid" > "$pid"
+
+  for i in {1..10}
+  do
+    if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
+      break
+    fi
+    sleep 0.5
--- End diff --

The new clean PR is here: https://github.com/apache/spark/pull/13172 thanks.
[GitHub] spark pull request: [SPARK-15203][Deploy]fix bug SPARK-15203
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13172 [SPARK-15203][Deploy]fix bug SPARK-15203

## What changes were proposed in this pull request?
fix bug SPARK-15203

## How was this patch tested?
existing test.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark fix-spark-15203
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13172.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13172

commit 55d911f29137441ddaa1e9022bf85edac86f72af
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-18T13:52:14Z
fix bug SPARK-15203
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 closed the pull request at: https://github.com/apache/spark/pull/12978
[GitHub] spark pull request: [SPARK-15322][mllib]update deprecate accumulat...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13112 [SPARK-15322][mllib]update deprecate accumulator usage into accumulatorV2 in mllib

## What changes were proposed in this pull request?
MLlib code uses the deprecated sc.accumulator method in two places; update them:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala line 282
mllib/src/main/scala/org/apache/spark/ml/util/stopwatches.scala line 106

## How was this patch tested?
Rerun build and test.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark update_accuV2_in_mllib
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13112.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13112

commit 2761dff513eb2da87464735722807e3ea0ea7676
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-14T06:10:34Z
update deprecate accumulator usage into accumulatorV2 in mllib
[GitHub] spark pull request: [SPARK-15322][mllib]update deprecate accumulat...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13112#issuecomment-219291622 @srowen I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator in the whole Spark project and updated the code (except the test code for the accumulator method itself). @HyukjinKwon I updated import org.apache.spark.{SparkContext} ==> import org.apache.spark.SparkContext
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/12978#issuecomment-218089784 @srowen On my virtual machine, after the OS starts and the spark daemon is launched, stage 1 described above takes a long time, often exceeding 2s. I think the java daemon needs to load a huge number of jars, so it takes some time when started for the first time.
[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13007#issuecomment-218094762 @HyukjinKwon The current CSV load code uses Hadoop's `LineRecordReader`, which does not allow a row to be split across multiple lines, so I think the code should either disable the CSV multi-line format or replace `LineRecordReader` with a reader that supports multi-line CSV.
[GitHub] spark pull request: fix CSV file data-line with newline at first l...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/13007 fix CSV file data-line with newline at first line load error

## What changes were proposed in this pull request?
Fix the load error that occurs when the first data line of a CSV file contains a newline.

## How was this patch tested?
Tested as described in https://issues.apache.org/jira/browse/SPARK-15226

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark fix_firstline_csv
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13007.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13007
commit 049743be3672e5b04e5b02597add0ff49dd80de5
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-09T15:25:54Z
fix CSV file data-line with newline at first line load error
[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/13007#issuecomment-218037385 @HyukjinKwon I ran the existing tests against this patch and they all pass. If needed I will add a new test in CSVSuite. I think the only cause of the bug is the first-line handling in the reading code: a multi-line row is legal in the CSV format, and the current CSV data source, which uses the `com.univocity.parsers.csv` library, also supports it. The bug occurs only when a multi-line data row appears at the beginning of the file, and I think this patch fixes it easily.
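The claim that a multi-line row is legal CSV can be illustrated outside Spark; a minimal sketch using Python's standard csv module (not the univocity parser the data source actually uses):

```python
import csv
import io

# A quoted field may legally contain newlines, so a single CSV record can
# span multiple physical lines -- including the very first data row.
data = 'col1,col2\n"line1\nline2",42\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['col1', 'col2'], ['line1\nline2', '42']]
```

A line-oriented reader like Hadoop's `LineRecordReader` would split that quoted field at the embedded newline, which is exactly the failure mode discussed above.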
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/12978#issuecomment-218040874 @srowen Because spark-daemon.sh uses the `exec` command to start the Java daemon process, when the script runs, the spark-daemon.sh process exists for a short time (stage 1) and then turns into a Java process (stage 2); if the Java process throws an exception, it dies. If stage 1 takes more than 2s, the check `if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]` fails and regards the process as failed, when in fact it has not yet reached stage 2. On my virtual machine this happens often.
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
Github user WeichenXu123 commented on the pull request: https://github.com/apache/spark/pull/12978#issuecomment-218097327 @srowen Er... I also find it a little strange that stage 1 can take so long, but it really did happen several times. When I have time I will do a more detailed test.
[GitHub] spark pull request: [spark-15212][SQL]CSV file reader when read fi...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/12987 [spark-15212][SQL]CSV file reader when read file with first line schema do not filter blank in schema column name

## What changes were proposed in this pull request?
When loading a CSV file with a header schema, trim the schema column name strings to avoid the problem of blanks in column names.

## How was this patch tested?
Constructed a CSV data file whose schema definition contains a column name with a leading blank, such as:

col1, col2,col3,col4,col5
1997,Ford,E350,"ac, abs, moon",3000.00

Notice there is a blank before col2. Test commands:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
var reader = sqlContext.read
reader.option("header", true)
var df = reader.csv("path/to/csvfile")
df.select("col2") // check if OK
df.registerTempTable("tab1")
sqlContext.sql("select col2 from tab1") // check if OK

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark bugfix-15212
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12987.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12987
commit 56cc34f37c712d172df6fbc27b2823a571160cda
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-08T12:52:09Z
fix bug: spark-15212
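The fix idea (trimming blanks from header names) can be sketched outside Spark; a minimal illustration with Python's csv module, where the data mirrors the example above:

```python
import csv
import io

# The header contains ` col2` with a leading blank; stripping the names
# lets callers reference the column as plain `col2`.
data = 'col1, col2,col3,col4,col5\n1997,Ford,E350,"ac, abs, moon",3000.00\n'
reader = csv.reader(io.StringIO(data))
header = [name.strip() for name in next(reader)]
row = dict(zip(header, next(reader)))
print(row['col2'])  # Ford
```

Without the `strip()`, the column would only be addressable as `" col2"`, which is the lookup failure the PR fixes.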
[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/12978 [SPARK-15203][Deploy]The spark daemon shell script error, daemon process start successfully but script output fail message.

The bug occurs because sbin/spark-daemon.sh uses bin/spark-class to start the daemon, then sleeps 2s and checks whether the daemon process exists, with shell code like the following:

if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]

The problem is that a machine with poor performance may take a long time (more than 2s) to start the daemon yet still start it successfully; in that case the check `! $(ps -p "$newpid" -o comm=) =~ "java"` fails, because at that moment the $newpid process is still the shell process, and it only turns into a java process once the daemon has started. That is what causes the bug. To reproduce the bug more easily, you can change `sleep 2` to `sleep 0.01` (sbin/spark-daemon.sh, line 165). To fix the bug, I replace `! $(ps -p "$newpid" -o comm=) =~ "java"` (sbin/spark-daemon.sh, line 167) with `-z $(ps --no-headers -p "$newpid")`.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark fix_shell_001
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12978.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12978
commit 750078c3ab0d34f55db935b9518b3288f98cade1
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-05-07T13:00:58Z
fix shell start daemon script bug
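The replacement liveness check can be illustrated outside spark-daemon.sh; a minimal Python sketch, assuming a Linux procps `ps` that supports `--no-headers`, showing that the output is non-empty for a live process regardless of its command name:

```python
import os
import subprocess

def is_alive(pid: int) -> bool:
    # `ps --no-headers -p PID` prints one line if the process exists and
    # nothing otherwise, so this liveness test does not depend on whether
    # the command name has already become "java".
    out = subprocess.run(
        ["ps", "--no-headers", "-p", str(pid)],
        capture_output=True, text=True,
    ).stdout
    return out.strip() != ""

print(is_alive(os.getpid()))  # True
```

This is why the `-z $(ps --no-headers -p "$newpid")` form tolerates a slow stage 1: it only asks "does the process exist?", not "is it java yet?".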
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen The SparkContext, by default, runs a cleaner in the background to release unreferenced RDDs/broadcasts. But I think we had better release them ourselves, because the SparkContext auto-cleaner depends on Java GC: if GC is not triggered, the cleaner won't release the unused broadcasts. As we can see, if a driver-side broadcast has been unpersisted but not destroyed, its metadata is kept in the context and the content of the broadcast variable is kept on the driver side; here in KMeans `bcNewCenters` is a two-dimensional array, so I think it is better to release them as soon as possible. About the overhead: tracking these historical `bcNewCenters` broadcasts only needs a reference list, and they can be destroyed asynchronously, so the overhead is acceptable.
[GitHub] spark issue #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBatch met...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14335 @srowen `stats.unpersist(false)` has been changed to `stats.unpersist()` and pushed. Is there anything else that needs updating?
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the places where `RDD.persist` is referenced: AFTSurvivalRegression, LinearRegression, and LogisticRegression persist the input training RDD and unpersist it when `train` returns, which seems OK. `recommendation.ALS` persists many RDDs and appears to unpersist them all correctly. The mllib `BisectingKMeans.run` contains a TODO "unpersist old indices"; I'll check it now. The others seem OK. The places where `Broadcast.persist` is referenced were already checked in this PR; I think they are all properly handled here.
[GitHub] spark pull request #14203: update python dataframe.drop
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14203 update python dataframe.drop

## What changes were proposed in this pull request?
Make the `dataframe.drop` API in Python support multiple columns as parameters, so that it matches the Scala API.

## How was this patch tested?
The doc tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark drop_python_api
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14203.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14203
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 yeah, but the `bcSyn0Global` in Word2Vec is a different case; it looks safe to destroy there, because in each loop iteration the RDD transformation that uses `bcSyn0Global` ends with a `collect`. After the `collect` action we no longer need the RDD (all of its computation has been done, and no further recovery is possible), so we can also destroy `bcSyn0Global` directly in each loop iteration.
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003627

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
         gammaPart = gammad :: gammaPart
       }
       Iterator((stat, gammaPart))
-    }
+    }.persist(StorageLevel.MEMORY_AND_DISK)
     val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
       _ += _, _ += _)
-    expElogbetaBc.unpersist()
     val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
       stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
+    stats.unpersist(false)
--- End diff --

@srowen yeah, here we'd better make it consistent with the others.
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen The `bcNewCenters` in `KMeans` has a problem. Checking the code logic in detail, we can see that each loop iteration should destroy the broadcast variable `bcNewCenters` generated in the previous iteration, not the one generated in the current iteration, like what is done for `costs: RDD`, which uses a `preCosts` variable to hold the value from the previous iteration. I have updated the code. On the second question, the meaning of `broadcast.unpersist`: I think there is another scenario. Suppose there is an RDD lineage that executes successfully in the normal case, and in the code we unpersist useless broadcast variables promptly. If some exception happens, Spark can recover from it: it needs to rebuild the broken RDD from the lineage, and in that case it may re-use the broadcast variable we unpersisted. If we simply destroy it, the broadcast variable cannot be recovered, so the recovery will fail. Therefore I think the safe place to call `broadcast.destroy` is where some RDD action has executed successfully and the whole RDD lineage is no longer needed.
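The unpersist/destroy distinction described above can be sketched with a toy model; this is plain Python with a hypothetical `Broadcast` class, not Spark's actual API:

```python
class Broadcast:
    """Toy model of a broadcast variable's lifecycle (not Spark's API)."""

    def __init__(self, value):
        self._source = value   # authoritative copy kept on the driver
        self._cached = value   # copies cached on executors
        self._destroyed = False

    def unpersist(self):
        # Executors drop their cached copies, but the value can still be
        # re-fetched from the driver if lineage recovery needs it.
        self._cached = None

    def destroy(self):
        # The value is gone for good; any later access fails.
        self._cached = None
        self._source = None
        self._destroyed = True

    def value(self):
        if self._destroyed:
            raise RuntimeError("broadcast was destroyed")
        if self._cached is None:
            self._cached = self._source  # re-fetch, as recovery would
        return self._cached

bc = Broadcast([1.0, 2.0])
bc.unpersist()
print(bc.value())  # [1.0, 2.0] -- still recoverable after unpersist
bc.destroy()
# bc.value() would now raise RuntimeError
```

This mirrors the argument in the comment: `unpersist` is safe mid-lineage because recovery can re-fetch the value, while `destroy` is only safe once an action has completed and the lineage is no longer needed.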
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003530

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
         gammaPart = gammad :: gammaPart
       }
       Iterator((stat, gammaPart))
-    }
+    }.persist(StorageLevel.MEMORY_AND_DISK)
     val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
       _ += _, _ += _)
-    expElogbetaBc.unpersist()
     val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
       stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
--- End diff --

@srowen You mean change `stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix)` to `stats.map(_._2).flatMap(list => list).map(_.toDenseMatrix).collect()`? Will the latter run faster?
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72003428

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
         gammaPart = gammad :: gammaPart
       }
       Iterator((stat, gammaPart))
-    }
+    }.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

@srowen The type of the RDD persisted here is fixed to `RDD[(BDM[Double], List[BDV[Double]])]`, so what is the risk that it cannot be persisted?
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen yeah, the code logic here seems confusing, but I think it is right. Let me explain it clearly. In essence the logic can be expressed as: A0 -> I1 -> A1 -> I2 -> A2 -> ... where A0 is the initial `assignments`, I1 is the step-1 `indices`, A1 is the step-1 `assignments`, I2 is the step-2 `indices`, and so on; the arrows show the dependencies between them. The key point is that when we compute I(K), we must make sure I(K-1) is persisted, while I(K-2) and older ones can be unpersisted. Now, checking my code logic, each iteration does the following:
1. unpersist I(K-1)
2. compute I(K+1) using A(K); because of the dependency, A(K) must use I(K), and I(K) is STILL PERSISTED
3. compute A(K+1) using I(K+1)
But I found another problem in BisectingKMeans: at line 191 there is an iteration that also needs this pattern ("persist the current step's RDD, unpersist the previous one"), and that iteration's first RDD relates to the code here. The problem seems a little troublesome, so should we create another PR to handle it?
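The rolling persist/unpersist pattern above can be sketched in plain Python (toy step names, not Spark RDDs): persist the current step, and only unpersist steps older than the previous one:

```python
def rolling_persist(num_steps):
    persisted = []  # step names currently persisted, oldest first
    log = []
    for k in range(1, num_steps + 1):
        cur = f"I{k}"
        persisted.append(cur)  # persist I(K) before anything depends on it
        # Computing I(K+1) depends on I(K); once I(K) is persisted,
        # everything older than I(K-1) is safe to unpersist.
        while len(persisted) > 2:
            log.append(f"unpersist {persisted.pop(0)}")
    return persisted, log

final, log = rolling_persist(4)
print(final)  # ['I3', 'I4'] -- only the latest two steps stay persisted
print(log)    # ['unpersist I1', 'unpersist I2']
```

The invariant is exactly the one argued in the comment: at the moment I(K+1) is computed, I(K) is still persisted, so the unpersist of I(K-1) never breaks a live dependency.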
[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14326 @yanboliang I went through the code and there are several problems to solve. The robust regression has a parameter `sigma` which must be > 0, so it is a bound-constrained optimization problem and should use LBFGS-B. But in my testing the current Breeze LBFGS-B has bugs: while iterating it sometimes generates NaN values and corrupts the computation. I added some log printing to help debug; here is a small fragment showing how LBFGS-B corrupts **(robust regression w/o intercept w/ regularization test)**:

costFun: sigma param: 1.0 huberAggrLoss + reg: 18262.68068379334 cost grad- sigma: -630.1789355384457
costFun: sigma param: 631.1789355384457 huberAggrLoss + reg: 1.256602668595641E7 cost grad- sigma: -466.0711286869664
costFun: sigma param: 64.01789355384457 huberAggrLoss + reg: 483796.45119015244 cost grad- sigma: -448.1113667824356
costFun: sigma param: 9.849995439060637 huberAggrLoss + reg: 44154.79971484518 cost grad- sigma: -275.5999029061156
costFun: sigma param: 3.2447088269560513 huberAggrLoss + reg: 8513.171279631315 cost grad- sigma: -5.737776191290681
**costFun: sigma param: NaN huberAggrLoss + reg: NaN** cost grad- sigma: -822.49944

As shown above, once the sigma parameter becomes NaN during iteration, LBFGS-B has corrupted and there is no point continuing. When I traced LBFGS-B I found that the `LBFGSB.subspaceMinimization` method may cause the output point to become (NaN, NaN, ...) even when its input is OK, so I think it is a bug in `LBFGSB.subspaceMinimization`. I think this problem has no work-around and needs the Breeze community to fix it. The second problem is whether the loss should be divided by N and whether the L2 reg should be divided by 2; I think it should stay consistent with the other GLM algorithms in mllib.
[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen Oh, I missed your comment about the loop braces; they are added now, thanks!
[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 cc @jkbradley Thanks!
[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14216 @srowen yeah, I have pushed "some minor update": https://github.com/apache/spark/pull/14216/commits/362074187d8845eeb40452eceec10f7e8ad805df
[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] unused broadcast variabl...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14333 [SPARK-16696][ML][MLLib] unused broadcast variables do destroy call to release memory in time

## What changes were proposed in this pull request?
Update the unused broadcast variables in KMeans/LDAOptimizer/Word2Vec to call `destroy(false)` so that memory is released promptly. Also update several `destroy()` calls to `destroy(false)` so that they are called asynchronously, which is better than a blocking call.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark broadvar_unpersist_to_destroy
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14333.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14333
commit 52afc038c79ab8176bf760d65793e8d5f94d4d4a
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-07-20T14:12:41Z
broadvar_unpersist_to_destroy
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14335 [SPARK-16697][ML][MLLib] improve LDA submitMiniBatch method to avoid redundant RDD computation

## What changes were proposed in this pull request?
In `LDAOptimizer.submitMiniBatch`, persist `stats: RDD[(BDM[Double], List[BDV[Double]])]`, and move the unpersist of the `expElogbetaBc` broadcast variable so that it is not unpersisted too early. Also update the previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)`.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark improve_LDA
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14335.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14335
commit e5ed33b559a04215c784d0d81a1578d3f13d8804
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-07-20T16:39:27Z
improve_LDA
[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14265 @srowen We can check python/ml/tests.py: the `VectorTests.test_serialize` function contains a test for `SparseMatrix` serialization/deserialization, so we can confirm that this works.
[GitHub] spark pull request #14440: [SPARK-16835][ML] add training data unpersist han...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14440 [SPARK-16835][ML] add training data unpersist handling when throw exception

## What changes were proposed in this pull request?
In `LinearRegression`, `LogisticRegression`, and `AFTSurvivalRegression`, modify the `unpersist` call on the input training `Dataset` to make sure that `unpersist` is called even when an exception is thrown.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark glm_traindata_unpersist_on_exception
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14440.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14440
commit 1a366e6d382ea1e5156ae5d75c6600a2d414295e
Author: WeichenXu <weichenxu...@outlook.com>
Date: 2016-07-26T07:25:01Z
glm_traindata_unpersist_on_exception
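The pattern proposed here can be sketched generically; a minimal Python illustration with hypothetical `ToyDataset`/`fit` names (not Spark's classes), using try/finally so the cached input is released even when training throws:

```python
class ToyDataset:
    """Toy stand-in for a cacheable training dataset (not Spark's API)."""
    def __init__(self):
        self.is_cached = False
    def cache(self):
        self.is_cached = True
    def unpersist(self):
        self.is_cached = False

def fit(dataset, train):
    # Cache the training data, and guarantee via try/finally that it is
    # unpersisted even if training raises an exception.
    handle_persistence = not dataset.is_cached
    if handle_persistence:
        dataset.cache()
    try:
        return train(dataset)
    finally:
        if handle_persistence:
            dataset.unpersist()

def failing_train(dataset):
    raise RuntimeError("training failed")

ds = ToyDataset()
try:
    fit(ds, failing_train)
except RuntimeError:
    pass
print(ds.is_cached)  # False -- released despite the exception
```

Without the try/finally, an exception in `train` would leave the dataset cached indefinitely, which is the leak this PR addresses.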
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72013619

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
         gammaPart = gammad :: gammaPart
       }
       Iterator((stat, gammaPart))
-    }
+    }.persist(StorageLevel.MEMORY_AND_DISK)
     val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
       _ += _, _ += _)
-    expElogbetaBc.unpersist()
     val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
       stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*)
--- End diff --

`stats.map(_._2).flatMap(list => list).map(_.toDenseMatrix).collect()` would also work, but I think the two ways have no big difference in efficiency, because `stats.map(_._2).flatMap(list => list)` already produces an RDD[DenseVector]. Does serializing a DenseVector versus a one-row DenseMatrix make a big difference?
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen `KMeans.initKMeansParallel` already implements the pattern "persist the current step's RDD, unpersist the previous one", but a persisted RDD can still be lost to a disk error or similar failure. If we want to reach the goal that "the broadcast can safely be disposed (`broadcast.destroy`) at each iteration too, rather than only at the end", I think we need `RDD.checkpoint` instead of `RDD.persist`?
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the code around the KMeans `bcNewCenters` again. If we want to make sure RDD recovery succeeds in any unexpected case, we have to keep every `bcNewCenters` generated in each loop iteration until the loop is done: each iteration builds `costs: RDD` from `preCosts: RDD`, so the RDDs form a lineage chain, and the loop only keeps the latest two RDDs persisted. If those two RDDs are lost, Spark has to rebuild them starting from the first RDD, and in that case every historical `bcNewCenters` will be needed.
[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/14335#discussion_r72014278 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer { gammaPart = gammad :: gammaPart } Iterator((stat, gammaPart)) -} +}.persist(StorageLevel.MEMORY_AND_DISK) val statsSum: BDM[Double] = stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))( _ += _, _ += _) -expElogbetaBc.unpersist() val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat( stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix): _*) --- End diff -- Hmm... `DenseVector.toDenseMatrix` has to copy the vector's whole buffer, so there may be some performance impact if all of those conversions happen on the driver side.
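The copy cost discussed here can be illustrated without Breeze; a rough plain-Scala analog (the helper name is made up for illustration — Breeze's actual `toDenseMatrix` also tracks rows, cols, and strides):

```scala
// Rough analog of DenseVector.toDenseMatrix: producing a one-row matrix
// copies the vector's whole backing buffer rather than sharing it.
def toOneRowMatrix(v: Array[Double]): Array[Array[Double]] =
  Array(v.clone()) // this full-buffer copy is the cost discussed above

val v = Array(1.0, 2.0, 3.0)
val m = toOneRowMatrix(v)
m(0)(0) = 99.0 // mutating the matrix does not touch the original vector
```

Doing this conversion before `collect()` spreads the copies across executors; after `collect()`, every copy runs serially on the driver.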
[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14333 @srowen I checked the code again. The problem I mentioned above ("But now I found another problem in BisectingKMeans: at line 191 there is an iteration that also needs this pattern 'persist current step RDD, unpersist previous'") is NOT a problem — I misread the code. I also re-checked the other modifications in this PR and found no problems, so I think the PR is OK now. Thanks!
[GitHub] spark pull request #14604: [Doc] add config option spark.ui.enabled into doc...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14604 [Doc] add config option spark.ui.enabled into document ## What changes were proposed in this pull request? The configuration doc is missing the config option `spark.ui.enabled` (default value `true`). I think this option is important because in many cases we want to turn the UI off, so I added it. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark add_doc_param_spark_ui_enabled Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14604.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14604 commit 93fa7f7816017628cc5b898c1d7c6acf385fcde0 Author: WeichenXu <weichenxu...@outlook.com> Date: 2016-08-10T17:19:36Z add config doc option: spark.ui.enabled
[GitHub] spark issue #14483: [SPARK-16880][ML][MLLib] make ann training data persiste...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14483 @srowen Yeah, the other algorithms using LBFGS all follow this pattern; only ANN was missing it.
[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 Yeah, currently it adds a little overhead (an extra copy), but I think it will benefit from future Breeze optimizations, e.g. SIMD instructions.
[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 @srowen The `:=` operator on `BDM` simply copies one `BDM` into another, and it is widely used in the Breeze source. For example, `DenseMatrix.copy` in Breeze first uses `DenseMatrix.create` to create a new matrix with the same dimensions (`val result = DenseMatrix.create(...)`), and then uses `result := this` to copy itself into the newly created matrix. The `:=` operator works for `DenseMatrix` because it implements the `OpSet` trait: in the Breeze `DenseMatrix` source, around line 985, there is implicit val setMV_D:OpSet.InPlaceImpl2[...] = new SetDMDVOp[Double]() so the implementation lives in the `SetDMDVOp` class, and we can see that `SetDMDVOp` is type-specialized for `Double`, so the compiled code is efficient.
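Conceptually, for dense, contiguous matrix data the `OpSet` copy boils down to a bulk copy of the backing array; a hedged plain-Scala analog of what `dst := src` does (the real Breeze implementation additionally handles strides, offsets, and transposed layouts):

```scala
// Plain-Scala analog of Breeze's `dst := src` for contiguous DenseMatrix
// data: copy the source's backing array into the destination's backing
// array in place, allocating nothing new.
val src = Array(1.0, 2.0, 3.0, 4.0)      // backing data of a 2x2 matrix
val dst = new Array[Double](src.length)   // pre-existing destination buffer
System.arraycopy(src, 0, dst, 0, src.length) // in-place bulk copy
```

This is why `:=` is cheap relative to operations that must allocate: the destination buffer already exists and only its contents are overwritten.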
[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14156 @srowen Yeah, the function supplied here cannot be turned into SIMD instructions, but I think it could still allow some parallelization on large matrices — for example, splitting the matrix into several blocks and running the in-place transform on them in parallel — although Breeze does not do that currently.
[GitHub] spark pull request #14483: [SPARK-16880][ML][MLLib] make ann training data p...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/14483 [SPARK-16880][ML][MLLib] make ann training data persisted if needed ## What changes were proposed in this pull request? Make sure the ANN layer's input training data is persisted, so that we avoid the overhead of recomputing the RDD from its lineage. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark add_ann_persist_training_data Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14483.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14483 commit 0bfece8fea0599a4a4493d6a6822d0c6e542745b Author: WeichenXu <weichenxu...@outlook.com> Date: 2016-08-03T12:11:19Z add ann persist training data if needed
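The "persist if needed" convention this PR applies can be sketched without Spark; in the sketch below, `Cacheable` is a stand-in for an RDD's storage state (not Spark API), while `handlePersistence` is the flag name conventionally used for this pattern in MLlib trainers:

```scala
// Spark-free analog of the persist-if-needed convention: persist only when
// the input is not already cached, and unpersist only what we persisted
// ourselves, so a caller's own caching is left untouched.
final class Cacheable {
  var cached = false
  def persist(): Unit = cached = true
  def unpersist(): Unit = cached = false
}

def train(data: Cacheable): Unit = {
  val handlePersistence = !data.cached // only manage storage we own
  if (handlePersistence) data.persist()
  // ... iterative training would read `data` many times here ...
  if (handlePersistence) data.unpersist()
}

val fresh = new Cacheable        // trainer persists and then cleans up
train(fresh)
val preCached = new Cacheable    // caller cached it; trainer must not unpersist
preCached.persist()
train(preCached)
```

In real MLlib code the check is typically against `StorageLevel.NONE` rather than a boolean, but the ownership rule is the same.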
[GitHub] spark issue #14629: [WIP][SPARK-17046][SQL] prevent user using dataframe.sel...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14629 MySQL does not allow SELECT with zero columns, and I think `select()` with no arguments is useless — no one would do such an operation on purpose. So would it be better to generate a compile error when code uses `df.select()`, since it is usually a coding mistake? Or is `df.select()` useful in some particular scenario?
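A runtime version of the guard discussed here could look like the following sketch (an illustrative standalone function — not the actual `Dataset.select` signature, and not necessarily the mechanism the PR proposes, which aims at a compile-time error):

```scala
// Illustrative guard: reject an empty column list at call time.
def select(cols: String*): Seq[String] = {
  require(cols.nonEmpty, "select() requires at least one column")
  cols
}

val ok = select("a", "b") // normal call passes through
val rejected =
  try { select(); false } // zero-column call is refused
  catch { case _: IllegalArgumentException => true }
```

A compile-time rejection would instead require a non-varargs first parameter, e.g. `select(first: String, rest: String*)`, which is how some APIs make the empty call unrepresentable.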
[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14520 @yanboliang Thanks for the careful review!
[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14520 @sethah I attached the test result and it looks good.
[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14520 cc @yanboliang Thanks!