[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439 **[Test build #82481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82481/testReport)** for PR 19439 at commit [`0e47b6c`](https://github.com/apache/spark/commit/0e47b6c906afa1589bcb3ee9af87b4833f90be64).
[GitHub] spark issue #19392: [SPARK-22169][SQL] support byte length literal as identi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19392 OK
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 Thanks for your comments/suggestions @MLnick and @sethah. Working on incorporating these.
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Merged build finished. Test FAILed.
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439

**[Test build #82480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82480/testReport)** for PR 19439 at commit [`22baf02`](https://github.com/apache/spark/commit/22baf022b2f109bb1f5eba0b13ea34de894cd14c).

* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class SamplePathFilter extends Configured with PathFilter`
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82480/ Test FAILed.
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439 **[Test build #82480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82480/testReport)** for PR 19439 at commit [`22baf02`](https://github.com/apache/spark/commit/22baf022b2f109bb1f5eba0b13ea34de894cd14c).
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r143001025

--- Diff: R/pkg/tests/fulltests/test_sparkSQL.R ---
@@ -2538,7 +2538,7 @@ test_that("describe() and summary() on a DataFrame", {
   stats2 <- summary(df)
   expect_equal(collect(stats2)[5, "summary"], "25%")
-  expect_equal(collect(stats2)[5, "age"], "30")
+  expect_equal(collect(stats2)[5, "age"], "19")
--- End diff --

Also looks more logical given the input contains values 19 and 30 only.
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r143000567

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -43,7 +43,7 @@ class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with Default
   (0, 1.0, 1.0, 1.0),
   (1, 3.0, 3.0, 3.0),
   (2, Double.NaN, Double.NaN, Double.NaN),
-  (3, -1.0, 2.0, 3.0)
+  (3, -1.0, 2.0, 1.0)
--- End diff --

Did this have to change as a result? Just checking it's intentional.
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r142999631

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala ---
@@ -129,7 +144,7 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSQLContext {
   withTempView(table) {
     (1 to 1000).toDF("col").createOrReplaceTempView(table)
     checkAnswer(
-      spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 800D) FROM $table"),
+      spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 8000D) FROM $table"),
--- End diff --

I recall that without the change the answer was "499", which is also really close, so I think this is fine.
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r143000448

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1038,8 +1038,8 @@ def summary(self, *statistics):
         |  mean|               3.5| null|
         |stddev|2.1213203435596424| null|
         |   min|                 2|Alice|
-        |   25%|                 5| null|
-        |   50%|                 5| null|
+        |   25%|                 2| null|
--- End diff --

Although this looks like a big change, the test data set has only two data elements, with values 2 and 5, so these are pretty equally valid. It's probably more logical that the 25% percentile is 2 if 75% is 5.
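For context, a minimal PySpark sketch of the scenario discussed above (the two-row data set comes from the doctest in the diff; which value each percentile resolves to depends on the Spark version, which is exactly the point under review):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# The doctest data set has only two rows, so every percentile must
# resolve to either 2 or 5.
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df.summary("min", "25%", "50%", "75%", "max").show()
```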
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
GitHub user imatiach-msft opened a pull request: https://github.com/apache/spark/pull/19439

[SPARK-21866][ML][PySpark] Adding spark image reader

## What changes were proposed in this pull request?

Adding spark image reader, an implementation of schema for representing images in spark DataFrames. The code is taken from the spark package located here: (https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information: (http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation

As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

## How was this patch tested?

Unit tests in scala ImageSchemaSuite, unit tests in python

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/imatiach-msft/spark ilmat/spark-images

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19439.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19439

commit 22baf022b2f109bb1f5eba0b13ea34de894cd14c
Author: Ilya Matiach
Date: 2017-10-04T21:10:26Z

    [SPARK-21866][ML][PySpark] Adding spark image reader
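As a hedged sketch of how the reader might be used from PySpark once merged: the `pyspark.ml.image.ImageSchema.readImages` API and the image struct fields below are taken from what eventually shipped in Spark 2.3 and are assumptions at the time of this PR:

```python
from pyspark.ml.image import ImageSchema  # module name as shipped in Spark 2.3

# Load a directory of images into a DataFrame with a single "image" struct
# column (origin, height, width, nChannels, mode, data).
images = ImageSchema.readImages("data/mllib/images", recursive=True)
images.select("image.origin", "image.height", "image.width").show(truncate=False)
```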
[GitHub] spark pull request #19420: [SPARK-22191] [SQL] Add hive serde example with s...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19420#discussion_r142999706

--- Diff: examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java ---
@@ -124,6 +124,13 @@ public static void main(String[] args) {
   // ...
   // $example off:spark_hive$

+      // Hive serde's are also supported with serde properties.
+      String sqlQuery = "CREATE TABLE src_serde(key decimal(38,18), value int) USING hive"
--- End diff --

Hi, @crlalam. We use 2-space indentation in general. FYI, maybe you can see the [Scala Coding Style](https://github.com/databricks/scala-style-guide).
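For reference, a sketch of the statement under review as it could be run from PySpark. The diff truncates the actual statement, so the OPTIONS clause below (`fileFormat`, `fieldDelim`) is an assumed illustration based on the serde options Spark documents for Hive tables; a Hive-enabled session is also assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Serde properties for a Hive table are passed through the OPTIONS clause.
spark.sql("""
    CREATE TABLE src_serde(key DECIMAL(38,18), value INT) USING hive
    OPTIONS(fileFormat 'textfile', fieldDelim ',')
""")
```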
[GitHub] spark issue #19061: [SPARK-21568][CORE] ConsoleProgressBar should only be en...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19061 Could you review this `ConsoleProgressBar` PR again, @vanzin?
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test PASSed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82477/ Test PASSed.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732

**[Test build #82477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82477/testReport)** for PR 18732 at commit [`f572385`](https://github.com/apache/spark/commit/f572385e28a1ccd2f8663adf64910d5f0a0ce67c).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Wor...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17673#discussion_r142991123

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -171,20 +210,46 @@ final class Word2Vec @Since("1.4.0") (
   @Since("2.0.0")
   def setMaxSentenceLength(value: Int): this.type = set(maxSentenceLength, value)

+  /** @group setParam */
+  @Since("2.2.0")
+  val solvers = Set("sg-hs", "cbow-ns")
--- End diff --

Yeah, for reference you can just look at how linear regression does the `supportedSolvers`. Also, the require isn't necessary; you can just use `ParamValidators.inArray[String](supportedSolvers)`.
[GitHub] spark pull request #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Wor...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/17673#discussion_r142990145

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -171,20 +210,46 @@ final class Word2Vec @Since("1.4.0") (
   @Since("2.0.0")
   def setMaxSentenceLength(value: Int): this.type = set(maxSentenceLength, value)

+  /** @group setParam */
+  @Since("2.2.0")
+  val solvers = Set("sg-hs", "cbow-ns")
--- End diff --

"skipgram-hierarchical softmax"
[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17357 Merged build finished. Test PASSed.
[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17357 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82476/ Test PASSed.
[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17357

**[Test build #82476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82476/testReport)** for PR 17357 at commit [`b188cc9`](https://github.com/apache/spark/commit/b188cc9a9e290683210d3c4a6841d37ca00b112f).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17774: [SPARK-18371][Streaming] Spark Streaming backpressure ge...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17774 Can one of the admins verify this patch?
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon thanks!
[GitHub] spark pull request #19406: [SPARK-22179] percentile_approx should choose the...
Github user wzhfy closed the pull request at: https://github.com/apache/spark/pull/19406
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19438#discussion_r142981865

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala ---
@@ -129,7 +144,7 @@ class ApproximatePercentileQuerySuite extends QueryTest with SharedSQLContext {
   withTempView(table) {
     (1 to 1000).toDF("col").createOrReplaceTempView(table)
     checkAnswer(
-      spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 800D) FROM $table"),
+      spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 8000D) FROM $table"),
--- End diff --

Here, we fix the test case by increasing the accuracy.
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19438 cc @srowen @jiangxb1987 @HyukjinKwon
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Ah, that's fine :). It was just an option. I will follow the discussion and help sort it out in any event.
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon These two JIRAs change percentile_approx in different ways, so maybe it's better to use different JIRAs?
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon uh... just saw this, already created a new [JIRA](url) and [PR](https://github.com/apache/spark/pull/19438), is it also ok?
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19438 **[Test build #82479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82479/testReport)** for PR 19438 at commit [`f2b1538`](https://github.com/apache/spark/commit/f2b153800ebdf10999d4a8bb3578101a12f6d631).
[GitHub] spark pull request #19438: [SPARK-22208] [SQL] Improve percentile_approx by ...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19438

[SPARK-22208] [SQL] Improve percentile_approx by not rounding up targetError and starting from index 0

## What changes were proposed in this pull request?

Currently percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.

For example, given input data 1 to 10, if a user queries the 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2.

Based on the paper, targetError is not rounded up, and the searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.

## How was this patch tested?

Added a new test case and fixed existing test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark improve_percentile_approx

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19438.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19438

commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang
Date: 2017-09-30T16:04:32Z

    return the first element for small percentage

commit 8c8c22dbebe99def6127b49988dfc4f886797bd6
Author: Zhenhua Wang
Date: 2017-10-02T10:24:28Z

    fix test

commit dbc3d47b0a56113032d2a4565180932e4ef26219
Author: Zhenhua Wang
Date: 2017-10-02T14:53:04Z

    fix test

commit 9815ce8e17e34422f8c915d115061a9635abd119
Author: Zhenhua Wang
Date: 2017-10-03T14:51:55Z

    fix pyspark test

commit f2b153800ebdf10999d4a8bb3578101a12f6d631
Author: Zhenhua Wang
Date: 2017-10-05T15:47:27Z

    follow the paper and fix sparkR test
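A minimal PySpark sketch of the behavior the description calls out (the data and query come from the example above; whether 1 or 2 comes back depends on which side of this change you run it on):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.range(1, 11).toDF("col").createOrReplaceTempView("t")

# With 10 elements, the 10% percentile already falls on the first value,
# so ideally this returns 1; before this change it returned 2.
spark.sql("SELECT percentile_approx(col, 0.1) FROM t").show()
```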
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Oh, optionally, we can just edit the JIRA I guess.
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen @jiangxb1987 OK, I'll close this JIRA and create a new one as an improvement instead of a bugfix.
[GitHub] spark issue #19090: [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19090 @felixcheung, this one LGTM as I checked all that I could and am quite confident; however, I will leave this open for a few more days given its importance. Let me cc you here to double check when you have some time, or leave some comments if you have any concerns.
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19041 **[Test build #82478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82478/testReport)** for PR 19041 at commit [`985874d`](https://github.com/apache/spark/commit/985874da9f72a942d1a28f413167ab3b7fcc64e6).
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142961120

--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):

 def wrap_pandas_udf(f, return_type):
-    arrow_return_type = toArrowType(return_type)
-
-    def verify_result_length(*a):
-        result = f(*a)
-        if not hasattr(result, "__len__"):
-            raise TypeError("Return type of pandas_udf should be a Pandas.Series")
-        if len(result) != len(a[0]):
-            raise RuntimeError("Result vector from pandas_udf was not the required length: "
-                               "expected %d, got %d" % (len(a[0]), len(result)))
-        return result
-    return lambda *a: (verify_result_length(*a), arrow_return_type)
+    if isinstance(return_type, StructType):
--- End diff --

Yes, will do.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732

@HyukjinKwon Thanks for the summary!

* https://github.com/apache/spark/pull/18732#discussion_r142735696 (`ArrowPandasSerializer`) I will spend some time addressing this today.
* https://github.com/apache/spark/pull/18732#issuecomment-333065737 (Breaking into two pandas udf APIs) I think this is addressed here: https://github.com/apache/spark/pull/18732#discussion_r141830344. But I am happy to discuss more.
* https://github.com/apache/spark/pull/18732#issuecomment-26073 (API naming) I will wait on feedback here.
[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19369

**[Test build #3942 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3942/testReport)** for PR 19369 at commit [`d996c28`](https://github.com/apache/spark/commit/d996c283602269afd05dffad1e681f47f7baf47f).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19399: [SPARK-22175][WEB-UI] Add status column to histor...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/19399#discussion_r142959826

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -850,6 +869,18 @@ private[history] class AppListingListener(log: FileStatus, clock: Clock) extends
       fileSize)
   }

+    def applicationStatus: Option[String] = {
+      if (startTime.getTime == -1) {
+        Some("")
+      } else if (endTime.getTime == -1) {
+        Some("")
+      } else if (jobToStatus.isEmpty || jobToStatus.exists(_._2 != "Succeeded")) {
--- End diff --

Also, I dunno if this criteria is even accurate. You could have a successful app that doesn't run any jobs -- e.g., it's kicked off by cron regularly, and then it checks some metadata to see if any work needs to be done, and if not, it just quits. Doesn't seem right to call it "failed". In progress is also tricky, as the app may have been killed without endTime getting written. Anyway, I guess this is OK, just pointing out some reasons why this can be misleading. In particular, I think it would be nicer if Spark actually logged whether or not the app was successful.
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19436 Thanks @HyukjinKwon @felixcheung
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142957552

--- Diff: python/pyspark/sql/functions.py ---
@@ -2058,7 +2058,7 @@ def __init__(self, func, returnType, name=None, vectorized=False):
         self._name = name or (
             func.__name__ if hasattr(func, '__name__')
             else func.__class__.__name__)
-        self._vectorized = vectorized
+        self.vectorized = vectorized
--- End diff --

Are we ok with having `vectorized` be a public field? I am fine with either public or private, but I do think the fields of the function returned by `UserDefinedFunction._wrapped()` should have the same field names as `UserDefinedFunction` to avoid confusion.
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19436
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19436 Merged to master, branch-2.2 and branch-2.1.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142956597

--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col, values)
         return GroupedData(jgd, self.sql_ctx)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-function should take a `pandas.DataFrame` and return another `pandas.DataFrame`.
+        Each group is passed as a `pandas.DataFrame` to the user-function and the returned
+        `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned `pandas.DataFrame`
+        can be arbitrary length and its schema should match the returnType of the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
+        +---+-------------------+
+        | id|                  v|
+        +---+-------------------+
+        |  1|-0.7071067811865475|
+        |  1| 0.7071067811865475|
+        |  2|-0.8320502943378437|
+        |  2|-0.2773500981126146|
+        |  2| 1.1094003924504583|
+        +---+-------------------+
+
+        .. seealso:: :meth:`pyspark.sql.functions.pandas_udf`
+
+        """
+        from pyspark.sql.functions import pandas_udf
+
+        # Columns are special because hasattr always return True
+        if isinstance(udf, Column) or not hasattr(udf, 'func') or not udf.vectorized:
+            raise ValueError("The argument to apply must be a pandas_udf")
+        if not isinstance(udf.returnType, StructType):
+            raise ValueError("The returnType of the pandas_udf must be a StructType")
+
+        df = DataFrame(self._jgd.df(), self.sql_ctx)
+        func = udf.func
+        returnType = udf.returnType
+
+        # The python executors expects the function to take a list of pd.Series as input
+        # So we to create a wrapper function that turns that to a pd.DataFrame before passing
+        # down to the user function
+        columns = df.columns
+
+        def wrapped(*cols):
+            import pandas as pd
+            return func(pd.concat(cols, axis=1, keys=columns))
--- End diff --

@BryanCutler yeah, I was trying to do that earlier, but unfortunately the column names are lost on the worker, so we cannot construct the `pandas.DataFrame` on the worker. I think the best place to define the wrap function is probably on the pyspark driver side, because we have the most information there. However, that requires some refactoring. I will give it a try and see how that goes.
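A small standalone sketch of the `pd.concat` trick discussed here, showing how a list of `pandas.Series` (one per column, as the worker provides them) is reassembled into a `pandas.DataFrame` with the original column names; plain pandas, no Spark required:

```python
import pandas as pd

columns = ["id", "v"]

def make_wrapped(func):
    # The worker hands the UDF one pd.Series per column; pd.concat with
    # keys= rebuilds a DataFrame whose columns carry the original names.
    def wrapped(*cols):
        return func(pd.concat(cols, axis=1, keys=columns))
    return wrapped

normalize = make_wrapped(lambda pdf: pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std()))
print(normalize(pd.Series([1, 1, 2]), pd.Series([1.0, 2.0, 3.0])))
```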
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142952213

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Not sure... I think what you know is what I usually do.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18732

Ongoing discussions that (I think) might block this PR:

- https://github.com/apache/spark/pull/18732#discussion_r142735696 by @BryanCutler: `ArrowPandasSerializer` able to serialize pandas.DataFrames
- https://github.com/apache/spark/pull/18732#issuecomment-333065737 by @viirya: breaking this definition into two (grouping and normal udfs).
- https://github.com/apache/spark/pull/18732#issuecomment-26073 by @rxin and answer https://github.com/apache/spark/pull/18732#issuecomment-333432266 by @icexelloss: naming suggestion
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82477/testReport)** for PR 18732 at commit [`f572385`](https://github.com/apache/spark/commit/f572385e28a1ccd2f8663adf64910d5f0a0ce67c).
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142949557

--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):

 def wrap_pandas_udf(f, return_type):
-    arrow_return_type = toArrowType(return_type)
-
-    def verify_result_length(*a):
-        result = f(*a)
-        if not hasattr(result, "__len__"):
-            raise TypeError("Return type of pandas_udf should be a Pandas.Series")
-        if len(result) != len(a[0]):
-            raise RuntimeError("Result vector from pandas_udf was not the required length: "
-                               "expected %d, got %d" % (len(a[0]), len(result)))
-        return result
-    return lambda *a: (verify_result_length(*a), arrow_return_type)
+    if isinstance(return_type, StructType):
+        arrow_return_types = [to_arrow_type(field.dataType) for field in return_type]
+
+        def fn(*a):
--- End diff --

Yes, I will change the name to something more descriptive.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142949179

--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):

 def wrap_pandas_udf(f, return_type):
-    arrow_return_type = toArrowType(return_type)
-
-    def verify_result_length(*a):
-        result = f(*a)
-        if not hasattr(result, "__len__"):
-            raise TypeError("Return type of pandas_udf should be a Pandas.Series")
-        if len(result) != len(a[0]):
-            raise RuntimeError("Result vector from pandas_udf was not the required length: "
-                               "expected %d, got %d" % (len(a[0]), len(result)))
-        return result
-    return lambda *a: (verify_result_length(*a), arrow_return_type)
+    if isinstance(return_type, StructType):
--- End diff --

Yea, let's add some comments and throw a better exception. For example, I think we should clarify in the exception message that `StructType` should be used in grouping udfs only.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142948551

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Ahh... Thanks! Will give it a try. Still, is there an easier way to run the pyspark tests locally (the way jenkins runs them)?
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142948307

--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):

 def wrap_pandas_udf(f, return_type):
-    arrow_return_type = toArrowType(return_type)
-
-    def verify_result_length(*a):
-        result = f(*a)
-        if not hasattr(result, "__len__"):
-            raise TypeError("Return type of pandas_udf should be a Pandas.Series")
-        if len(result) != len(a[0]):
-            raise RuntimeError("Result vector from pandas_udf was not the required length: "
-                               "expected %d, got %d" % (len(a[0]), len(result)))
-        return result
-    return lambda *a: (verify_result_length(*a), arrow_return_type)
+    if isinstance(return_type, StructType):
+        arrow_return_types = [to_arrow_type(field.dataType) for field in return_type]
+
+        def fn(*a):
--- End diff --

Yea, but `fn` looks like a no-no... do you maybe have an idea for a better name?
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142947514

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Also, it looks like this file does not define `spark` as a global that is used in doctests. I think we should add something like...

```diff
     sc = spark.sparkContext
     globs['sc'] = sc
+    globs['spark'] = spark
     globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
```
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142946504

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Probably, importing `pandas_udf` should solve the problem I guess.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142946430

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

I think the problem is that `pandas_udf` is unimportable in this doctest. To my knowledge, `# doctest: +SKIP` is per line.
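A tiny standalone illustration of the point that `# doctest: +SKIP` is a per-example directive (plain `doctest`, independent of Spark):

```python
import doctest

def f():
    """
    >>> 1 + 1  # doctest: +SKIP
    3
    >>> 2 + 2
    4
    """

# The SKIP directive applies only to the first example; the second one
# still runs (and passes, since 2 + 2 == 4).
doctest.run_docstring_examples(f, {}, verbose=True)
```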
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142945465

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

I have been using

```
bin/pyspark pyspark.sql.tests GroupbyApplyTests
```

But this doesn't seem to do doctest.
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142944123

--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
             jgd = self._jgd.pivot(pivot_col)
         else:
             jgd = self._jgd.pivot(pivot_col, values)
-        return GroupedData(jgd, self.sql_ctx)
+        return GroupedData(jgd, self._df)
+
+    def apply(self, udf):
+        """
+        Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result
+        as a :class:`DataFrame`.
+
+        The user-defined function should take a `pandas.DataFrame` and return another
+        `pandas.DataFrame`. Each group is passed as a `pandas.DataFrame` to the user-function and
+        the returned `pandas.DataFrame` are combined as a :class:`DataFrame`. The returned
+        `pandas.DataFrame` can be arbitrary length and its schema should match the returnType of
+        the pandas udf.
+
+        :param udf: A wrapped function returned by `pandas_udf`
+
+        >>> df = spark.createDataFrame(
+        ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
+        ...     ("id", "v"))
+        >>> @pandas_udf(returnType=df.schema)
+        ... def normalize(pdf):
+        ...     v = pdf.v
+        ...     return pdf.assign(v=(v - v.mean()) / v.std())
+        >>> df.groupby('id').apply(normalize).show()  # doctest: +SKIP
--- End diff --

Seems this is still not skipped by doctest. What's the best way to run pyspark tests locally? I tried

```
./run-tests --modules=pyspark-sql --parallelism=4
```

But it's giving me a different failure.
[GitHub] spark issue #19389: [SPARK-22165][SQL] Resolve type conflicts between decima...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19389 ping @gatorsmile
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436

**[Test build #82474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82474/testReport)** for PR 19436 at commit [`71bf813`](https://github.com/apache/spark/commit/71bf813a4375a5736f903bffb3b17a29d2928d56).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test PASSed.
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82474/ Test PASSed.
[GitHub] spark issue #17357: [SPARK-20025][CORE] Ignore SPARK_LOCAL* env, while deplo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17357 **[Test build #82476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82476/testReport)** for PR 17357 at commit [`b188cc9`](https://github.com/apache/spark/commit/b188cc9a9e290683210d3c4a6841d37ca00b112f).
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user susanxhuynh commented on the issue: https://github.com/apache/spark/pull/19437 @ArtRand @skonto Please review. Tests passed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82475/ Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437

**[Test build #82475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82475/testReport)** for PR 19437 at commit [`6f062c0`](https://github.com/apache/spark/commit/6f062c00f6382d266619b4a56a753ec27d1db10b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19437 Merged build finished. Test PASSed.
[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19437 **[Test build #82475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82475/testReport)** for PR 19437 at commit [`6f062c0`](https://github.com/apache/spark/commit/6f062c00f6382d266619b4a56a753ec27d1db10b).
[GitHub] spark pull request #19437: [SPARK-22131][MESOS] Mesos driver secrets
GitHub user susanxhuynh opened a pull request: https://github.com/apache/spark/pull/19437

[SPARK-22131][MESOS] Mesos driver secrets

## Background

In #18837, @ArtRand added Mesos secrets support to the dispatcher. **This PR is to add the same secrets support to the drivers.** This means if the secret configs are set, the driver will launch executors that have access to either env or file-based secrets. One use case for this is to support TLS in the driver <=> executor communication.

## What changes were proposed in this pull request?

Most of the changes are a refactor of the dispatcher secrets support (#18837) - moving it to a common place that can be used by both the dispatcher and drivers. The same goes for the unit tests.

## How was this patch tested?

There are four config combinations: [env or file-based] x [value or reference secret]. For each combination:
- Added a unit test.
- Tested in DC/OS.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mesosphere/spark sh-mesos-driver-secret

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19437.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19437

commit b289bcc95f0b67cda94ddf416fc9a15e5d1855b4
Author: Susan X. Huynh
Date: 2017-10-04T11:30:31Z

    [SPARK-22131] Mesos driver secrets. The driver launches executors that have access to env or file-based secrets.

commit 6f062c00f6382d266619b4a56a753ec27d1db10b
Author: Susan X. Huynh
Date: 2017-10-05T12:07:20Z

    [SPARK-22131] Updated docs
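As a rough sketch of what the four combinations look like in practice: the property names below (`spark.mesos.driver.secret.*`) are the ones the Mesos docs eventually used and are an assumption here, since this PR was still under review:

```
# Reference secret exposed to the containers as an environment variable:
spark.mesos.driver.secret.names=/mysecret        # path of the secret in the Mesos secret store
spark.mesos.driver.secret.envkeys=MY_SECRET      # env var name inside the container

# Value secret mounted as a file in the container sandbox:
spark.mesos.driver.secret.values=s3cr3t
spark.mesos.driver.secret.filenames=topsecret.txt
```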
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924

Thank you, @hhbyyh. I have augmented the example a bit: explicitly set the random seed and chosen the online optimizer:

`val lda = new LDA().setK(10).setMaxIter(10).setOptimizer("online").setSeed(13)`

But for some reason, if I run it twice, the results are not the same. Is that expected? branch-2.2 was used.
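For reference, a rough PySpark equivalent of the Scala snippet above, fitting the same model twice to check reproducibility (parameter names per `pyspark.ml.clustering.LDA`; the toy data is made up for illustration):

```python
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 0.0, 3.0])), (1, Vectors.dense([2.0, 1.0, 0.0]))],
    ["id", "features"])

lda = LDA(k=2, maxIter=10, optimizer="online", seed=13)

# With a fixed seed, two fits would ideally produce identical topics.
same = (lda.fit(df).describeTopics().collect()
        == lda.fit(df).describeTopics().collect())
print(same)
```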
[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19369 **[Test build #3942 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3942/testReport)** for PR 19369 at commit [`d996c28`](https://github.com/apache/spark/commit/d996c283602269afd05dffad1e681f47f7baf47f).
[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/19369 I've fixed the failing `DiskStoreSuite` and ensured the other two suites also pass fine.
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82474/testReport)** for PR 19436 at commit [`71bf813`](https://github.com/apache/spark/commit/71bf813a4375a5736f903bffb3b17a29d2928d56).
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19436#discussion_r142903183 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", { df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x }) expect_identical(df1Collect, expected) + # gapply on empty grouping columns. + dfTwoPartition <- repartition(df, 2L) + df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, schema(dfTwoPartition)) + expect_identical(sort(collect(df1TwoPartition)), sort(expected)) --- End diff -- Ok. Let me use your test code. I don't want to block this PR. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19436#discussion_r142902434 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", { df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x }) expect_identical(df1Collect, expected) + # gapply on empty grouping columns. + dfTwoPartition <- repartition(df, 2L) + df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, schema(dfTwoPartition)) + expect_identical(sort(collect(df1TwoPartition)), sort(expected)) --- End diff -- hmm, I think it should work. `repartition` is not necessary. I'm just wondering how to test this in R... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19436#discussion_r142901810 --- Diff: R/pkg/tests/fulltests/test_sparkSQL.R --- @@ -3075,6 +3075,11 @@ test_that("gapply() and gapplyCollect() on a DataFrame", { df1Collect <- gapplyCollect(df, list("a"), function(key, x) { x }) expect_identical(df1Collect, expected) + # gapply on empty grouping columns. + dfTwoPartition <- repartition(df, 2L) + df1TwoPartition <- gapply(dfTwoPartition, c(), function(key, x) { x }, schema(dfTwoPartition)) + expect_identical(sort(collect(df1TwoPartition)), sort(expected)) --- End diff -- Actually, I tested these: ```R df1 <- gapply(df, c(), function(key, x) { x }, schema(df)) actual <- collect(df1) expect_identical(actual, expected) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19436 Let me install an R environment to test it locally... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19374 @ArtRand @susanxhuynh gentle ping. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19369: [SPARK-22147][CORE] Removed redundant allocations...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19369#discussion_r142896027 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -67,7 +67,7 @@ private[spark] class DiskStore( var threwException: Boolean = true try { writeFunc(out) - blockSizes.put(blockId.name, out.getCount) + blockSizes.put(blockId, out.getCount) --- End diff -- @superbobry I think the last test failure is legit, as you need to update the call to `remove(blockId.name)` at about line 116. I was surprised it even compiles, but for legacy reasons the JDK collection classes don't use generic types on methods like `remove`, so it accepts any object. That, however, should be the last change here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
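A self-contained reduction of the pitfall described above, using a hypothetical stand-in for Spark's `BlockId`: `java.util.Map#remove` takes `Object` rather than the key type, so a leftover `remove(blockId.name)` still compiles after the key type changes, silently removes nothing, and leaks the entry:

```scala
import java.util.concurrent.ConcurrentHashMap

case class BlockId(name: String) // stand-in for org.apache.spark.storage.BlockId

val blockSizes = new ConcurrentHashMap[BlockId, java.lang.Long]()
val id = BlockId("rdd_0_0")
blockSizes.put(id, 1024L)

blockSizes.remove(id.name) // wrong key type: compiles, but never matches a key
blockSizes.remove(id)      // correct: removes the entry
```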
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82473/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82473/testReport)** for PR 19436 at commit [`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19369 **[Test build #3941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3941/testReport)** for PR 19369 at commit [`8590efe`](https://github.com/apache/spark/commit/8590efec78638735f170e9f6d2fd04c65724e20e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19429: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user jomach commented on the issue: https://github.com/apache/spark/pull/19429 @felixcheung Sorry for that. It should be there now. Can you test? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19369: [SPARK-22147][CORE] Removed redundant allocations from B...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19369 **[Test build #3941 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3941/testReport)** for PR 19369 at commit [`8590efe`](https://github.com/apache/spark/commit/8590efec78638735f170e9f6d2fd04c65724e20e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82473/testReport)** for PR 19436 at commit [`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19436 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82470/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82470/testReport)** for PR 19436 at commit [`6710141`](https://github.com/apache/spark/commit/6710141767a2df92898af319bc4ef87f9110f911). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19436 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82472/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82472/testReport)** for PR 19436 at commit [`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82469/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82469 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82469/testReport)** for PR 18732 at commit [`e4efb32`](https://github.com/apache/spark/commit/e4efb3281008a2b450f9013aeb8f1ac9cf4ffa9e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19436: [SPARK-22206][SQL][SparkR] gapply in R can't work on emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19436 **[Test build #82472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82472/testReport)** for PR 19436 at commit [`0e111a8`](https://github.com/apache/spark/commit/0e111a8d095c9ecdb9fb8249332b9e12c15e8fce). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...
Github user squito commented on the issue: https://github.com/apache/spark/pull/19287 lgtm, thanks @xuanyuanking. @jerryshao, can you merge this? I will have very intermittent access for a few weeks, and I'd prefer not to merge in case there is any issue that needs an urgent follow-up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142854372 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -322,6 +326,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer { this } + @Since("2.3.0") + def setEpsilon(epsilon: Double): this.type = { +require(epsilon> 0, s"LDA epsilon must be positive, but was set to $epsilon") --- End diff -- Nit: missing space after `epsilon`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142853109 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM /** * For Online optimizer only: [[optimizer]] = "online". * + * @group expertParam --- End diff -- parameter comments. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142853643 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM /** * For Online optimizer only: [[optimizer]] = "online". * + * @group expertParam + */ + @Since("2.3.0") + final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" + +" A (positive) learning parameter that controls the convergence of variational inference.", --- End diff -- The parameter description here will not really help a user who doesn't know the LDA implementation. Please describe the effect of tuning the parameter, e.g. "A smaller value will lead to higher accuracy at the cost of more iterations." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
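One possible fuller description along the lines the review asks for, hosted in a dummy `Params` trait so it compiles standalone; the wording is an illustration, not the text that was eventually merged:

```scala
import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators}

trait LDAParamsSketch extends Params {
  // The "(For online optimizer)" prefix is kept from the original doc string.
  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer) " +
    "A positive convergence tolerance for the variational inference loop. " +
    "Smaller values give more accurate local topic estimates at the cost of " +
    "more inner iterations.", ParamValidators.gt(0))
}
```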