[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20455 **[Test build #86921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86921/testReport)** for PR 20455 at commit [`5246fcc`](https://github.com/apache/spark/commit/5246fcc5bb5936d64991fe7eb6acdd4cbdc25e05). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20455 Merged build finished. Test PASSed.
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20455 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/470/ Test PASSed.
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20466 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86912/ Test FAILed.
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20466 Merged build finished. Test FAILed.
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20466 **[Test build #86912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86912/testReport)** for PR 20466 at commit [`6e55d10`](https://github.com/apache/spark/commit/6e55d1000c62a86c14ad993d3699b0ed99f53cbb).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20469: [SPARK-23295][Build][Minor]Exclude Waring message when g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20469 **[Test build #86920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86920/testReport)** for PR 20469 at commit [`15d67ee`](https://github.com/apache/spark/commit/15d67eee9baa87a8fa08a265549000386fd476a6).
[GitHub] spark issue #20469: [SPARK-23295][Build][Minor]Exclude Waring message when g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20469 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/469/ Test PASSed.
[GitHub] spark issue #20469: [SPARK-23295][Build][Minor]Exclude Waring message when g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20469 Merged build finished. Test PASSed.
[GitHub] spark pull request #20469: [SPARK-23295][Build][Minor]Exclude Waring message...
GitHub user yaooqinn opened a pull request: https://github.com/apache/spark/pull/20469

[SPARK-23295][Build][Minor]Exclude Waring message when generating versions in make-distribution.sh

## What changes were proposed in this pull request?

When we specify a wrong profile for a Spark distribution build, such as `-Phadoop1000`, we get an odd package name like `spark-[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.-bin-hadoop-2.7.tgz`, which should actually be `"spark-$VERSION-bin-$NAME.tgz"`.

## How was this patch tested?

### before

```
build/mvn help:evaluate -Dexpression=scala.binary.version -Phadoop1000 2>/dev/null | grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.
```

```
build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null | grep -v "INFO" | tail -n 1
[WARNING] The requested profile "hadoop1000" could not be activated because it does not exist.
```

### after

```
build/mvn help:evaluate -Dexpression=project.version -Phadoop1000 2>/dev/null | grep -v "INFO" | grep -v "WARNING" | tail -n 1
2.4.0-SNAPSHOT
```

```
build/mvn help:evaluate -Dexpression=scala.binary.version -Dscala.binary.version=2.11.1 2>/dev/null | grep -v "INFO" | grep -v "WARNING" | tail -n 1
2.11.1
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yaooqinn/spark dist-minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20469.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20469

commit 15d67eee9baa87a8fa08a265549000386fd476a6
Author: Kent Yao
Date: 2018-02-01T07:27:00Z

    exclude warning patten too
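The `grep -v "WARNING"` step added by the patch can be modeled in a few lines of Python; the function name and the sample Maven output below are illustrative, not part of the patch:

```python
def last_non_log_line(output):
    # Model of `grep -v "INFO" | grep -v "WARNING" | tail -n 1`:
    # drop Maven log lines so only the evaluated expression value survives.
    lines = [line for line in output.splitlines()
             if "INFO" not in line and "WARNING" not in line]
    return lines[-1] if lines else None

raw = ('[INFO] Scanning for projects...\n'
       '[WARNING] The requested profile "hadoop1000" could not be activated '
       'because it does not exist.\n'
       '2.4.0-SNAPSHOT')
print(last_non_log_line(raw))  # 2.4.0-SNAPSHOT
```

Note the filter is substring-based, just like the shell pipeline: any legitimate line containing "WARNING" would also be dropped.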
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20400 **[Test build #86919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86919/testReport)** for PR 20400 at commit [`25fee39`](https://github.com/apache/spark/commit/25fee3901cfba3599330da394e437c91a9783368).
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20455 **[Test build #86918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86918/testReport)** for PR 20455 at commit [`7a1fd57`](https://github.com/apache/spark/commit/7a1fd57925a080116c288ca1793af86258019494).
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/468/ Test PASSed.
[GitHub] spark issue #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), unbounded...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20400 Merged build finished. Test PASSed.
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20455 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/467/ Test PASSed.
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20455 Merged build finished. Test PASSed.
[GitHub] spark issue #20455: [SPARK-23284][SQL] Document the behavior of several Colu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20455 Since the map support is added, I'll make the related change later.
[GitHub] spark issue #17886: [SPARK-13983][SQL] Fix HiveThriftServer2 can not get "--...
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/17886 @gatorsmile this is a great patch. The test can be improved, but I think it is safe to merge as it is.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20465 Yes, the tests are being run with python3. I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped. But since pandas and pyarrow are optional dependencies, there should be some way for the user to skip with an environment variable or something. At the very least, being able to verify they were run in a log would be good.
[GitHub] spark issue #20461: [SPARK-23289][CORE]OneForOneBlockFetcher.DownloadCallbac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20461 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/466/ Test PASSed.
[GitHub] spark issue #20461: [SPARK-23289][CORE]OneForOneBlockFetcher.DownloadCallbac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20461 **[Test build #86917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86917/testReport)** for PR 20461 at commit [`fed6dc2`](https://github.com/apache/spark/commit/fed6dc25c6293cad08e6759bc0a1cf414b91dfd0).
[GitHub] spark issue #20461: [SPARK-23289][CORE]OneForOneBlockFetcher.DownloadCallbac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20461 Merged build finished. Test PASSed.
[GitHub] spark pull request #20461: [SPARK-23289][CORE]OneForOneBlockFetcher.Download...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/20461#discussion_r165276461

--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java ---
@@ -171,7 +171,9 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) {
     @Override
     public void onData(String streamId, ByteBuffer buf) throws IOException {
-      channel.write(buf);
+      while (buf.hasRemaining()) {
+        channel.write(buf);
--- End diff --

@ConeyLiu Good catch. Let me also fix it.
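The bug this diff fixes is the classic short-write pitfall: a single `WritableByteChannel.write` call may consume only part of the buffer, so it must be retried until the buffer is drained. A stdlib-only Python sketch of the same loop (the fake channel below is illustrative and accepts at most 3 bytes per call to force short writes):

```python
def write_fully(channel_write, data):
    # Keep writing until every byte is consumed; a single write call may
    # accept only part of the buffer (Java: `while (buf.hasRemaining())`).
    view = memoryview(data)
    while len(view) > 0:
        n = channel_write(view)
        view = view[n:]

# Fake channel that takes at most 3 bytes per call, simulating short writes.
written = bytearray()
def short_write(view):
    chunk = bytes(view[:3])
    written.extend(chunk)
    return len(chunk)

write_fully(short_write, b"0123456789")
print(bytes(written))  # b'0123456789'
```

Without the loop, the first short write would silently drop the tail of the buffer, which is exactly the DownloadCallback corruption described in the PR title.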
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20468 LGTM
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20465 So, jenkins jobs run those tests with python3? If so, I feel better because those tests are not completely skipped in Jenkins. If it is hard to make them run with python 2, let's have a log to explicitly show whether we are going to run tests using pandas/pyarrow, which will help us confirm whether they get exercised with python 3 in Jenkins or not.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Merged build finished. Test PASSed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86907/ Test PASSed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20422 **[Test build #86907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86907/testReport)** for PR 20422 at commit [`f3f3627`](https://github.com/apache/spark/commit/f3f3627a60df471649a75c5d058f9349f8c520cc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20465 Yup, there was a related discussion already. See this https://github.com/apache/spark/pull/19884#issuecomment-351916074 and https://github.com/apache/spark/pull/19884#issuecomment-353068446. We shouldn't run this for now. Also these are technically not hard dependencies.
[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish
Github user liufengdb commented on the issue: https://github.com/apache/spark/pull/19219 The major issue this PR tries to cover has been fixed by https://github.com/apache/spark/pull/20029, so I think we are good if there are no calls to `HiveClientImpl.newSession`. We can close this PR with no-fix.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20465 Looking back at when pyarrow was last upgraded in #19884, pandas and pyarrow were upgraded on all workers for python 3, but there were maybe some concerns or difficulties with upgrading for python 2 and pypy environments at that time. That is why the above failure is from python 2 with an older version of pandas.
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20468 thanks! LGTM
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20468 **[Test build #86916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86916/testReport)** for PR 20468 at commit [`c44c477`](https://github.com/apache/spark/commit/c44c47701d337328493080a83d012abb35065ac2).
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20468 Merged build finished. Test PASSed.
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20468 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/465/ Test PASSed.
[GitHub] spark issue #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20468 cc @cloud-fan
[GitHub] spark pull request #20468: [SPARK-23280][SQL][FOLLOWUP] Fix Java style check...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20468

[SPARK-23280][SQL][FOLLOWUP] Fix Java style check issues.

## What changes were proposed in this pull request?

This is a follow-up of #20450, which broke lint-java checks. This pr fixes the lint-java issues.

```
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
```

## How was this patch tested?

Checked manually in my local environment.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-23280/fup1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20468.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20468

commit c44c47701d337328493080a83d012abb35065ac2
Author: Takuya UESHIN
Date: 2018-02-01T06:50:43Z

    Fix Java style check issues.
[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20464#discussion_r165271143

--- Diff: R/pkg/R/column.R ---
@@ -169,7 +169,7 @@ setMethod("alias",
 #' @note substr since 1.4.0
 setMethod("substr", signature(x = "Column"),
           function(x, start, stop) {
-            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
+            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
--- End diff --

This API behavior should be considered wrong, since it is inconsistent: for starting position 1 we get the substring from the 1st character, but for position 2 we still get the substring from the 1st character. So we get the following inconsistent results:

```R
> collect(select(df, substr(df$a, 1, 5)))
  substring(a, 0, 5)
1              abcde
> collect(select(df, substr(df$a, 2, 5)))
  substring(a, 1, 4)
1               abcd
```

For such a change, we might need to add a note in the doc, as @HyukjinKwon suggested.
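The argument mapping before and after this one-character fix can be checked with a small model; `r_substr_args` and `substr_1based` are illustrative helpers, not SparkR functions:

```python
def r_substr_args(start, stop):
    # After the fix: R's 1-based inclusive (start, stop) maps to the JVM
    # Column.substr(pos, len) with pos passed through and an inclusive length.
    return start, stop - start + 1

def substr_1based(s, pos, length):
    # Reference model of a 1-based substr on a plain string.
    return s[pos - 1:pos - 1 + length]

pos, length = r_substr_args(2, 5)
print(substr_1based("abcdef", pos, length))  # bcde

# The pre-fix code passed `start - 1`, shifting the result left by one:
print(substr_1based("abcdef", 2 - 1, length))  # abcd
```

This reproduces the inconsistency quoted in the review comment: with the old `start - 1`, `substr(df$a, 2, 5)` returned "abcd" (anchored at position 1) instead of "bcde".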
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86906/ Test PASSed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Merged build finished. Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 @srowen Different from the base model (like LoR), OVR and OVRModel do not have the param `rawPredictionCol`. So if the input dataframe contains a column with the same name as the base model's `getRawPredictionCol`, then OVRModel cannot transform the input.
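One way to avoid the collision described above is for OVRModel to stage its intermediate output under a name guaranteed not to clash with existing columns. A minimal sketch of that idea (the helper name is hypothetical, not Spark ML API):

```python
import uuid

def temp_col_name(existing_cols, base="rawPrediction"):
    # Pick a name that cannot collide with columns already in the DataFrame,
    # so an input column named like the base model's rawPredictionCol is safe.
    name = base
    while name in existing_cols:
        name = base + "_" + uuid.uuid4().hex[:8]
    return name

print(temp_col_name({"features", "label"}))  # rawPrediction (no collision)
print(temp_col_name({"rawPrediction"}) != "rawPrediction")  # True
```

The transform would write to the temporary name, derive its prediction, and drop the temporary column before returning, leaving the user's columns untouched.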
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20422 **[Test build #86906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86906/testReport)** for PR 20422 at commit [`246dbca`](https://github.com/apache/spark/commit/246dbcab7e4829b70e39d588a34b8322a6ede54f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20400: [SPARK-23084][PYTHON]Add unboundedPreceding(), un...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/20400#discussion_r165270774

--- Diff: python/pyspark/sql/window.py ---
@@ -129,11 +131,34 @@ def rangeBetween(start, end):
         :param end: boundary end, inclusive.
                     The frame is unbounded if this is ``Window.unboundedFollowing``, or
                     any value greater than or equal to min(sys.maxsize, 9223372036854775807).
+
+        >>> from pyspark.sql import functions as F, SparkSession, Window
+        >>> spark = SparkSession.builder.getOrCreate()
+        >>> df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"),
+        ...     (3, "b")], ["id", "category"])
+        >>> window = Window.orderBy("id").partitionBy("category").rangeBetween(F.currentRow(),
+        ...     F.lit(1))
+        >>> df.withColumn("sum", F.sum("id").over(window)).show()
+        +---+--------+---+
+        | id|category|sum|
+        +---+--------+---+
+        |  1|       b|  3|
+        |  2|       b|  5|
+        |  3|       b|  3|
+        |  1|       a|  4|
+        |  1|       a|  4|
+        |  2|       a|  2|
+        +---+--------+---+
+
--- End diff --

Seems to me this is required. I will change the rest except this one.
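The doctest's frame semantics (per row, sum the `id` values in the same partition that lie in `[current, current + 1]`) can be replayed without Spark; `range_between_sum` is an illustrative model, not a pyspark API:

```python
def range_between_sum(values, lo_off, hi_off):
    # Range-based frame: for each current value, sum every value v in the
    # partition with current + lo_off <= v <= current + hi_off.
    return [sum(v for v in values if cur + lo_off <= v <= cur + hi_off)
            for cur in values]

# The "b" partition from the doctest has ids 1, 2, 3; frame [currentRow, +1].
print(range_between_sum([1, 2, 3], 0, 1))  # [3, 5, 3]
# The "a" partition has ids 1, 1, 2.
print(range_between_sum([1, 1, 2], 0, 1))  # [4, 4, 2]
```

Both outputs match the `sum` column shown in the doctest, including the duplicate rows in partition "a": a range frame is defined on values, so both `id = 1` rows see the same frame.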
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Merged build finished. Test PASSed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20422 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86905/ Test PASSed.
[GitHub] spark issue #20422: [SPARK-23253][Core][Shuffle]Only write shuffle temporary...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20422 **[Test build #86905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86905/testReport)** for PR 20422 at commit [`a96f6c4`](https://github.com/apache/spark/commit/a96f6c460d89e5731b340f264b8085d0611974e1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertR...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20467
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20467 Thanks, @gatorsmile.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20467 Fine. I just merged it
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20467 Sorry, actually I am hitting a network problem. Let me try it later if it's merged.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20467 Merged to master.
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165268323

--- Diff: python/pyspark/sql/tests.py ---
@@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
             df.groupby('id').apply(f).collect()
 
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
+
+    @property
+    def data(self):
+        from pyspark.sql.functions import array, explode, col, lit
+        return self.spark.range(10).toDF('id') \
+            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
+            .withColumn("v", explode(col('vs'))) \
+            .drop('vs') \
+            .withColumn('w', lit(1.0))
+
+    @property
+    def python_plus_one(self):
+        from pyspark.sql.functions import udf
+
+        @udf('double')
+        def plus_one(v):
+            assert isinstance(v, (int, float))
+            return v + 1
+        return plus_one
+
+    @property
+    def pandas_scalar_plus_two(self):
+        import pandas as pd
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.SCALAR)
+        def plus_two(v):
+            assert isinstance(v, pd.Series)
+            return v + 2
+        return plus_two
+
+    @property
+    def pandas_agg_mean_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def avg(v):
+            return v.mean()
+        return avg
+
+    @property
+    def pandas_agg_sum_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def sum(v):
+            return v.sum()
+        return sum
+
+    @property
+    def pandas_agg_weighted_mean_udf(self):
+        import numpy as np
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def weighted_mean(v, w):
+            return np.average(v, weights=w)
+        return weighted_mean
+
+    def test_manual(self):
+        df = self.data
+        sum_udf = self.pandas_agg_sum_udf
+        mean_udf = self.pandas_agg_mean_udf
+
+        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+        expected1 = self.spark.createDataFrame(
+            [[0, 245.0, 24.5],
+             [1, 255.0, 25.5],
+             [2, 265.0, 26.5],
+             [3, 275.0, 27.5],
+             [4, 285.0, 28.5],
+             [5, 295.0, 29.5],
+             [6, 305.0, 30.5],
+             [7, 315.0, 31.5],
+             [8, 325.0, 32.5],
+             [9, 335.0, 33.5]],
+            ['id', 'sum(v)', 'avg(v)'])
+
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+    def test_basic(self):
+        from pyspark.sql.functions import col, lit, sum, mean
+
+        df = self.data
+        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
+
+        # Groupby one column and aggregate one UDF with literal
+        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
+        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+        # Groupby one expression and aggregate one UDF with literal
+        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
+            .sort(df.id + 1)
+        expected2 = df.groupby((col('id') + 1))\
+            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
+        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+        # Groupby one column and aggregate one UDF without literal
+        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
+        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
+        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
+
+        # Groupby one expression and aggregate one UDF without literal
+        result4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(weighted_mean_udf(df.v, df.w))\
+            .sort('id')
+        expected4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
+            .sort('id')
+        self.assertPandasEqual(ex
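The expected values in `test_manual` can be cross-checked by reconstructing the fixture's `v` column in plain Python (each `id` group's values are `i * 1.0 + id` for `i` in 20..29); `group_values` is an illustrative helper, not part of the test suite:

```python
def group_values(group_id):
    # The `v` column for one `id` group in the fixture:
    # vs = [i * 1.0 + id for i in range(20, 30)], then exploded into rows.
    return [i * 1.0 + group_id for i in range(20, 30)]

v0 = group_values(0)
print(sum(v0), sum(v0) / len(v0))  # 245.0 24.5  -> expected row [0, 245.0, 24.5]
v9 = group_values(9)
print(sum(v9), sum(v9) / len(v9))  # 335.0 33.5  -> expected row [9, 335.0, 33.5]
```

Each group contributes 10 values, so each increment of `id` adds 10.0 to the group sum and 1.0 to the mean, which matches the arithmetic progression in `expected1`.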
[GitHub] spark issue #20462: [SPARK-23020][core] Fix another race in the in-process l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20462 Merged build finished. Test PASSed.
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20460 I'd target this at 2.3 & master. Waiting for tests.
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20460 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/462/
[GitHub] spark issue #20462: [SPARK-23020][core] Fix another race in the in-process l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20462 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86904/ Test PASSed.
[GitHub] spark issue #20462: [SPARK-23020][core] Fix another race in the in-process l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20462

**[Test build #86904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86904/testReport)** for PR 20462 at commit [`b967775`](https://github.com/apache/spark/commit/b96777573bdc9dc92b3419fb44bbd790117ee00e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20460 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/462/
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20460 Merged build finished. Test PASSed.
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20460 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/464/ Test PASSed.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20465

```
ImportError: Pandas >= 0.19.2 must be installed on calling Python process; however, your version was 0.16.0.
```

I guess the RISELab boxes will need some updates...
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20465 Sure, that's OK. I think we can revisit later (i.e., next release) if we want to add an env switch or something to make them optional.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20467 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86913/ Test PASSed.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20467 Merged build finished. Test PASSed.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20467

**[Test build #86913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86913/testReport)** for PR 20467 at commit [`e9a1500`](https://github.com/apache/spark/commit/e9a1500be55a9b8a9affcd2513afc262cc2a666b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20465

> I think there are some values in having a way to run python tests without Arrow?

I agree, but the more important thing is to make sure Jenkins runs everything, so that we can be confident about our release.
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20460 **[Test build #86915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86915/testReport)** for PR 20460 at commit [`d9805c3`](https://github.com/apache/spark/commit/d9805c3e4d4795f866e72f3c30f8ca29db90761d).
[GitHub] spark issue #20460: [SPARK-23285][K8S] Allow fractional values for spark.exe...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20460 Jenkins, retest this please
[GitHub] spark issue #20454: [SPARK-23202][SQL] Add new API in DataSourceWriter: onDa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20454 **[Test build #86914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86914/testReport)** for PR 20454 at commit [`4ae9b5e`](https://github.com/apache/spark/commit/4ae9b5e4da575066fc36753793fa6437f18a1ddf).
[GitHub] spark issue #20454: [SPARK-23202][SQL] Add new API in DataSourceWriter: onDa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/463/ Test PASSed.
[GitHub] spark issue #20454: [SPARK-23202][SQL] Add new API in DataSourceWriter: onDa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20454 Merged build finished. Test PASSed.
[GitHub] spark issue #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-aware ti...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18933 Ping. I ran into this exact issue with pandas_udf on a simple data set with a timestamp-type column. As far as I can tell, there is no way around this, since the pandas code is running deep inside pyspark, and the only workaround is to make the column a string? @BryanCutler @ueshin @icexelloss @HyukjinKwon any thoughts on how to fix this?
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20465 @felixcheung Jenkins is actually skipping those tests (see the failure of this PR). It makes sense to provide a way to allow developers to not run those tests, but I'd prefer that we run those tests by default, so we can make sure that Jenkins is doing the right thing.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20465 Hmm, I think there is some value in having a way to run python tests without Arrow? I mean, test.py is not just for Jenkins but for everyone consuming the Spark release... unless we are saying Arrow is required now? And in Jenkins we shouldn't be skipping any of these tests anyway? Is there a reason we need to change that if Jenkins isn't affected? (If I recall, there is a way to check if we're running under Jenkins; we could always make Arrow tests not skipped in Jenkins.)
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20383 I agree. Sorry to merge it so quickly; let me revert it. @ssaavedra, would you please submit the PR again when everything is done? Thanks!
[GitHub] spark issue #20383: [SPARK-23200] Reset Kubernetes-specific config on Checkp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20383 My take is not that it doesn't work, but that some names are out of date because it was done for the k8s fork. I think we should revert the commit and wait until it is tested out completely. WDYT?
[GitHub] spark pull request #20464: [SPARK-23291][SQL][R] R's substr should not reduc...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20464#discussion_r165263961

--- Diff: R/pkg/R/column.R ---
```diff
@@ -169,7 +169,7 @@ setMethod("alias",
 #' @note substr since 1.4.0
 setMethod("substr", signature(x = "Column"),
           function(x, start, stop) {
-            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
+            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))
```

I'm a bit concerned about changing this. As you can see, it's been like this from the very beginning...
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20467 **[Test build #86913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86913/testReport)** for PR 20467 at commit [`e9a1500`](https://github.com/apache/spark/commit/e9a1500be55a9b8a9affcd2513afc262cc2a666b).
[GitHub] spark issue #20457: [SPARK-23110][MINOR] Make linearRegressionModel construc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20457 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86909/ Test PASSed.
[GitHub] spark issue #20457: [SPARK-23110][MINOR] Make linearRegressionModel construc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20457 Merged build finished. Test PASSed.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20467 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/462/ Test PASSed.
[GitHub] spark issue #20457: [SPARK-23110][MINOR] Make linearRegressionModel construc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20457

**[Test build #86909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86909/testReport)** for PR 20457 at commit [`cdcce18`](https://github.com/apache/spark/commit/cdcce18425ee669b99323cf94bd04015ee080439).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20467 Merged build finished. Test PASSed.
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165262989

--- Diff: python/pyspark/sql/tests.py ---
```diff
@@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
         df.groupby('id').apply(f).collect()

+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
+
+    @property
+    def data(self):
+        from pyspark.sql.functions import array, explode, col, lit
+        return self.spark.range(10).toDF('id') \
+            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
+            .withColumn("v", explode(col('vs'))) \
+            .drop('vs') \
+            .withColumn('w', lit(1.0))
+
+    @property
+    def python_plus_one(self):
+        from pyspark.sql.functions import udf
+
+        @udf('double')
+        def plus_one(v):
+            assert isinstance(v, (int, float))
+            return v + 1
+        return plus_one
+
+    @property
+    def pandas_scalar_plus_two(self):
+        import pandas as pd
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.SCALAR)
+        def plus_two(v):
+            assert isinstance(v, pd.Series)
+            return v + 2
+        return plus_two
+
+    @property
+    def pandas_agg_mean_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def avg(v):
+            return v.mean()
+        return avg
+
+    @property
+    def pandas_agg_sum_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def sum(v):
+            return v.sum()
+        return sum
+
+    @property
+    def pandas_agg_weighted_mean_udf(self):
+        import numpy as np
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def weighted_mean(v, w):
+            return np.average(v, weights=w)
+        return weighted_mean
+
+    def test_manual(self):
+        df = self.data
+        sum_udf = self.pandas_agg_sum_udf
+        mean_udf = self.pandas_agg_mean_udf
+
+        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+        expected1 = self.spark.createDataFrame(
+            [[0, 245.0, 24.5],
+             [1, 255.0, 25.5],
+             [2, 265.0, 26.5],
+             [3, 275.0, 27.5],
+             [4, 285.0, 28.5],
+             [5, 295.0, 29.5],
+             [6, 305.0, 30.5],
+             [7, 315.0, 31.5],
+             [8, 325.0, 32.5],
+             [9, 335.0, 33.5]],
+            ['id', 'sum(v)', 'avg(v)'])
+
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+    def test_basic(self):
+        from pyspark.sql.functions import col, lit, sum, mean
+
+        df = self.data
+        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
+
+        # Groupby one column and aggregate one UDF with literal
+        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
+        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+        # Groupby one expression and aggregate one UDF with literal
+        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
+            .sort(df.id + 1)
+        expected2 = df.groupby((col('id') + 1))\
+            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
+        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+        # Groupby one column and aggregate one UDF without literal
+        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
+        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
+        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
+
+        # Groupby one expression and aggregate one UDF without literal
+        result4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(weighted_mean_udf(df.v, df.w))\
+            .sort('id')
+        expected4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
+            .sort('id')
+        self.assertPandasEqual(expecte
```
[GitHub] spark pull request #20467: [SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertR...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/20467

[SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRegexp` instead of `assertRaisesRegex`.

## What changes were proposed in this pull request?

This is a follow-up PR of #19872, which uses `assertRaisesRegex`; that method doesn't exist in Python 2, so some tests fail when running in a Python 2 environment. Unfortunately, we missed it because the Python 2 environment of the PR builder currently doesn't have proper versions of pandas or pyarrow, so the tests were skipped. This PR modifies the tests to use `assertRaisesRegexp` instead of `assertRaisesRegex`.

## How was this patch tested?

Tested manually in my local environment.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-22274/fup1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20467.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20467

commit e9a1500be55a9b8a9affcd2513afc262cc2a666b
Author: Takuya UESHIN
Date: 2018-02-01T05:13:59Z

    Use `assertRaisesRegexp` instead of `assertRaisesRegex`.
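For context, the Python 2/3 incompatibility this follow-up fixes comes down to which method name `unittest.TestCase` provides. A small, version-agnostic sketch (the `RegexAssertionExample` class is illustrative, not Spark test code):

```python
import unittest


class RegexAssertionExample(unittest.TestCase):
    def runTest(self):
        # Python 2 only has `assertRaisesRegexp`; Python 3.2+ has
        # `assertRaisesRegex` (the old name survived as a deprecated
        # alias until its removal in Python 3.12).  Pick whichever
        # spelling this interpreter provides.
        assert_raises_regex = (getattr(self, "assertRaisesRegex", None)
                               or self.assertRaisesRegexp)
        with assert_raises_regex(ValueError, "bad"):
            raise ValueError("bad input")
```

Since Spark at the time still supported Python 2, the tests had to use the `assertRaisesRegexp` spelling unconditionally.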
[GitHub] spark issue #20464: [SPARK-23291][SQL][R] R's substr should not reduce start...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20464

> One followup question is though, would it be difficult to match the behaviour with substr in R when the index is 0 or minus? If I understood #20464 (comment) correctly, it sounds better to match it to substr's behaviour in R. Took a quick look/test and seems we can just set start to 1 for both cases.

If we consider both the starting and ending indices, setting them to 1 seems not enough. E.g.,

```R
> substr("abcdef", -2, -3)
[1] ""
> substr("abcdef", 1, 1)
[1] "a"
```

For the cases where only the ending index is zero/negative, no matter what the starting index is, the result is the empty string. For the cases where only the starting index is zero/negative, we can set it to 1. For the cases where both are zero/negative, the result is the empty string. We can address this in another PR.
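The R semantics being discussed can be captured in a tiny reference model. `r_substr` below is a hypothetical helper written only to illustrate those rules; it is not Spark code:

```python
def r_substr(s, start, stop):
    """Mimic base R's 1-indexed, inclusive substr(s, start, stop)."""
    # A zero or negative starting index behaves like start = 1 in R.
    start = max(start, 1)
    # When the ending index falls before the effective start (e.g. stop is
    # zero/negative, or both indices are zero/negative), the result is empty.
    if stop < start:
        return ""
    return s[start - 1:stop]


print(repr(r_substr("abcdef", -2, -3)))  # ''  (matches R)
print(repr(r_substr("abcdef", 1, 1)))    # 'a'
```

This makes the three cases above mechanical: a zero/negative stop always empties the result, a zero/negative start alone is clamped to 1, and both zero/negative yields the empty string.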
[GitHub] spark pull request #20424: [Spark-23240][python] Better error message when e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20424#discussion_r165261830

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
```diff
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
         daemon = pb.start()

         val in = new DataInputStream(daemon.getInputStream)
-        daemonPort = in.readInt()
+        try {
+          daemonPort = in.readInt()
+        } catch {
+          case exc: EOFException =>
+            throw new IOException(s"No port number in $daemonModule's stdout")
+        }
+
+        // test that the returned port number is within a valid range.
+        // note: this does not cover the case where the port number
+        // is arbitrary data but is also coincidentally within range
+        if (daemonPort < 1 || daemonPort > 0x) {
```

Ah, OK. Thanks for the clarification. Maybe I was caring too much about it. Thanks all for bearing with me. I am fine as is.
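The guarded read being discussed can be mimicked outside the JVM. A hedged sketch in Python (`read_daemon_port` is a made-up helper, not Spark code): read a 4-byte big-endian integer the way `DataInputStream.readInt()` does, then range-check it against the valid TCP port range 1..65535.

```python
import io
import struct


def read_daemon_port(stream):
    """Read and validate a port number written as a 4-byte big-endian int."""
    data = stream.read(4)
    if len(data) < 4:  # analogous to catching EOFException in the diff
        raise IOError("no port number in daemon stdout")
    (port,) = struct.unpack(">i", data)
    # Valid TCP ports are 1..65535; anything else suggests the daemon
    # printed arbitrary bytes instead of a port, which is exactly the
    # case the range check in the diff is guarding against.
    if port < 1 or port > 0xFFFF:
        raise IOError("bad port number: %d" % port)
    return port


print(read_daemon_port(io.BytesIO(struct.pack(">i", 61234))))  # 61234
```

As the quoted comment notes, this cannot catch garbage that happens to decode to an in-range value; it only rejects the obviously invalid cases.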
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165261550

--- Diff: python/pyspark/sql/tests.py ---
```diff
@@ -4353,6 +4347,446 @@ def test_unsupported_types(self):
         df.groupby('id').apply(f).collect()

+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyAggPandasUDFTests(ReusedSQLTestCase):
+
+    @property
+    def data(self):
+        from pyspark.sql.functions import array, explode, col, lit
+        return self.spark.range(10).toDF('id') \
+            .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
+            .withColumn("v", explode(col('vs'))) \
+            .drop('vs') \
+            .withColumn('w', lit(1.0))
+
+    @property
+    def python_plus_one(self):
+        from pyspark.sql.functions import udf
+
+        @udf('double')
+        def plus_one(v):
+            assert isinstance(v, (int, float))
+            return v + 1
+        return plus_one
+
+    @property
+    def pandas_scalar_plus_two(self):
+        import pandas as pd
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.SCALAR)
+        def plus_two(v):
+            assert isinstance(v, pd.Series)
+            return v + 2
+        return plus_two
+
+    @property
+    def pandas_agg_mean_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def avg(v):
+            return v.mean()
+        return avg
+
+    @property
+    def pandas_agg_sum_udf(self):
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def sum(v):
+            return v.sum()
+        return sum
+
+    @property
+    def pandas_agg_weighted_mean_udf(self):
+        import numpy as np
+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        @pandas_udf('double', PandasUDFType.GROUP_AGG)
+        def weighted_mean(v, w):
+            return np.average(v, weights=w)
+        return weighted_mean
+
+    def test_manual(self):
+        df = self.data
+        sum_udf = self.pandas_agg_sum_udf
+        mean_udf = self.pandas_agg_mean_udf
+
+        result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id')
+        expected1 = self.spark.createDataFrame(
+            [[0, 245.0, 24.5],
+             [1, 255.0, 25.5],
+             [2, 265.0, 26.5],
+             [3, 275.0, 27.5],
+             [4, 285.0, 28.5],
+             [5, 295.0, 29.5],
+             [6, 305.0, 30.5],
+             [7, 315.0, 31.5],
+             [8, 325.0, 32.5],
+             [9, 335.0, 33.5]],
+            ['id', 'sum(v)', 'avg(v)'])
+
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+    def test_basic(self):
+        from pyspark.sql.functions import col, lit, sum, mean
+
+        df = self.data
+        weighted_mean_udf = self.pandas_agg_weighted_mean_udf
+
+        # Groupby one column and aggregate one UDF with literal
+        result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id')
+        expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id')
+        self.assertPandasEqual(expected1.toPandas(), result1.toPandas())
+
+        # Groupby one expression and aggregate one UDF with literal
+        result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\
+            .sort(df.id + 1)
+        expected2 = df.groupby((col('id') + 1))\
+            .agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1)
+        self.assertPandasEqual(expected2.toPandas(), result2.toPandas())
+
+        # Groupby one column and aggregate one UDF without literal
+        result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id')
+        expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id')
+        self.assertPandasEqual(expected3.toPandas(), result3.toPandas())
+
+        # Groupby one expression and aggregate one UDF without literal
+        result4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(weighted_mean_udf(df.v, df.w))\
+            .sort('id')
+        expected4 = df.groupby((col('id') + 1).alias('id'))\
+            .agg(mean(df.v).alias('weighted_mean(v, w)'))\
+            .sort('id')
+        self.assertPandasEqual(expecte
```
[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19219 cc @liufengdb
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20466 **[Test build #86912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86912/testReport)** for PR 20466 at commit [`6e55d10`](https://github.com/apache/spark/commit/6e55d1000c62a86c14ad993d3699b0ed99f53cbb).
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20466 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/461/ Test PASSed.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20465 Merged build finished. Test FAILed.
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20466 Merged build finished. Test PASSed.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20465 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86910/ Test FAILed.
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20465

**[Test build #86910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86910/testReport)** for PR 20465 at commit [`8aba4f5`](https://github.com/apache/spark/commit/8aba4f502879b7e3b8c154b00ded22e4bcba8df2).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20466 cc @gatorsmile @rdblue @sameeragarwal
[GitHub] spark pull request #20466: [SPARK-23293][SQL] fix data source v2 self join
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20466

[SPARK-23293][SQL] fix data source v2 self join

## What changes were proposed in this pull request?

`DataSourceV2Relation` should extend `MultiInstanceRelation`, to take care of self-join.

## How was this patch tested?

a new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark dsv2-selfjoin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20466.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20466

commit 6e55d1000c62a86c14ad993d3699b0ed99f53cbb
Author: Wenchen Fan
Date: 2018-02-01T05:07:07Z

    fix data source v2 self join
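For background: without `MultiInstanceRelation`, both sides of a self-join expose attributes carrying the same expression IDs, so column references in the join become ambiguous. A toy, non-Spark sketch of the "fresh IDs per instance" idea (the `Relation` class and its fields are purely illustrative):

```python
import itertools

# Global counter standing in for Catalyst's monotonically increasing
# expression-ID allocator.
_expr_ids = itertools.count()


class Relation:
    """Toy analogue of a leaf plan node whose output attributes carry IDs."""

    def __init__(self, columns):
        # Each output attribute gets a globally unique ID at construction.
        self.output = {c: next(_expr_ids) for c in columns}

    def new_instance(self):
        # What MultiInstanceRelation provides: a copy whose attributes get
        # fresh IDs, so the two sides of a self-join stay distinguishable.
        return Relation(list(self.output))


base = Relation(["id", "value"])
left, right = base, base.new_instance()
# Same column names, but disjoint attribute IDs on the two join sides.
assert set(left.output.values()).isdisjoint(set(right.output.values()))
```

The sketch only illustrates the mechanism; the actual fix is the one-line inheritance change in the Scala plan node described above.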
[GitHub] spark issue #19219: [SPARK-21993][SQL] Close sessionState when finish
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19219 **[Test build #86911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86911/testReport)** for PR 19219 at commit [`e421113`](https://github.com/apache/spark/commit/e4211137bdc72c3e94d7bce2944d108e5cb70b55).
[GitHub] spark pull request #20463: [SQL][MINOR] Inline SpecifiedWindowFrame.defaultW...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20463
[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20361 Not a bug fix, so this is not qualified for merging into Spark 2.3.