[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r213158056 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -969,6 +969,22 @@ class DatasetSuite extends QueryTest with SharedSQLContext { checkShowString(ds, expected) } + + test("SPARK-24442 Show should follow spark.show.default.number.of.rows") { +withSQLConf("spark.sql.show.defaultNumRows" -> "100") { + val ds = (1 to 1000).toDS().as[Int].show --- End diff -- I think it's ok to check the output number of rows in show. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22162 ya, sure.
[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22162#discussion_r213157406 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -815,6 +815,24 @@ class Dataset[T] private[sql]( println(showString(numRows, truncate, vertical)) // scalastyle:on println + /** + * Returns the default number of rows to show when the show function is called without + * a user specified max number of rows. + * @since 2.3.0 + */ + private def numberOfRowsToShow(): Int = { +this.sparkSession.conf.get("spark.sql.show.defaultNumRows", "20").toInt --- End diff -- +1
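The diff above reads the `spark.sql.show.defaultNumRows` key with a fallback of `"20"`. As a minimal, Spark-free sketch of that get-with-default pattern (a plain `Map` stands in for `SparkSession.conf`; the method name mirrors the diff but everything else is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {
    // Hypothetical stand-in for SparkSession.conf: string keys and values.
    static int numberOfRowsToShow(Map<String, String> conf) {
        // Fall back to "20", matching the default used in the diff above.
        return Integer.parseInt(conf.getOrDefault("spark.sql.show.defaultNumRows", "20"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(numberOfRowsToShow(conf));   // no override set
        conf.put("spark.sql.show.defaultNumRows", "100");
        System.out.println(numberOfRowsToShow(conf));   // overridden
    }
}
```

Storing the default as a string and parsing on read keeps the config store uniformly string-typed, which is the trade-off the diff makes as well.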
[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22162 We should wait for @AndrewKL for a few days?
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Merged build finished. Test FAILed.
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21546 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95310/ Test FAILed.
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21546 **[Test build #95310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95310/testReport)** for PR 21546 at commit [`2fe46f8`](https://github.com/apache/spark/commit/2fe46f82dc38af972bc0974aca1fd846bcb483e5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/7#discussion_r213154483 --- Diff: sql/core/src/test/resources/sql-tests/inputs/string-functions.sql --- @@ -5,6 +5,10 @@ select format_string(); -- A pipe operator for string concatenation select 'a' || 'b' || 'c'; +-- split function +select split('aa1cc2ee', '[1-9]+', 2); +select split('aa1cc2ee', '[1-9]+'); + --- End diff -- Can you move these tests to the end of this file to reduce unnecessary changes in the golden file?
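The limit semantics the SQL tests above exercise match Java's regex `String.split`: with a limit of 2 the pattern is applied at most once, so the remainder is kept intact in the last element. A quick plain-Java illustration of those two cases (this shows Java's split semantics, which the SQL examples mirror, not Spark's implementation itself):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // Limit 2: split at the first match only; "cc2ee" stays whole.
        System.out.println(Arrays.toString("aa1cc2ee".split("[1-9]+", 2)));
        // No limit: every match of the pattern splits the string.
        System.out.println(Arrays.toString("aa1cc2ee".split("[1-9]+")));
    }
}
```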
[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22198 **[Test build #95322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95322/testReport)** for PR 22198 at commit [`83387f6`](https://github.com/apache/spark/commit/83387f6f3b86532a79e83e8483c5e4683ff8beac).
[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22198 Merged build finished. Test PASSed.
[GitHub] spark issue #22198: [SPARK-25121][SQL] Supports multi-part table names for b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22198 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2597/ Test PASSed.
[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22162 I have enough bandwidth to take it, too. Is it ok to take it over? @mgaido91, are you not working on this now?
[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21976 Merged build finished. Test FAILed.
[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21976 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95307/ Test FAILed.
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user NiharS commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213150133 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -130,6 +130,16 @@ private[spark] class Executor( private val urlClassLoader = createClassLoader() private val replClassLoader = addReplClassLoaderIfNeeded(urlClassLoader) + // One thread will handle loading all of the plugins on this executor --- End diff -- That does make sense. While I did say "aside from semantics", semantics is a good reason to include it, especially since it'll be harder to get plugin writers to adopt an `init` function later. I'll make the other changes and make sure the tests still pass. If anyone feels strongly (or even weakly) one way over another, I don't think there's much harm in either approach.
[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21976 **[Test build #95307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95307/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22238: [SPARK-25245][DOCS][SS] Explain regarding limiting modif...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22238 **[Test build #95321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95321/testReport)** for PR 22238 at commit [`138cc63`](https://github.com/apache/spark/commit/138cc63e639b60fb7e803097654816ad6c19c95f).
[GitHub] spark issue #22149: [SPARK-25158][SQL]Executor accidentally exit because Scr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22149 **[Test build #95320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95320/testReport)** for PR 22149 at commit [`412497f`](https://github.com/apache/spark/commit/412497f2ad615e5aeecb91e7fd5053864a00be37).
[GitHub] spark issue #22210: [SPARK-25218][Core]Fix potential resource leaks in Trans...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/22210 LGTM! Good catches
[GitHub] spark issue #22149: [SPARK-25158][SQL]Executor accidentally exit because Scr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22149 ok to test
[GitHub] spark issue #22149: [SPARK-25158][SQL]Executor accidentally exit because Scr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22149 Is it possible to add a test case?
[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/22010 I did a quick micro-benchmark on this and got: > scala> :paste > // Entering paste mode (ctrl-D to finish) > > import scala.collection.{mutable, Map} > def removeDuplicatesInPartition(itr: Iterator[Int]): Iterator[Int] = { > val set = new mutable.HashSet[Int]() > itr.filter(set.add(_)) > } > > def time[R](block: => R): (Long, R) = { > val t0 = System.nanoTime() > val result = block// call-by-name > val t1 = System.nanoTime() > println("Elapsed time: " + (t1 - t0) + "ns") > (t1, result) > } > > val count = 100 > val inputData = sc.parallelize(1.to(count)).cache() > inputData.count() > > val o1 = time(inputData.distinct().count()) > val n1 = time(inputData.mapPartitions(removeDuplicatesInPartition).count()) > val n2 = time(inputData.mapPartitions(removeDuplicatesInPartition).count()) > val o2 = time(inputData.distinct().count()) > val n3 = time(inputData.mapPartitions(removeDuplicatesInPartition).count()) > > > // Exiting paste mode, now interpreting. > > Elapsed time: 2464151504ns > Elapsed time: 219130154ns > Elapsed time: 133545428ns > Elapsed time: 927133584ns > Elapsed time: 242432642ns > import scala.collection.{mutable, Map} > removeDuplicatesInPartition: (itr: Iterator[Int])Iterator[Int] > time: [R](block: => R)(Long, R) > count: Int = 100 > inputData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at parallelize at :47 > o1: (Long, Long) = (437102431151279,100) > n1: (Long, Long) = (437102654798968,100) > n2: (Long, Long) = (437102792389328,100) > o2: (Long, Long) = (437103724196085,100) > n3: (Long, Long) = (437103971061275,100) >
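The `removeDuplicatesInPartition` trick in the benchmark above relies on `HashSet.add` returning false for elements already seen, so the set doubles as a stateful filter over one partition's iterator; no shuffle is needed when a known partitioner guarantees each key lives in exactly one partition. The same idea in Java (one `List` standing in for one partition; only valid for sequential traversal, since the predicate is stateful):

```java
import java.util.*;
import java.util.stream.Collectors;

public class DedupDemo {
    // Per-partition dedup: HashSet.add returns false for repeats, so it
    // acts as a filter while preserving first-occurrence order.
    static List<Integer> removeDuplicatesInPartition(List<Integer> partition) {
        Set<Integer> seen = new HashSet<>();
        return partition.stream().filter(seen::add).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicatesInPartition(Arrays.asList(1, 2, 2, 3, 1, 4)));
    }
}
```

This is why the benchmark's `mapPartitions` variant beats `distinct()`: it skips the shuffle entirely when the partitioner already guarantees key locality.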
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22209 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95305/ Test FAILed.
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22209 Merged build finished. Test FAILed.
[GitHub] spark issue #22211: [SPARK-23207][SPARK-22905][SPARK-24564][SPARK-25114][SQL...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22211 Thanks! Merged to 2.1
[GitHub] spark issue #22209: [SPARK-24415][Core] Fixed the aggregated stage metrics b...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22209 **[Test build #95305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95305/testReport)** for PR 22209 at commit [`0552af0`](https://github.com/apache/spark/commit/0552af0abb484c1b9129a0091b2057e06d5ab4ac). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r213143932 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/UISeleniumSuite.scala --- @@ -77,7 +77,14 @@ class UISeleniumSuite inputStream.foreachRDD { rdd => rdd.foreach(_ => {}) try { -rdd.foreach(_ => throw new RuntimeException("Oops")) +rdd.foreach(_ => { --- End diff -- Since you're touching this: `.foreach { _ =>`
[GitHub] spark pull request #22209: [SPARK-24415][Core] Fixed the aggregated stage me...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22209#discussion_r213143804 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -350,11 +350,22 @@ private[spark] class AppStatusListener( val e = it.next() if (job.stageIds.contains(e.getKey()._1)) { val stage = e.getValue() - stage.status = v1.StageStatus.SKIPPED - job.skippedStages += stage.info.stageId - job.skippedTasks += stage.info.numTasks - it.remove() - update(stage, now) + // Only update the stage if it has not finished already + if (v1.StageStatus.ACTIVE.equals(stage.status) || --- End diff -- So I went back and took a closer look and I think this isn't entirely correct (and wasn't entirely correct before either). If I remember the semantics correctly, the stage should be skipped if it is part of the job's stages, and is in the pending state when the job finishes. If it's in the active state, it should not be marked as skipped. If you do that, the update to the skipped tasks (in L358) will most certainly be wrong. So if the state is still active here, it means some event was missed. The best we can do in that case, I think, is remove it from the live stages list and update the pool data, and that's it. On a related note, if the "onStageSubmitted" event is missed, the stage will remain in the "pending" state even if tasks start on it. Perhaps that could also be added to the "onTaskStart" handler, just to be sure the stage is marked as active.
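The rule described in the review comment above can be condensed into a tiny state function: on job end, only a still-pending stage becomes SKIPPED; an active stage means an event was missed and must not be counted as skipped. A hypothetical Java sketch of just that decision (the enum mirrors Spark's `v1.StageStatus` names, but this is not the listener's actual code):

```java
public class StageStatusDemo {
    enum StageStatus { PENDING, ACTIVE, COMPLETE, SKIPPED, FAILED }

    // On job end: PENDING stages of the job are the skipped ones; any
    // other state (including ACTIVE, which implies a missed event) is
    // left untouched rather than force-marked SKIPPED.
    static StageStatus onJobEnd(StageStatus current) {
        return current == StageStatus.PENDING ? StageStatus.SKIPPED : current;
    }

    public static void main(String[] args) {
        System.out.println(onJobEnd(StageStatus.PENDING));
        System.out.println(onJobEnd(StageStatus.ACTIVE));
    }
}
```

Counting `skippedTasks` only for stages taking the PENDING→SKIPPED transition is what keeps the aggregated metrics consistent.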
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95319/ Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #95319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95319/testReport)** for PR 22042 at commit [`ea804cf`](https://github.com/apache/spark/commit/ea804cfe840196519cc9444be9bedf03d10aa11a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213142394 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -130,6 +130,16 @@ private[spark] class Executor( private val urlClassLoader = createClassLoader() private val replClassLoader = addReplClassLoaderIfNeeded(urlClassLoader) + // One thread will handle loading all of the plugins on this executor --- End diff -- I guess it could be in the constructor; `Utils.loadExtensions` already provides a `SparkConf` to the constructor if one accepts it, which was the only thing I could think of. I generally dislike plugin APIs that encourage initialization in the constructor, but here, other than maybe potentially some benefit for testing, I'm not seeing a lot of differences in not having the init method after all...
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22188 @gatorsmile Thanks much!
[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API
Github user NiharS commented on a diff in the pull request: https://github.com/apache/spark/pull/22192#discussion_r213140764 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -130,6 +130,16 @@ private[spark] class Executor( private val urlClassLoader = createClassLoader() private val replClassLoader = addReplClassLoaderIfNeeded(urlClassLoader) + // One thread will handle loading all of the plugins on this executor --- End diff -- Aside from semantics, would an `init` method be necessary instead of having the initialization logic be in the plugin's constructor? Since the class loader is going to call the constructor immediately, I figure having an `init` function would only really make a difference if we want to load the plugins right here, and then call `init` at a later point in the executor's creation. I can't think of any particular reason why we'd want to do that, unless there are specific executor structures that we want created prior to plugin initialization (although in that case we could also just move the plugin initialization further down)
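The constructor-vs-`init` question above is about who controls when setup runs: with a separate hook, the host can construct all plugins first and invoke initialization later, once the executor is ready. A minimal Java sketch of that shape (interface and loader names are hypothetical stand-ins for the proposed API and `Utils.loadExtensions`, not Spark's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class PluginDemo {
    // Hypothetical plugin contract: a no-arg constructor plus a
    // separate init() hook that the host calls when it chooses.
    interface ExecutorPlugin { default void init() {} }

    static class CountingPlugin implements ExecutorPlugin {
        static int initialized = 0;
        @Override public void init() { initialized++; }
    }

    // Stand-in for Utils.loadExtensions: instantiate each named class
    // reflectively, then run the init hook after construction.
    static List<ExecutorPlugin> loadPlugins(List<String> classNames) throws Exception {
        List<ExecutorPlugin> plugins = new ArrayList<>();
        for (String name : classNames) {
            ExecutorPlugin p = (ExecutorPlugin)
                Class.forName(name).getDeclaredConstructor().newInstance();
            p.init(); // host-controlled: could equally be deferred
            plugins.add(p);
        }
        return plugins;
    }

    public static void main(String[] args) throws Exception {
        loadPlugins(List.of("PluginDemo$CountingPlugin"));
        System.out.println(CountingPlugin.initialized);
    }
}
```

Decoupling construction from initialization also makes testing easier: a test can construct the plugin without triggering its side effects.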
[GitHub] spark issue #22247: [SPARK-25253][PYSPARK] Refactor local connection & auth ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22247 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95303/ Test PASSed.
[GitHub] spark issue #22247: [SPARK-25253][PYSPARK] Refactor local connection & auth ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22247 Merged build finished. Test PASSed.
[GitHub] spark issue #22247: [SPARK-25253][PYSPARK] Refactor local connection & auth ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22247 **[Test build #95303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95303/testReport)** for PR 22247 at commit [`c232ec6`](https://github.com/apache/spark/commit/c232ec63f80eea05d3756feb22e53aa5a1e67d93). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22233#discussion_r213138024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand( val value = ExternalCatalogUtils.unescapePathName(ps(1)) if (resolver(columnName, partitionNames.head)) { scanPartitions(spark, fs, filter, st.getPath, spec ++ Map(partitionNames.head -> value), -partitionNames.drop(1), threshold, resolver) +partitionNames.drop(1), threshold, resolver, listFilesInParallel = false) --- End diff -- Does it mean there is no available thread in the given thread pool when a program tries to execute a new `Future`?
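The failure mode asked about above is the classic one: a recursive scan submits nested `Future`s to a fixed-size pool, every pool thread ends up blocked waiting on a child task, and no thread is left to run the children. The `listFilesInParallel = false` change in the diff sidesteps it by parallelizing only the top level and recursing sequentially below. A self-contained Java sketch of that fix pattern (a hypothetical map of directory children stands in for the filesystem):

```java
import java.util.*;
import java.util.concurrent.*;

public class ScanDemo {
    // Hypothetical directory tree: parent -> children.
    static Map<String, List<String>> children = new HashMap<>();

    // Sequential recursion: nested levels never submit to the pool,
    // so a fixed-size pool cannot starve on its own sub-tasks.
    static int scanSequential(String dir) {
        int count = 1; // count this directory
        for (String child : children.getOrDefault(dir, List.of()))
            count += scanSequential(child);
        return count;
    }

    public static void main(String[] args) throws Exception {
        children.put("root", List.of("a", "b"));
        children.put("a", List.of("a1", "a2"));
        children.put("b", List.of("b1"));

        // Parallelism only at the top level: one task per top-level subtree.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Integer>> futures = new ArrayList<>();
        for (String child : children.get("root"))
            futures.add(pool.submit(() -> scanSequential(child)));
        int total = 1; // the root itself
        for (Future<Integer> f : futures) total += f.get();
        pool.shutdown();
        System.out.println(total);
    }
}
```

Scala's `par` collections avoid the same trap differently, via a work-stealing `ForkJoinPool` that can run a blocked task's children on the waiting thread.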
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22188 Normally, we do not backport such improvement PRs. However, the risk of this PR is pretty small. I think it is fine. Let me do this.
[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22233#discussion_r213137139 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand( val value = ExternalCatalogUtils.unescapePathName(ps(1)) if (resolver(columnName, partitionNames.head)) { scanPartitions(spark, fs, filter, st.getPath, spec ++ Map(partitionNames.head -> value), -partitionNames.drop(1), threshold, resolver) +partitionNames.drop(1), threshold, resolver, listFilesInParallel = false) --- End diff -- @MaxGekk could you revert to use Scala `par`?
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #95319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95319/testReport)** for PR 22042 at commit [`ea804cf`](https://github.com/apache/spark/commit/ea804cfe840196519cc9444be9bedf03d10aa11a).
[GitHub] spark issue #22246: [WIP] [SPARK-25235] [SHELL] Merge the REPL code in Scala...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22246 Merged build finished. Test FAILed.
[GitHub] spark issue #22246: [WIP] [SPARK-25235] [SHELL] Merge the REPL code in Scala...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22246 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95304/ Test FAILed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2596/ Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22188 @gatorsmile >Why 2.2 only? Only that I forgot that master is already on 2.4. We should do 2.3 as well, but I haven't tested it yet. Do I need to do anything on my end to get it into 2.2, and once I test, into 2.3?
[GitHub] spark issue #22246: [WIP] [SPARK-25235] [SHELL] Merge the REPL code in Scala...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22246 **[Test build #95304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95304/testReport)** for PR 22246 at commit [`6203f83`](https://github.com/apache/spark/commit/6203f83008950a811b33bba97b99540716d27833). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive field ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22183 For Hive tables, column resolution is always case insensitive. However, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors when they use the native parquet reader to resolve columns in the case sensitive mode. We still introduce behavior changes. Better error messages sound good enough, instead of disabling `spark.sql.hive.convertMetastoreParquet` when the mode is case sensitive.
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213135626 --- Diff: docs/sql-programming-guide.md --- @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above + + - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. --- End diff -- For Hive tables, column resolution is always case insensitive. However, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors when they use the native parquet reader to resolve columns in the case sensitive mode. We still introduce behavior changes. Better error messages sound good enough, instead of disabling `spark.sql.hive.convertMetastoreParquet` when the mode is case sensitive. cc @cloud-fan
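The migration note above describes case-insensitive resolution between the Hive metastore schema and the Parquet schema, with an exception on ambiguity. The core of that behavior fits in a few lines of Java (a hypothetical illustration, not Spark's reader code; a `List` of Parquet column names stands in for the file schema):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ResolveDemo {
    // Case-insensitive lookup of a Hive column among Parquet columns:
    // return the match, null if absent, and fail if more than one
    // Parquet column matches ignoring case.
    static String resolve(List<String> parquetColumns, String hiveName) {
        List<String> matches = parquetColumns.stream()
            .filter(c -> c.equalsIgnoreCase(hiveName))
            .collect(Collectors.toList());
        if (matches.size() > 1)
            throw new RuntimeException("Ambiguous column: " + hiveName);
        return matches.isEmpty() ? null : matches.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolve(List.of("ID", "name"), "id")); // matched despite case
        System.out.println(resolve(List.of("a", "b"), "c"));      // absent column
    }
}
```

The pre-2.3.2 behavior corresponds to an exact (case-sensitive) comparison here, which is why differently-cased columns silently resolved to null.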
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17280 **[Test build #95318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95318/testReport)** for PR 17280 at commit [`733c7ff`](https://github.com/apache/spark/commit/733c7ff70c46f0c54cdf520b44645544b810e04e).
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2595/ Test PASSed.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22188 @bersprockets The risk is pretty small I think. I am fine to backport it to the previous versions. Why 2.2 only? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22193: [SPARK-25186][SQL] Remove v2 save mode.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22193 @HyukjinKwon, those changes probably don't need to be in this PR, but this is just a demonstration that we can remove `SaveMode` without changing test cases. The larger issue is that this doesn't correctly use CTAS or RTAS plans. Instead, it does things like directly deleting data.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22104 **[Test build #95317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95317/testReport)** for PR 22104 at commit [`2325a4f`](https://github.com/apache/spark/commit/2325a4f18a2bc6cc95d96bc5ac6790749b3e927e).
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2594/ Test PASSed.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Merged build finished. Test PASSed.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95316/ Test FAILed.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17280 **[Test build #95316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95316/testReport)** for PR 17280 at commit [`9e2854a`](https://github.com/apache/spark/commit/9e2854a9764b7f7a007d38c3ab89f2e228c0675e). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Merged build finished. Test FAILed.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17280 **[Test build #95316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95316/testReport)** for PR 17280 at commit [`9e2854a`](https://github.com/apache/spark/commit/9e2854a9764b7f7a007d38c3ab89f2e228c0675e).
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2593/ Test PASSed.
[GitHub] spark issue #17280: [SPARK-19939] [ML] Add support for association rules in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17280 Merged build finished. Test PASSed.
[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22208 **[Test build #95315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95315/testReport)** for PR 22208 at commit [`a8a5976`](https://github.com/apache/spark/commit/a8a59760228d4fac54175caeffdfe07faf26a184).
[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22238#discussion_r213129120 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f # Additional Information +**Gotchas** + +- For structured streaming, modifying "spark.sql.shuffle.partitions" is restricted once you run the query. + - This is because state is partitioned via key, hence the number of partitions for state should be unchanged. + - If you want to run fewer tasks for stateful operations, `coalesce` would help with avoiding unnecessary repartitioning. Please note that it will also affect downstream operators. --- End diff -- It just means that the number of partitions in stateful operations' output will be the same as the parameter passed to `coalesce`, and the number of partitions will be kept unless another shuffle happens. It is implicitly the same as `spark.sql.shuffle.partitions`, whose default value is 200. I'll add the code, but I'm not sure we need to have the code per language like Scala / Java / Python tabs since they will be the same.
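The point being made about `coalesce` — it reduces the partition count, and that count persists downstream until another shuffle — can be simulated with a toy sketch. Round-robin grouping is used here only for determinism; Spark's `coalesce` actually groups parent partitions with locality awareness:

```python
def coalesce(partitions, n):
    """Merge existing partitions into n buckets without a full shuffle.

    Records keep their partition-local order; no repartitioning by key
    happens, which is why this is cheaper than a shuffle.
    """
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged
```

Every operator reading the result sees `n` partitions, which is the "affects downstream operators" caveat in the proposed doc text.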
[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2592/ Test PASSed.
[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22208 Merged build finished. Test PASSed.
[GitHub] spark issue #22212: [SPARK-25220] Seperate kubernetes node selector config b...
Github user erikerlandson commented on the issue: https://github.com/apache/spark/pull/22212 I agree there's an argument for keeping this, but an alternative would be to leave the original for backward compatibility, deprecate it, and recommend people make use of custom pod templates (#22146)
[GitHub] spark pull request #22212: [SPARK-25220] Seperate kubernetes node selector c...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/22212#discussion_r213127037 --- Diff: docs/running-on-kubernetes.md --- @@ -663,11 +663,21 @@ specific to Spark on Kubernetes. - spark.kubernetes.node.selector.[labelKey] + spark.kubernetes.driver.selector.[labelKey] --- End diff -- agreed we should keep it, but recommend annotating it as deprecated
[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/2 @xuanyuanking, while this does remove the hack, it doesn't address the underlying problem. The problem is that there is a single RDD, which may contain InternalRow or may contain ColumnarBatch. Generated code knows how to differentiate between the two and use the RDD contents correctly. While this is an improvement because it uses the actual type of records in the RDD, the work that needs to be done is to update the columnar case so that it does return an `RDD[InternalRow]` for anyone that accesses data using that RDD, and then update the generated code to detect a data source RDD and access the underlying `RDD[ColumnarBatch]`. Here's some pseudo-code to demonstrate what I mean. The current code does something like this with a cast. Your change wouldn't fix the need to cast to `RDD[ColumnarBatch]`:

```scala
def doExecute(rdd: DataSourceRDD[InternalRow]) { // with your change, DataSourceRDD[_]
  if (rdd.isColumnar) {
    doExecuteColumnarBatch(rdd.asInstanceOf[RDD[ColumnarBatch]])
  } else {
    doExecuteRows(rdd)
  }
}
```

I think that should be changed to something like this which is type safe:

```scala
def doExecute(rdd: DataSourceRDD[InternalRow]) {
  if (rdd.isColumnar) {
    doExecuteColumnarBatch(rdd.getColumnBatchRDD)
  } else {
    doExecuteRows(rdd)
  }
}
```
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213126158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- Ok, tomorrow I will create a Jira and start working on it. Thanks for your comments! :)
[GitHub] spark pull request #22205: [SPARK-25212][SQL] Support Filter in ConvertToLoc...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/22205#discussion_r213124828 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1349,6 +1353,12 @@ object ConvertToLocalRelation extends Rule[LogicalPlan] { case Limit(IntegerLiteral(limit), LocalRelation(output, data, isStreaming)) => LocalRelation(output, data.take(limit), isStreaming) + +case Filter(condition, LocalRelation(output, data, isStreaming)) +if !hasUnevaluableExpr(condition) => --- End diff -- I suppose it is fine in this case. The only thing is that it violates the contract of the optimizer: it should not change the results of a query.
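The rule under discussion collapses `Filter(LocalRelation)` by evaluating the predicate eagerly at optimization time, producing a smaller `LocalRelation`. A toy Python analogue of that rewrite (the plan classes here are stand-ins, not Catalyst's):

```python
class LocalRelation:
    """Leaf plan node holding its rows in memory."""
    def __init__(self, output, data):
        self.output, self.data = output, data

class Filter:
    """Plan node filtering its child with a predicate (a callable here)."""
    def __init__(self, condition, child):
        self.condition, self.child = condition, child

def convert_to_local_relation(plan):
    """Fold Filter over LocalRelation into a smaller LocalRelation.

    The predicate must be evaluable without running the query — the analogue
    of the rule's guard against unevaluable expressions.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, LocalRelation):
        child = plan.child
        return LocalRelation(child.output,
                             [row for row in child.data if plan.condition(row)])
    return plan  # no match: leave the plan unchanged
```

This also illustrates hvanhovell's point: because the predicate actually runs during optimization, the rule must be careful not to change query results (or raise errors the query itself would not).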
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22236 **[Test build #95314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95314/testReport)** for PR 22236 at commit [`88eb571`](https://github.com/apache/spark/commit/88eb571b732d42138b029ead106f4c8718e1e220).
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22236 Merged build finished. Test PASSed.
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22236 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2591/ Test PASSed.
[GitHub] spark issue #22205: [SPARK-25212][SQL] Support Filter in ConvertToLocalRelat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22205 Yes. Disable this rule for testing only.
[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22238#discussion_r213123711 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections f # Additional Information +**Gotchas** --- End diff -- I was going to add the explanation to `doc()` of `spark.sql.shuffle.partitions`, but it looks like what we explain in `doc()` would not be published automatically. (Please correct me if I'm missing something here.) SQLConf is not even exposed to scaladoc. That's why I'm adding this to the structured streaming guide doc. Actually, I think most end users only look at this doc for structured streaming, and we can't (and shouldn't) expect end users to read the source code to find it. I also didn't notice that `spark.sql.shuffle.partitions` is explained in `sql-programming-guide.md`, but I think we need to explain all configs here if they work differently from batch queries — `spark.sql.shuffle.partitions` is such a case. Btw, `Gotchas` looks funny though. Maybe having a section would be better, like `## Other Configuration Options` in `sql-programming-guide.md`?
[GitHub] spark issue #22205: [SPARK-25212][SQL] Support Filter in ConvertToLocalRelat...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/22205 @gatorsmile what are you afraid of exactly? We could check which tests are affected. Also do you want to disable this for testing only?
[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21977 **[Test build #95313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95313/testReport)** for PR 21977 at commit [`0b275cf`](https://github.com/apache/spark/commit/0b275cfea7d83cdf61802da30c4a7604be8900e4).
[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2590/ Test PASSed.
[GitHub] spark issue #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.memory li...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Merged build finished. Test PASSed.
[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r213122284 --- Diff: docs/configuration.md --- @@ -179,6 +179,15 @@ of the most common options to set are: (e.g. 2g, 8g). + + spark.executor.pyspark.memory + Not set + +The amount of memory to be allocated to PySpark in each executor, in MiB +unless otherwise specified. If set, PySpark memory for an executor will be +limited to this amount. If not set, Spark will not limit Python's memory use. --- End diff -- I've added "and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes."
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213121794 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- I like this idea too
[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21977#discussion_r213121178 --- Diff: docs/configuration.md --- @@ -179,6 +179,15 @@ of the most common options to set are: (e.g. 2g, 8g). + + spark.executor.pyspark.memory + Not set + +The amount of memory to be allocated to PySpark in each executor, in MiB --- End diff -- I've added "When PySpark is run in YARN, this memory is added to executor resource requests."
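As the doc text under review describes, the proposed setting sits alongside the JVM executor memory; a hypothetical `spark-defaults.conf` fragment (the values are illustrative only):

```properties
# JVM heap for each executor
spark.executor.memory          4g
# Cap for the Python worker processes of each executor; per the review
# discussion, on YARN this amount is added to the executor resource request
spark.executor.pyspark.memory  1g
```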
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22104 **[Test build #95312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95312/testReport)** for PR 22104 at commit [`3f0a97a`](https://github.com/apache/spark/commit/3f0a97a89b39d2ad57c587e49bb07203a670faba).
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2589/ Test PASSed.
[GitHub] spark pull request #22206: [SPARK-25213][PYTHON] Add project to v2 scans bef...
Github user rdblue closed the pull request at: https://github.com/apache/spark/pull/22206
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213120096 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- @TomaszGaweda This sounds like a good idea by returning a handler for built-in functions. cc @rxin
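For readers unfamiliar with the SQL expression being wrapped here: `parse_url(url, partToExtract)` extracts a named component from a URL (parts such as `HOST`, `PATH`, `QUERY`). A rough Python analogue using `urllib.parse` — semantics simplified, not the exact Spark/Hive implementation:

```python
from urllib.parse import urlparse, parse_qs

def parse_url(url, part_to_extract, key=None):
    """Approximate the SQL parse_url expression for a few common parts."""
    parsed = urlparse(url)
    parts = {
        "PROTOCOL": parsed.scheme,
        "HOST": parsed.hostname,
        "PATH": parsed.path,
        "QUERY": parsed.query,
        "REF": parsed.fragment,
    }
    # The three-argument form pulls one key out of the query string.
    if part_to_extract == "QUERY" and key is not None:
        values = parse_qs(parsed.query).get(key)
        return values[0] if values else None
    return parts.get(part_to_extract)
```

For example, `parse_url("http://spark.apache.org/path?query=1", "HOST")` yields `spark.apache.org`, which matches the documented behavior of the SQL function.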
[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546 Hey @HyukjinKwon , after going through the previous benchmarks, it seems out-of-order batches had more of an effect on performance than I thought with `toPandas`. The current revision of this PR (which buffers out-of-order batches in the driver JVM) has about a 1.06x - 1.09x speedup, which seems a bit underwhelming after getting ~1.25x when sending out-of-order batches. I still want to try to verify the old numbers and will hopefully get to that tomorrow.
[GitHub] spark issue #22206: [SPARK-25213][PYTHON] Add project to v2 scans before pyt...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/22206 @HyukjinKwon and @viirya, thank you for looking at this commit, but I like @cloud-fan's approach to fixing this in #22244 better than this work-around. I'm going to close this in favor of that approach, although if we need a quick fix I can pick this back up.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Build finished. Test PASSed.
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22236 Merged build finished. Test FAILed.
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22236 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95294/ Test FAILed.
[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22236 **[Test build #95294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95294/testReport)** for PR 22236 at commit [`957a6a2`](https://github.com/apache/spark/commit/957a6a2cf0e05f01c2c2d602944b8da8cfb1b426). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22188: [SPARK-25164][SQL] Avoid rebuilding column and path list...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/22188 @cloud-fan @gatorsmile Should we merge this also onto 2.2? It was a clean cherry-pick for me (from master to branch-2.2), and I ran the top and bottom tests (6000 columns, 1 million rows, 67 32M files, and 60 columns, 100 million rows, 67 32M files) from the PR description and got the same results.
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21638 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95295/ Test PASSed.
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21638 Merged build finished. Test PASSed.
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/7 **[Test build #95311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95311/testReport)** for PR 7 at commit [`4e10733`](https://github.com/apache/spark/commit/4e107337a47ce590c703b757b0a44d60d6b862e1).
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21638 **[Test build #95295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95295/testReport)** for PR 21638 at commit [`5e46efb`](https://github.com/apache/spark/commit/5e46efb5f5ce86297c4aeb23bf934fd9942de3de). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22205: [SPARK-25212][SQL] Support Filter in ConvertToLocalRelat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22205 It would be safer to turn off this rule, since it will skip the actual query execution. Normally, the tests are introduced for testing end-to-end scenarios instead of applying this rule.