[GitHub] [spark] cloud-fan commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r471246654



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,7 +139,21 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]

-  private val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+  // We add executors here when we first get decommission notification for them. Executors can
+  // continue to run even after being asked to decommission, but they will eventually exit.
+  val executorsPendingDecommission = new HashMap[String, ExecutorDecommissionInfo]
+
+  // When they exit and we know of that via heartbeat failure, we will add them to this cache.
+  // This cache is consulted to know if a fetch failure is because a source executor was
+  // decommissioned.
+  lazy val decommissionedExecutorsRemoved = CacheBuilder.newBuilder()
+    .expireAfterWrite(
+      conf.getLong("spark.decommissioningRememberAfterRemoval.seconds", 60L), TimeUnit.SECONDS)

Review comment:
   Not related to this PR, just double-checking: the fetch failure may arrive after the executor is marked as removed, even without the "early exit" feature, right? For example, the RPC framework may have a huge delay. So this is best-effort anyway.
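
The time-bounded cache in the diff can be sketched in isolation. The following is an illustrative reconstruction, not the PR's exact code (only the conf key and field name are taken from the diff above): entries expire a fixed interval after insertion, which is exactly why attributing a late-arriving fetch failure to decommissioning is best-effort.

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.CacheBuilder

object DecommissionCacheSketch {
  private val rememberSeconds = 60L // stand-in for the conf value in the diff

  // Guava cache whose entries disappear 60s after being written.
  val decommissionedExecutorsRemoved = CacheBuilder.newBuilder()
    .expireAfterWrite(rememberSeconds, TimeUnit.SECONDS)
    .build[String, java.lang.Boolean]()

  // Called when a heartbeat failure tells us a decommissioned executor exited.
  def onExecutorRemoved(execId: String): Unit =
    decommissionedExecutorsRemoved.put(execId, java.lang.Boolean.TRUE)

  // Consulted on fetch failure: true only while the entry has not yet expired,
  // so a sufficiently delayed fetch failure will miss -- hence "best effort".
  def wasDecommissioned(execId: String): Boolean =
    decommissionedExecutorsRemoved.getIfPresent(execId) != null
}
```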





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471246355



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment:
   And `FileTable.inferSchema` doesn't take in options.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674671126










[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674671126










[GitHub] [spark] SparkQA removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


SparkQA removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674635457


   **[Test build #127495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).






[GitHub] [spark] SaurabhChawla100 edited a comment on pull request #29413: [SPARK-32597][CORE] Tune Event Drop in Async Event Queue

2020-08-16 Thread GitBox


SaurabhChawla100 edited a comment on pull request #29413:
URL: https://github.com/apache/spark/pull/29413#issuecomment-674664896


   > Yeah so you have described the effects of it dropping the events, which I know. The thing I want to know is why it dropped the events in your cases.
   
   There are scenarios where a stage has a large number of tasks and all of them complete in a very short time, causing the queues to fill up very fast and start dropping events after a while (the executorManagement queue, the appStatus queue, and even the eventLog queue). For us, executorManagement and appStatus matter most, since we have dynamic allocation enabled and most of the jobs run on notebooks (Jupyter/Zeppelin).
   
   > I'm not sure the circumstances you are running or seeing these issues, is this just individuals running their own clusters. If it's a centralized cluster, why not have some cluster defaults and just raise the default values there? This is essentially just doing that.
   
   We are already setting some values in the cluster defaults. But as I said earlier, "There is no fixed size of the Queue which can be used in all the Spark Jobs and even for the Same Spark Job that ran on different input set on daily basis."
   That's the reason we are looking for some dynamism in sizing these queues and some way to reduce the human effort of changing the queue size. IMO, changing the driver memory (increasing or decreasing it) is easier than setting the queue size (and there are multiple queues), and an application failure due to OOM is easier to debug than the impact of event drops, since there are many cases to investigate when events are dropped (application hangs, wasted resources, etc.).
   
   @tgravescs @Ngone51 @itskals 
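
For reference, the static knobs under discussion can already be raised per cluster. A sketch of what that looks like (queue name and sizes are illustrative; the per-queue key is available in recent Spark versions):

```scala
import org.apache.spark.SparkConf

// Raise async event queue capacities at submit time -- the one-off mitigation
// this PR argues is hard to size correctly across jobs.
val conf = new SparkConf()
  // default capacity for all listener bus queues (Spark default: 10000)
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "40000")
  // per-queue override, e.g. for the executorManagement queue
  .set("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "80000")
```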






[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-08-16 Thread GitBox


ulysses-you commented on a change in pull request #28840:
URL: https://github.com/apache/spark/pull/28840#discussion_r471244364



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
##
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
+      throw new AnalysisException(s"Cannot refresh builtin function $functionName")

Review comment:
   get it.








[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674670309


   **[Test build #127495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-08-16 Thread GitBox


ulysses-you commented on a change in pull request #28840:
URL: https://github.com/apache/spark/pull/28840#discussion_r471244264



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
##
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {

Review comment:
   It seems meaningless to refresh a persistent function whose name is the same as a built-in function's.
   
   Yes, we can create a persistent function with the same name as a built-in function, but it is only created in the metastore. The function actually used is the built-in one, because built-in functions are pre-cached in the registry and we look up cached functions first.
   
   e.g., after `CREATE FUNCTION rand AS 'xxx'`, `DESC FUNCTION rand` will always return `Class: org.apache.spark.sql.catalyst.expressions.Rand`.
   
   BTW, maybe that's why we create the function and load it lazily, acting just as a Hive client; otherwise we couldn't create functions like `rand` or `md5` in the metastore. @cloud-fan 
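
The lookup precedence described above is what motivates the guard in the diff. A minimal sketch of that check, following the diff's names (illustrative, not the merged code):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry

// Refuse to refresh a function whose identifier resolves to a built-in:
// the registry's pre-cached built-in always shadows the metastore entry,
// so a "refresh" of it could never have any visible effect.
def checkRefreshable(functionName: String): Unit = {
  if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
    // e.g. REFRESH FUNCTION rand -- 'rand' always resolves to the built-in Rand
    throw new AnalysisException(s"Cannot refresh builtin function $functionName")
  }
}
```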








[GitHub] [spark] cloud-fan commented on a change in pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29422:
URL: https://github.com/apache/spark/pull/29422#discussion_r471243906



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -294,10 +294,15 @@ private[spark] class CoarseGrainedExecutorBackend(
     override def run(): Unit = {
       var lastTaskRunningTime = System.nanoTime()
       val sleep_time = 1000 // 1s
-
+      // This config is internal and only used by unit tests to force an executor
+      // to hang around for longer when decommissioned.
+      val initialSleepMillis = env.conf.getInt(
+        "spark.executor.decommission.initial.sleep.millis", sleep_time)

Review comment:
   For a testing-only conf, let's add the prefix `spark.test`.
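
The suggested rename would read roughly as follows. The exact key is hypothetical until the PR is updated; only the `spark.test.` prefix is the reviewer's point.

```scala
import org.apache.spark.SparkConf

// Hypothetical renamed key with the reviewer's `spark.test.` prefix; the
// default mirrors the 1s sleep used in the diff above.
def initialSleepMillis(conf: SparkConf, defaultMillis: Int = 1000): Int =
  conf.getInt("spark.test.executor.decommission.initial.sleep.millis", defaultMillis)
```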








[GitHub] [spark] imback82 commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


imback82 commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471243651



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment:
   But it's not going to fix v2 datasources since `inferSchema` is called from `FileTable`.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674665654










[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674665654










[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


SparkQA commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674665237


   **[Test build #127498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127498/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).






[GitHub] [spark] SaurabhChawla100 commented on pull request #29413: [SPARK-32597][CORE] Tune Event Drop in Async Event Queue

2020-08-16 Thread GitBox


SaurabhChawla100 commented on pull request #29413:
URL: https://github.com/apache/spark/pull/29413#issuecomment-674664896


   > Yeah so you have described the effects of it dropping the events, which I know. The thing I want to know is why it dropped the events in your cases.
   
   There are scenarios where a stage has a large number of tasks and all of them complete in a very short time, causing the queues to fill up very fast and start dropping events after a while (the executorManagement queue, the appStatus queue, and even the eventLog queue). For us, executorManagement and appStatus matter most, since we have dynamic allocation enabled and most of the jobs run on notebooks (Jupyter/Zeppelin).
   
   > I'm not sure the circumstances you are running or seeing these issues, is this just individuals running their own clusters. If it's a centralized cluster, why not have some cluster defaults and just raise the default values there? This is essentially just doing that.
   
   We are already setting some values in the cluster defaults. But as I said earlier, "There is no fixed size of the Queue which can be used in all the Spark Jobs and even for the Same Spark Job that ran on different input set on daily basis."
   That's the reason we are looking for some dynamism in sizing these queues and some way to reduce the human effort of changing the queue size. Also, IMO, changing the driver memory (increasing or decreasing it) is easier than setting the queue size (and there are multiple queues).
   
   @tgravescs @Ngone51 @itskals 






[GitHub] [spark] Ngone51 edited a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


Ngone51 edited a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674664611


   `org.apache.spark.sql.DataFrameSuite.SPARK-28224: Aggregate sum big decimal overflow` is quite flaky. But I'm sure it can pass on my laptop.






[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


Ngone51 commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674664662


   retest this please.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674663941


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127496/
   Test FAILed.






[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


Ngone51 commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674664611


   `org.apache.spark.sql.DataFrameSuite.SPARK-28224: Aggregate sum big decimal overflow` is pretty flaky. But I'm sure it can pass on my laptop.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674663938


   Merged build finished. Test FAILed.






[GitHub] [spark] SparkQA removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


SparkQA removed a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674637079


   **[Test build #127496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).






[GitHub] [spark] cloud-fan commented on a change in pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29437:
URL: https://github.com/apache/spark/pull/29437#discussion_r471239203



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
##
@@ -64,7 +64,9 @@ abstract class CSVDataSource extends Serializable {
       inputPaths: Seq[FileStatus],
       parsedOptions: CSVOptions): Option[StructType] = {
     if (inputPaths.nonEmpty) {
-      Some(infer(sparkSession, inputPaths, parsedOptions))
+      // Remove "path" option so that it is not added to the given `inputPaths`.

Review comment:
   Can we do it at the caller side of `FileFormat.inferSchema`? Then we don't need to change various file source implementations.








[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674663938










[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


SparkQA commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674663650


   **[Test build #127496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] Ngone51 commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


Ngone51 commented on a change in pull request #29270:
URL: https://github.com/apache/spark/pull/29270#discussion_r471236995



##
File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
##
@@ -256,4 +285,39 @@ trait TPCDSSchema {
         |${options.mkString("\n")}
       """.stripMargin)
   }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    if (injectStats) {
+      // Sets configurations for enabling the optimization rules that
+      // exploit data statistics.
+      conf.setConf(SQLConf.CBO_ENABLED, true)
+      conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+      conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment:
   The configurations have been extracted to the base class(`TPCDSBase`). 
If we change it here, I think it will also take effect for the  
`TPCDSQuerySuite`.








[GitHub] [spark] cloud-fan commented on a change in pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29322:
URL: https://github.com/apache/spark/pull/29322#discussion_r471236515



##
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
##
@@ -453,60 +453,81 @@ class ComplexTypesSuite extends PlanTest with ExpressionEvalHelper {
     checkEvaluation(GetMapValue(mb0, Literal(Array[Byte](3, 4))), null)
   }

-  private val structAttr = 'struct1.struct('a.int)
+  private val structAttr = 'struct1.struct('a.int, 'b.int)
   private val testStructRelation = LocalRelation(structAttr)

-  test("simplify GetStructField on WithFields that is not changing the attribute being extracted") {
-    val query = testStructRelation.select(
-      GetStructField(WithFields('struct1, Seq("b"), Seq(Literal(1))), 0, Some("a")) as "outerAtt")
-    val expected = testStructRelation.select(GetStructField('struct1, 0, Some("a")) as "outerAtt")
-    checkRule(query, expected)
+  test("simplify GetStructField on UpdateFields that is not modifying the attribute being " +
+    "extracted") {
+    // add attribute, extract an attribute from the original struct
+    val query1 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      WithField("b", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, extract an attribute from the original struct
+    val query2 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      DropField("b") :: Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query3 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      DropField("b") :: WithField("c", Literal(2)) :: Nil), 0, None) as "outerAtt")
+    // drop attribute, add attribute, extract an attribute from the original struct
+    val query4 = testStructRelation.select(GetStructField(UpdateFields('struct1,
+      DropField("a") :: WithField("a", Literal(1)) :: Nil), 0, None) as "outerAtt")
+    val expected = testStructRelation.select(GetStructField('struct1, 0, None) as "outerAtt")

Review comment:
   thanks for catching it! I've reverted it.








[GitHub] [spark] cloud-fan commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


cloud-fan commented on a change in pull request #29270:
URL: https://github.com/apache/spark/pull/29270#discussion_r471236200



##
File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
##
@@ -256,4 +285,39 @@ trait TPCDSSchema {
  |${options.mkString("\n")}
""".stripMargin)
   }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+if (injectStats) {
+  // Sets configurations for enabling the optimization rules that
+  // exploit data statistics.
+  conf.setConf(SQLConf.CBO_ENABLED, true)
+  conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+  conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment:
   Let's do it later. The `TPCDSQuerySuite` doesn't test shuffle hash join 
either and we should fix it as well.
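The `TPCDSBase` hunk above saves the original conf values and then overrides them in `beforeAll`; the usual companion is restoring them in `afterAll`. A standalone sketch of that save/override/restore pattern (the `FakeConf` and `StatsSuite` names are illustrative stand-ins, not Spark's actual `SQLConf` API):

```scala
// Minimal stand-in for a mutable conf, mimicking the save/override/restore
// pattern in the TPCDSBase hunk above (not Spark's real SQLConf).
class FakeConf(var cboEnabled: Boolean, var planStatsEnabled: Boolean)

class StatsSuite(conf: FakeConf) {
  // Capture the user's original settings before the suite mutates them.
  private val originalCbo = conf.cboEnabled
  private val originalPlanStats = conf.planStatsEnabled

  def beforeAll(): Unit = {
    // Enable the optimization rules that exploit data statistics.
    conf.cboEnabled = true
    conf.planStatsEnabled = true
  }

  def afterAll(): Unit = {
    // Restore whatever the user had set, so the suite leaves no trace.
    conf.cboEnabled = originalCbo
    conf.planStatsEnabled = originalPlanStats
  }
}
```

Without the restore step, a shared-session test run would leak the overridden values into later suites.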








[GitHub] [spark] Ngone51 commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


Ngone51 commented on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674656885


   > Then I think we don't need the tag.
   
   Ok.






[GitHub] [spark] cloud-fan commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


cloud-fan commented on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674651592


   > The total running time with or without the annotation are both about 2 mins
   
   Then I think we don't need the tag.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29448:
URL: https://github.com/apache/spark/pull/29448#issuecomment-674651219










[GitHub] [spark] AmplabJenkins commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29448:
URL: https://github.com/apache/spark/pull/29448#issuecomment-674651219










[GitHub] [spark] SparkQA commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite

2020-08-16 Thread GitBox


SparkQA commented on pull request #29448:
URL: https://github.com/apache/spark/pull/29448#issuecomment-674650965


   **[Test build #127497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127497/testReport)** for PR 29448 at commit [`69278b2`](https://github.com/apache/spark/commit/69278b29c65f34f073efcf7c6e463b3a20dc).






[GitHub] [spark] gengliangwang edited a comment on pull request #29404: [SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation

2020-08-16 Thread GitBox


gengliangwang edited a comment on pull request #29404:
URL: https://github.com/apache/spark/pull/29404#issuecomment-674650147


   @maropu Thanks for reporting. I have created 
https://github.com/apache/spark/pull/29448 to fix the test failure.






[GitHub] [spark] gengliangwang commented on pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite

2020-08-16 Thread GitBox


gengliangwang commented on pull request #29448:
URL: https://github.com/apache/spark/pull/29448#issuecomment-674650068


   cc @maropu @cloud-fan 






[GitHub] [spark] gengliangwang opened a new pull request #29448: [SPARK-32018][SQL][3.0][FOLLOWUP] Fix a test failure in DataFrameSuite

2020-08-16 Thread GitBox


gengliangwang opened a new pull request #29448:
URL: https://github.com/apache/spark/pull/29448


   
   
   ### What changes were proposed in this pull request?
   
   Fix a test failure in DataFrameSuite introduced by 
https://github.com/apache/spark/pull/29404
   
   ### Why are the changes needed?
   
   Fix test failure
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Unit test






[GitHub] [spark] Udbhav30 closed pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache

2020-08-16 Thread GitBox


Udbhav30 closed pull request #29441:
URL: https://github.com/apache/spark/pull/29441


   






[GitHub] [spark] Udbhav30 commented on a change in pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache

2020-08-16 Thread GitBox


Udbhav30 commented on a change in pull request #29441:
URL: https://github.com/apache/spark/pull/29441#discussion_r471226995



##
File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala
##
@@ -388,11 +388,9 @@ abstract class RDD[T: ClassTag](
   // Block hit.
   case Left(blockResult) =>
 if (readCachedBlock) {
-  val existingMetrics = context.taskMetrics().inputMetrics
-  existingMetrics.incBytesRead(blockResult.bytes)

Review comment:
   Thanks for the information @viirya. I'll close the pr








[GitHub] [spark] gengliangwang commented on pull request #29404: [SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation

2020-08-16 Thread GitBox


gengliangwang commented on pull request #29404:
URL: https://github.com/apache/spark/pull/29404#issuecomment-674650147


   @maropu Thanks for reporting. I have created 
https://github.com/apache/spark/pull/29448.






[GitHub] [spark] LuciferYang commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


LuciferYang commented on a change in pull request #29434:
URL: https://github.com/apache/spark/pull/29434#discussion_r471213921



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##
@@ -314,7 +314,7 @@ trait Row extends Serializable {
*
* @throws ClassCastException when data type does not match.
*/
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment:
   Addressed in [5944512](https://github.com/apache/spark/pull/29434/commits/594451208426eb3b30bcd9a2b4cce74f25621e8e): reverted to using the `Seq` alias as the return type of `Row.getSeq`.
   
   The internal `ArrayType -> scala.collection.Seq` encoding remains unchanged; I think we can change it to `ArrayType -> scala.collection.immutable.Seq` once we no longer need to be compatible with Scala 2.12.
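
For readers following the alias discussion, a minimal standalone sketch of the cross-version difference (illustrative only, not code from the PR):

```scala
object SeqAliasDemo {
  def main(args: Array[String]): Unit = {
    // An Array-backed value is naturally a scala.collection.Seq on both versions.
    val fromArray: scala.collection.Seq[Int] = scala.collection.mutable.ArraySeq(1, 2, 3)

    // On Scala 2.12 the default `Seq` alias is scala.collection.Seq, so
    // `val s: Seq[Int] = fromArray` compiles as-is. On Scala 2.13, `Seq`
    // aliases scala.collection.immutable.Seq, so that line needs an explicit
    // conversion. Converting eagerly works under either alias:
    val s: List[Int] = fromArray.toList
    println(s.sum)
  }
}
```

Declaring `scala.collection.Seq` in the signature, as the diff above does, keeps the method source-compatible with callers compiled against either Scala version.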








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674637326










[GitHub] [spark] AmplabJenkins commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674637326










[GitHub] [spark] SparkQA commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


SparkQA commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674637079


   **[Test build #127496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127496/testReport)** for PR 29395 at commit [`9c18479`](https://github.com/apache/spark/commit/9c18479e0b71c5b6ec1a2a0f268c598cf03fa879).






[GitHub] [spark] LuciferYang commented on a change in pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


LuciferYang commented on a change in pull request #29434:
URL: https://github.com/apache/spark/pull/29434#discussion_r471213921



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
##
@@ -314,7 +314,7 @@ trait Row extends Serializable {
*
* @throws ClassCastException when data type does not match.
*/
-  def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
+  def getSeq[T](i: Int): scala.collection.Seq[T] = getAs[scala.collection.Seq[T]](i)

Review comment:
   Addressed in [5944512](https://github.com/apache/spark/pull/29434/commits/594451208426eb3b30bcd9a2b4cce74f25621e8e): reverted to using the `Seq` alias as the return type of `Row.getSeq`.
   
   The internal `ArrayType -> scala.collection.Seq` encoding remains unchanged.








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674635770










[GitHub] [spark] AmplabJenkins commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674635770










[GitHub] [spark] Ngone51 commented on pull request #29395: [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-16 Thread GitBox


Ngone51 commented on pull request #29395:
URL: https://github.com/apache/spark/pull/29395#issuecomment-674635748


   retest this please.






[GitHub] [spark] SparkQA commented on pull request #29434: [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13

2020-08-16 Thread GitBox


SparkQA commented on pull request #29434:
URL: https://github.com/apache/spark/pull/29434#issuecomment-674635457


   **[Test build #127495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127495/testReport)** for PR 29434 at commit [`5944512`](https://github.com/apache/spark/commit/594451208426eb3b30bcd9a2b4cce74f25621e8e).






[GitHub] [spark] Ngone51 commented on a change in pull request #29225: [SPARK-32287][TESTS] Flaky Test: ExecutorAllocationManagerSuite.add executors default profile

2020-08-16 Thread GitBox


Ngone51 commented on a change in pull request #29225:
URL: https://github.com/apache/spark/pull/29225#discussion_r471211841



##
File path: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala
##
@@ -1603,7 +1603,7 @@ class ExecutorAllocationManagerSuite extends SparkFunSuite {
   .set(config.DYN_ALLOCATION_TESTING, true)
  // SPARK-22864: effectively disable the allocation schedule by setting the period to a
  // really long value.
-  .set(TEST_SCHEDULE_INTERVAL, 1L)
+  .set(TEST_SCHEDULE_INTERVAL, 3L)

Review comment:
   @tgravescs  It happens again at: 
https://github.com/apache/spark/pull/29418/checks?check_run_id=975304054
   
   BTW, this change only prevents the automatic invocation of `schedule()` for the second time, but the first invocation of `schedule()` always happens because `initialDelay` is 0?
   
   
https://github.com/apache/spark/blob/8f0fef18438aa8fb07f5ed885ffad1339992f102/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L250
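
For context on the `initialDelay` point: with `scheduleWithFixedDelay(task, 0L, interval, unit)` the first run fires immediately no matter how large the interval is. A self-contained sketch using a plain `ScheduledExecutorService` (not Spark code):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object InitialDelayDemo {
  def main(args: Array[String]): Unit = {
    val calls = new AtomicInteger(0)
    val pool = Executors.newSingleThreadScheduledExecutor()
    // initialDelay = 0, delay = very long: the task still runs once right away,
    // mirroring why schedule() is invoked a first time even when
    // TEST_SCHEDULE_INTERVAL is set to a huge value.
    pool.scheduleWithFixedDelay(
      new Runnable { def run(): Unit = calls.incrementAndGet() },
      0L, 100000L, TimeUnit.SECONDS)
    Thread.sleep(500) // give the immediate first run time to happen
    println(calls.get()) // 1: first run fired, the next is ~100000s away
    pool.shutdownNow()
  }
}
```

So raising the interval only suppresses the second and later invocations; avoiding the first one would require a non-zero initial delay.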








[GitHub] [spark] wangyum commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

2020-08-16 Thread GitBox


wangyum commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-674633117


   Thank you all. There is a workaround:
   
   
![image](https://user-images.githubusercontent.com/5399861/90353884-22a39800-e07a-11ea-8a7b-37ddb1689fd6.png)
   






[GitHub] [spark] wangyum closed pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

2020-08-16 Thread GitBox


wangyum closed pull request #29088:
URL: https://github.com/apache/spark/pull/29088


   






[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471209639



##
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
##
@@ -106,13 +109,28 @@ class SparkSessionExtensions {
 columnarRuleBuilders.map(_.apply(session)).toSeq
   }
 
+  /**
+   * Build the override rules for the query stage preparation phase of adaptive query execution.
+   */
+  private[sql] def buildQueryStagePrepRules(session: SparkSession): Seq[Rule[SparkPlan]] = {
+queryStagePrepRuleBuilders.map(_.apply(session)).toSeq
+  }
+
   /**
* Inject a rule that can override the columnar execution of an executor.
*/
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
 columnarRuleBuilders += builder
   }
 
+  /**
+   * Inject a rule that can override the the query stage preparation phase of adaptive query

Review comment:
   Nit: the the => the 








[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471209521



##
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
##
@@ -96,8 +97,10 @@ class SparkSessionExtensions {
   type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface
   type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
+  type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]

Review comment:
   This is a public API. I think we also need to add version information.
   
   








[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471209398



##
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
##
@@ -96,8 +96,10 @@ class SparkSessionExtensions {
   type ParserBuilder = (SparkSession, ParserInterface) => ParserInterface
   type FunctionDescription = (FunctionIdentifier, ExpressionInfo, FunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
+  type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]

Review comment:
   This is a public API. I think we also need to add version information.








[GitHub] [spark] Ngone51 commented on pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


Ngone51 commented on pull request #29270:
URL: https://github.com/apache/spark/pull/29270#issuecomment-674631270


   > Have you checked the running time? 
https://github.com/apache/spark/pull/29270/files#r469675879
   
   I may not fully understand the usage of `@ExtendedSQLTest`. The total running time with or without the annotation is about 2 minutes in both cases (I used the same command). We may not need it? @maropu 






[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471208359



##
File path: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala
##
@@ -73,7 +74,8 @@ private[sql] class SessionState(
 resourceLoaderBuilder: () => SessionResourceLoader,
 createQueryExecution: LogicalPlan => QueryExecution,
 createClone: (SparkSession, SessionState) => SessionState,
-val columnarRules: Seq[ColumnarRule]) {
+val columnarRules: Seq[ColumnarRule],
+val queryStagePrepRules: Seq[Rule[SparkPlan]]) {

Review comment:
   SessionState is a critical concept/class in Spark SQL. Adding 
`queryStagePrepRules` into SessionState looks weird to me. WDYT?
   
   cc @hvanhovell 








[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471207785



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -90,7 +90,7 @@ case class AdaptiveSparkPlanExec(
   // Exchange nodes) after running these rules.
   private def queryStagePreparationRules: Seq[Rule[SparkPlan]] = Seq(

Review comment:
   It sounds like this function should be moved out of the physical node 
`AdaptiveSparkPlanExec`? cc @maryannxue @cloud-fan 








[GitHub] [spark] gatorsmile commented on a change in pull request #29224: [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29224:
URL: https://github.com/apache/spark/pull/29224#discussion_r471207785



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
##
@@ -90,7 +90,7 @@ case class AdaptiveSparkPlanExec(
   // Exchange nodes) after running these rules.
   private def queryStagePreparationRules: Seq[Rule[SparkPlan]] = Seq(

Review comment:
   It sounds like these rules should be moved out of the physical node 
`AdaptiveSparkPlanExec`? cc @maryannxue @cloud-fan 








[GitHub] [spark] Ngone51 commented on a change in pull request #29270: [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression

2020-08-16 Thread GitBox


Ngone51 commented on a change in pull request #29270:
URL: https://github.com/apache/spark/pull/29270#discussion_r471203868



##
File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
##
@@ -256,4 +285,39 @@ trait TPCDSSchema {
  |${options.mkString("\n")}
""".stripMargin)
   }
+
+  private val originalCBCEnabled = conf.cboEnabled
+  private val originalPlanStatsEnabled = conf.planStatsEnabled
+  private val originalJoinReorderEnabled = conf.joinReorderEnabled
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+if (injectStats) {
+  // Sets configurations for enabling the optimization rules that
+  // exploit data statistics.
+  conf.setConf(SQLConf.CBO_ENABLED, true)
+  conf.setConf(SQLConf.PLAN_STATS_ENABLED, true)
+  conf.setConf(SQLConf.JOIN_REORDER_ENABLED, true)

Review comment:
   Hmm, I'm not sure about this, but it seems `preferSortMergeJoin` is enabled by default? Or do you mean we should test with `preferSortMergeJoin` both enabled and disabled?








[GitHub] [spark] Karl-WangSK commented on pull request #29360: [SPARK-32542][SQL] Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread GitBox


Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-674625909


   Hi, could anyone take a look at this PR?






[GitHub] [spark] fqaiser94 commented on a change in pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class

2020-08-16 Thread GitBox


fqaiser94 commented on a change in pull request #29322:
URL: https://github.com/apache/spark/pull/29322#discussion_r471198235



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
##
@@ -453,60 +453,81 @@ class ComplexTypesSuite extends PlanTest with 
ExpressionEvalHelper {
 checkEvaluation(GetMapValue(mb0, Literal(Array[Byte](3, 4))), null)
   }
 
-  private val structAttr = 'struct1.struct('a.int)
+  private val structAttr = 'struct1.struct('a.int, 'b.int)
   private val testStructRelation = LocalRelation(structAttr)
 
-  test("simplify GetStructField on WithFields that is not changing the 
attribute being extracted") {
-val query = testStructRelation.select(
-  GetStructField(WithFields('struct1, Seq("b"), Seq(Literal(1))), 0, 
Some("a")) as "outerAtt")
-val expected = testStructRelation.select(GetStructField('struct1, 0, 
Some("a")) as "outerAtt")
-checkRule(query, expected)
+  test("simplify GetStructField on UpdateFields that is not modifying the 
attribute being " +
+"extracted") {
+// add attribute, extract an attribute from the original struct
+val query1 = 
testStructRelation.select(GetStructField(UpdateFields('struct1,
+  WithField("b", Literal(1)) :: Nil), 0, None) as "outerAtt")
+// drop attribute, extract an attribute from the original struct
+val query2 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") 
::
+  Nil), 0, None) as "outerAtt")
+// drop attribute, add attribute, extract an attribute from the original 
struct
+val query3 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") 
::
+  WithField("c", Literal(2)) :: Nil), 0, None) as "outerAtt")
+// drop attribute, add attribute, extract an attribute from the original 
struct
+val query4 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("a") 
::
+  WithField("a", Literal(1)) :: Nil), 0, None) as "outerAtt")
+val expected = testStructRelation.select(GetStructField('struct1, 0, None) 
as "outerAtt")

Review comment:
   @cloud-fan I'm afraid I have to recommend we revert the changes in this 
PR from master. 
   There is another correctness issue I've discovered which exists even in our 
initial `withField` implementation:
   ```
   sql("SELECT CAST(NULL AS struct) struct_col")
   .select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
   
   // currently returns this
   +---+
   |d  |
   +---+
   |4  |
   +---+
   
   // when in fact it should return this: 
   ++
   |d   |
   ++
   |null|
   ++
   ```
   I'd like to get `withField` right before coming back to `dropFields`. 
   If that sounds good to you, here is the order of PRs I propose: 
   1. Revert `dropFields` changes immediately
   2. Fix the above issue in `withField` (the real issue is in the optimizer 
rule) through another PR asap
   3. Figure out how to implement `dropFields` in another PR in the future. I 
think I have to go back to the drawing board for this one. The combination of 
requirements (return null for a null input struct, allow nested API calls, 
allow users to arbitrarily mix `dropFields`, `withField`, and `getField` in 
the same query, and still generate reasonably well-optimized physical plans on 
a timely basis) is making for a slightly trickier programming exercise than I 
anticipated and needs a more robust testing strategy.
   
   Please let me know your thoughts. 
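   The expected null semantics in the comment above can be modeled with a tiny 
self-contained sketch (plain Scala, with `Option` standing in for a nullable 
struct column; `S`, `withField`, and `getField` here are a hypothetical model, 
not Spark's actual `Column` API): adding a field to a null struct should leave 
it null, so extracting the added field must yield null, not the added value.

```scala
// Hypothetical model: Option[S] plays the role of a nullable struct column.
case class S(fields: Map[String, Int])

// Adding a field to a null struct must leave it null (None stays None).
def withField(s: Option[S], name: String, v: Int): Option[S] =
  s.map(old => S(old.fields + (name -> v)))

def getField(s: Option[S], name: String): Option[Int] =
  s.flatMap(_.fields.get(name))

// Null input struct: the query above should return null, not 4.
println(getField(withField(None, "d", 4), "d"))                   // None
// Non-null input struct: the added field is visible as usual.
println(getField(withField(Some(S(Map("a" -> 1))), "d", 4), "d")) // Some(4)
```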





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] fqaiser94 commented on a change in pull request #29322: [SPARK-32511][SQL] Add dropFields method to Column class

2020-08-16 Thread GitBox


fqaiser94 commented on a change in pull request #29322:
URL: https://github.com/apache/spark/pull/29322#discussion_r471198235



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
##
@@ -453,60 +453,81 @@ class ComplexTypesSuite extends PlanTest with 
ExpressionEvalHelper {
 checkEvaluation(GetMapValue(mb0, Literal(Array[Byte](3, 4))), null)
   }
 
-  private val structAttr = 'struct1.struct('a.int)
+  private val structAttr = 'struct1.struct('a.int, 'b.int)
   private val testStructRelation = LocalRelation(structAttr)
 
-  test("simplify GetStructField on WithFields that is not changing the 
attribute being extracted") {
-val query = testStructRelation.select(
-  GetStructField(WithFields('struct1, Seq("b"), Seq(Literal(1))), 0, 
Some("a")) as "outerAtt")
-val expected = testStructRelation.select(GetStructField('struct1, 0, 
Some("a")) as "outerAtt")
-checkRule(query, expected)
+  test("simplify GetStructField on UpdateFields that is not modifying the 
attribute being " +
+"extracted") {
+// add attribute, extract an attribute from the original struct
+val query1 = 
testStructRelation.select(GetStructField(UpdateFields('struct1,
+  WithField("b", Literal(1)) :: Nil), 0, None) as "outerAtt")
+// drop attribute, extract an attribute from the original struct
+val query2 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") 
::
+  Nil), 0, None) as "outerAtt")
+// drop attribute, add attribute, extract an attribute from the original 
struct
+val query3 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("b") 
::
+  WithField("c", Literal(2)) :: Nil), 0, None) as "outerAtt")
+// drop attribute, add attribute, extract an attribute from the original 
struct
+val query4 = 
testStructRelation.select(GetStructField(UpdateFields('struct1, DropField("a") 
::
+  WithField("a", Literal(1)) :: Nil), 0, None) as "outerAtt")
+val expected = testStructRelation.select(GetStructField('struct1, 0, None) 
as "outerAtt")

Review comment:
   @cloud-fan I'm afraid I have to recommend we revert the changes in this 
PR from master. 
   There is another correctness issue I've discovered which exists even in our 
initial `withField` implementation:
   ```
   // The following query
   sql("SELECT CAST(NULL AS struct) struct_col")
   .select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
   
   // currently returns this
   +---+
   |d  |
   +---+
   |4  |
   +---+
   
   // when in fact it should return this: 
   ++
   |d   |
   ++
   |null|
   ++
   ```
   I'd like to get `withField` right before coming back to `dropFields`. 
   If that sounds good to you, here is the order of PRs I propose: 
   1. Revert `dropFields` changes immediately
   2. Fix the above issue in `withField` (the real issue is in the optimizer 
rule) through another PR asap
   3. Figure out how to implement `dropFields` in another PR in the future. I 
think I have to go back to the drawing board for this one. The combination of 
requirements (return null for a null input struct, allow nested API calls, 
allow users to mix `dropFields`, `withField`, and `getField` in the same 
query, and still generate reasonably well-optimized physical plans on a timely 
basis) is making for a slightly trickier programming exercise than I 
anticipated and needs a more robust testing strategy.
   
   Please let me know your thoughts. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brandonJY edited a comment on pull request #29443: [MINOR][DOCS] fix typo in couple docs

2020-08-16 Thread GitBox


brandonJY edited a comment on pull request #29443:
URL: https://github.com/apache/spark/pull/29443#issuecomment-674618402


   > Ah, I see. I've checked the other PR; are there only three typos in the 
codebase?
   
   No. For the codebase typos, that is not the complete list (I don't have it 
yet). And it might result in massive merge conflicts with other PRs if we pack 
all the code typo corrections into one PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brandonJY edited a comment on pull request #29443: [MINOR][DOCS] fix typo in couple docs

2020-08-16 Thread GitBox


brandonJY edited a comment on pull request #29443:
URL: https://github.com/apache/spark/pull/29443#issuecomment-674618402


   > Ah, I see. I've checked the other PR; are there only three typos in the 
codebase?
   
   No. For the codebase typos, that is not the complete list. It might result 
in massive merge conflicts with other PRs if we pack all the code typo 
corrections into one PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] agrawaldevesh commented on pull request #29422: [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite

2020-08-16 Thread GitBox


agrawaldevesh commented on pull request #29422:
URL: https://github.com/apache/spark/pull/29422#issuecomment-674618506


   > PySpark failure is likely unrelated. I'm going to go ahead and said LGTM 
but let's give it until Monday to see if anyone else wants to review.
   
   Sounds good. Thanks @holdenk !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brandonJY commented on pull request #29443: [MINOR][DOCS] fix typo in couple docs

2020-08-16 Thread GitBox


brandonJY commented on pull request #29443:
URL: https://github.com/apache/spark/pull/29443#issuecomment-674618402


   > Ah, I see. I've checked the other PR; are there only three typos in the 
codebase?
   
   No. For the codebase typos, that is not the complete list. Might it result 
in massive merge conflicts with other PRs?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29447:
URL: https://github.com/apache/spark/pull/29447#issuecomment-674618061


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29447:
URL: https://github.com/apache/spark/pull/29447#issuecomment-674618302


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] leanken commented on pull request #29431: [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-16 Thread GitBox


leanken commented on pull request #29431:
URL: https://github.com/apache/spark/pull/29431#issuecomment-674618015


   @cloud-fan Tests passed, ready to merge.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29447:
URL: https://github.com/apache/spark/pull/29447#issuecomment-674618061


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu opened a new pull request #29447: [SPARK-32631][SQL] Handle Null error message in hive ThriftServer UI

2020-08-16 Thread GitBox


tianhanhu opened a new pull request #29447:
URL: https://github.com/apache/spark/pull/29447


   ### What changes were proposed in this pull request?
   This fix prevents a NullPointerException caused by a null error message by 
wrapping the message in an Option and using getOrElse to handle the null 
value.
   
   The error message can be null because Java's Throwable allows the 
detailMessage to be null. However, the render code assumes that the error 
message is never null and calls indexOf() on the string. Therefore, if an 
exception doesn't set its error message, it triggers a NullPointerException in 
the Hive ThriftServer UI.
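   A minimal sketch of the idea (a hypothetical helper, not the actual 
ThriftServer UI render code): wrapping the possibly-null message in `Option` 
before calling `indexOf` makes the null case safe.

```scala
// Hypothetical render helper. Calling indexOf directly on the detail message
// throws a NullPointerException when Throwable's detailMessage is null;
// Option(null) evaluates to None, so getOrElse supplies a safe default.
def errorSummary(detail: String): String = {
  val msg = Option(detail).getOrElse("")
  val idx = msg.indexOf('\n')
  if (idx >= 0) msg.substring(0, idx) else msg
}

println(errorSummary(null))                // blank line, instead of an NPE
println(errorSummary("boom\nstack trace")) // boom
```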
   
   
   ### Why are the changes needed?
   To prevent a NullPointerException caused by a null error message.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Users would see a blank message instead of a NullPointerException.
   
   
   ### How was this patch tested?
   Manual testing.
   
   For normal behavior, this change does not alter anything.
   
![37435401-658510a8-27a0-11e8-8f8b-00271fd3ef58](https://user-images.githubusercontent.com/30429546/90350571-93f83200-e00b-11ea-9ff3-3b315d2114cc.png)
   
   To test the null case, we manually set detail to be null in code. In this 
case,
   
![37435498-cebc8182-27a0-11e8-80e7-40df2a65612b](https://user-images.githubusercontent.com/30429546/90350601-ac684c80-e00b-11ea-8375-1911f5852bd5.png)
   The UI shows a blank message.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674615464







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674615464







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29396: [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-16 Thread GitBox


SparkQA commented on pull request #29396:
URL: https://github.com/apache/spark/pull/29396#issuecomment-674615193


   **[Test build #127494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127494/testReport)**
 for PR 29396 at commit 
[`dc8c28c`](https://github.com/apache/spark/commit/dc8c28c134c9aa311ac6c76be407e10d764755e6).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674611954







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


SparkQA removed a comment on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674608241


   **[Test build #127493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)**
 for PR 29445 at commit 
[`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


SparkQA commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674611838


   **[Test build #127493 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)**
 for PR 29445 at commit 
[`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674611954







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


viirya commented on a change in pull request #29445:
URL: https://github.com/apache/spark/pull/29445#discussion_r471189589



##
File path: python/pyspark/ml/tuning.py
##
@@ -536,7 +536,7 @@ def copy(self, extra=None):
 bestModel = self.bestModel.copy(extra)
 avgMetrics = self.avgMetrics
 subModels = self.subModels
-return CrossValidatorModel(bestModel, avgMetrics, subModels)
+return self._copyValues(CrossValidatorModel(bestModel, avgMetrics, 
subModels), extra=extra)

Review comment:
   Should we use `Params.copy` like `CrossValidator`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674608573







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674608573







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


SparkQA commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674608241


   **[Test build #127493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127493/testReport)**
 for PR 29445 at commit 
[`b662ee0`](https://github.com/apache/spark/commit/b662ee0a434bb37d99df1546b3f8f94e01b1f610).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674607134







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


viirya commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674607222


   ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674607134







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


huaxingao commented on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674607031


   ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29445: [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write()

2020-08-16 Thread GitBox


AmplabJenkins removed a comment on pull request #29445:
URL: https://github.com/apache/spark/pull/29445#issuecomment-674530895


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #29441: [SPARK-32626][CORE] Do not increase the input metrics when read rdd from cache

2020-08-16 Thread GitBox


maropu commented on a change in pull request #29441:
URL: https://github.com/apache/spark/pull/29441#discussion_r471172827



##
File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala
##
@@ -388,11 +388,9 @@ abstract class RDD[T: ClassTag](
   // Block hit.
   case Left(blockResult) =>
 if (readCachedBlock) {
-  val existingMetrics = context.taskMetrics().inputMetrics
-  existingMetrics.incBytesRead(blockResult.bytes)

Review comment:
   +1 on the @viirya comment.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #28840:
URL: https://github.com/apache/spark/pull/28840#discussion_r471172044



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
##
@@ -3030,6 +3030,49 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 }
   }
+
+  test("REFRESH FUNCTION") {
+val msg = intercept[AnalysisException] {
+  sql("REFRESH FUNCTION md5")
+}.getMessage
+assert(msg.contains("Cannot refresh builtin function"))
+
+withUserDefinedFunction("func1" -> true) {
+  sql("CREATE TEMPORARY FUNCTION func1 AS 
'test.org.apache.spark.sql.MyDoubleAvg'")
+  val msg = intercept[AnalysisException] {
+sql("REFRESH FUNCTION func1")
+  }.getMessage
+  assert(msg.contains("Cannot refresh temporary function"))
+}
+
+withUserDefinedFunction("func1" -> false) {
+  intercept[NoSuchFunctionException] {
+sql("REFRESH FUNCTION func1")
+  }
+
+  val func = FunctionIdentifier("func1", Some("default"))
+  sql("CREATE FUNCTION func1 AS 'test.org.apache.spark.sql.MyDoubleAvg'")
+  assert(!spark.sessionState.catalog.isRegisteredFunction(func))
+  sql("REFRESH FUNCTION func1")

Review comment:
   This is the only positive test case. Could you think about it more and 
try to cover additional cases?
   
   






[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #28840:
URL: https://github.com/apache/spark/pull/28840#discussion_r471172006



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
##
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }
 
+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
+      throw new AnalysisException(s"Cannot refresh builtin function $functionName")

Review comment:
   Nit: built-in






[GitHub] [spark] gatorsmile commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #28840:
URL: https://github.com/apache/spark/pull/28840#discussion_r471171963



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala
##
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }
 
+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {

Review comment:
   We can still create a persistent function with the same name as a built-in 
function. For example:
   
   ```SQL
   CREATE FUNCTION rand AS 'org.apache.spark.sql.catalyst.expressions.Abs'
   DESC FUNCTION default.rand
   ```
   
   I think we should still allow this case. 
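   To make the name-shadowing point concrete, here is a toy sketch (a hypothetical `FunctionResolver`, not Spark's actual catalog code): an unqualified name resolves against built-ins first, while a database-qualified name such as `default.rand` bypasses the built-in registry and reaches the persistent function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class FunctionResolver {
    // Toy model: built-in functions are keyed by bare name, persistent
    // functions by "db.name". Not Spark's real implementation.
    private final Map<String, String> builtins = new HashMap<>();
    private final Map<String, String> catalog = new HashMap<>();

    void registerBuiltin(String name, String impl) {
        builtins.put(name, impl);
    }

    void createPersistent(String db, String name, String impl) {
        catalog.put(db + "." + name, impl);
    }

    // Unqualified names (db == null) hit the built-in registry first;
    // qualified names go straight to the persistent catalog.
    Optional<String> resolve(String db, String name) {
        if (db == null && builtins.containsKey(name)) {
            return Optional.of(builtins.get(name));
        }
        String key = (db == null ? "default" : db) + "." + name;
        return Optional.ofNullable(catalog.get(key));
    }

    public static void main(String[] args) {
        FunctionResolver r = new FunctionResolver();
        r.registerBuiltin("rand", "builtin.Rand");
        r.createPersistent("default", "rand", "expressions.Abs");
        System.out.println(r.resolve(null, "rand").get());      // builtin.Rand
        System.out.println(r.resolve("default", "rand").get()); // expressions.Abs
    }
}
```

   Under this model, refusing `REFRESH FUNCTION` solely because a built-in with the same bare name exists would wrongly block refreshing the persistent `default.rand`.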






[GitHub] [spark] c21 commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-16 Thread GitBox


c21 commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674590117


   Thank you @maropu, @cloud-fan, @agrawaldevesh, and @viirya for all the 
insightful discussion and reviews!




[GitHub] [spark] maropu commented on pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-16 Thread GitBox


maropu commented on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-674589592


   Thanks, @c21, for the great improvement! Merged to master.




[GitHub] [spark] maropu closed pull request #29342: [SPARK-32399][SQL] Full outer shuffled hash join

2020-08-16 Thread GitBox


maropu closed pull request #29342:
URL: https://github.com/apache/spark/pull/29342


   




[GitHub] [spark] SparkQA commented on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-16 Thread GitBox


SparkQA commented on pull request #29366:
URL: https://github.com/apache/spark/pull/29366#issuecomment-674584683


   **[Test build #127491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127491/testReport)** for PR 29366 at commit [`6c878ba`](https://github.com/apache/spark/commit/6c878ba62abdb6255b842d59201d712f4c5eadf3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #29366: [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps

2020-08-16 Thread GitBox


SparkQA removed a comment on pull request #29366:
URL: https://github.com/apache/spark/pull/29366#issuecomment-674555103


   **[Test build #127491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127491/testReport)** for PR 29366 at commit [`6c878ba`](https://github.com/apache/spark/commit/6c878ba62abdb6255b842d59201d712f4c5eadf3).




[GitHub] [spark] gatorsmile commented on a change in pull request #29425: [SPARK-32350][FOLLOW-UP] Divide the value list into a set of small batches when bulk writing to LevelDB

2020-08-16 Thread GitBox


gatorsmile commented on a change in pull request #29425:
URL: https://github.com/apache/spark/pull/29425#discussion_r471163708



##
File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java
##
@@ -164,35 +165,39 @@ public void writeAll(List<?> values) throws Exception {
     Preconditions.checkArgument(values != null && !values.isEmpty(),
       "Non-empty values required.");
 
-    // Group by class, in case there are values from different classes in the values
+    // Group by class, in case there are values from different classes in the values.
     // Typical usecase is for this to be a single class.
     // A NullPointerException will be thrown if values contain null object.
     for (Map.Entry<Class<?>, ? extends List<?>> entry :
         values.stream().collect(Collectors.groupingBy(Object::getClass)).entrySet()) {
-
-      final Iterator<?> valueIter = entry.getValue().iterator();
-      final Iterator<byte[]> serializedValueIter;
-
-      // Deserialize outside synchronized block
-      List<byte[]> list = new ArrayList<>(entry.getValue().size());
-      for (Object value : values) {
-        list.add(serializer.serialize(value));
-      }
-      serializedValueIter = list.iterator();
-
       final Class<?> klass = entry.getKey();
-      final LevelDBTypeInfo ti = getTypeInfo(klass);
 
-      synchronized (ti) {
-        final LevelDBTypeInfo.Index naturalIndex = ti.naturalIndex();
-        final Collection<LevelDBTypeInfo.Index> indices = ti.indices();
+      // Partition the value list to a set of the 128-values batches. It can reduce the
+      // memory pressure caused by serialization and give fairness to other writing threads
+      // when writing a very large list.
+      for (List<?> batchList : Iterables.partition(entry.getValue(), 128)) {

Review comment:
   I left a comment in 
https://github.com/apache/spark/pull/28412#pullrequestreview-468098043. I think 
we should add unit test cases to verify the code works as designed. We might 
need to update the implementation to provide private APIs for testing.
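   The fixed-size batching that `Iterables.partition(entry.getValue(), 128)` performs in the diff above can be sketched in plain Java without Guava (the `BatchPartition` class and the batch size here are illustrative, not Spark's API):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPartition {
    // Split a list into consecutive batches of at most `size` elements.
    // Each batch is a subList view, so no elements are copied; callers
    // can serialize and write one batch at a time instead of holding
    // every serialized value in memory at once.
    static <T> List<List<T>> partition(List<T> values, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < values.size(); i += size) {
            batches.add(values.subList(i, Math.min(i + size, values.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            values.add(i);
        }
        List<List<Integer>> batches = partition(values, 128);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 44
    }
}
```

   With 300 values and a batch size of 128, the write loop runs three times (128 + 128 + 44 elements), so the synchronized section is entered per batch rather than once per whole list, which is the fairness point made in the diff's comment.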






[GitHub] [spark] maropu edited a comment on pull request #29439: [SPARK-32624][SQL] Use getCanonicalName to fix byte[] compile issue

2020-08-16 Thread GitBox


maropu edited a comment on pull request #29439:
URL: https://github.com/apache/spark/pull/29439#issuecomment-674526844


   Can users hit this issue? If you have a query to reproduce it, I think it's 
better to add end-to-end tests, too.




[GitHub] [spark] imback82 commented on pull request #29437: [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources

2020-08-16 Thread GitBox


imback82 commented on pull request #29437:
URL: https://github.com/apache/spark/pull/29437#issuecomment-674582324


   cc @cloud-fan @viirya 



