[GitHub] peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] Support recursive SQL query

2019-01-27 Thread GitBox
peter-toth commented on a change in pull request #23531: [SPARK-24497][SQL] 
Support recursive SQL query
URL: https://github.com/apache/spark/pull/23531#discussion_r251307934
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ##
 @@ -1530,6 +1650,18 @@ class Analyzer(
   object GlobalAggregates extends Rule[LogicalPlan] {
 def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
   case Project(projectList, child) if containsAggregates(projectList) =>
+
+def traversePlanAndCheck(plan: LogicalPlan): Unit =
 
 Review comment:
   Ok, then I will try to move all the checks into that rule.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] httfighter commented on issue #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …

2019-01-27 Thread GitBox
httfighter commented on issue #23649: [SPARK-26726] The amount of memory used 
by the broadcast variable is …
URL: https://github.com/apache/spark/pull/23649#issuecomment-458029276
 
 
   @vanzin Thank you for your review! I have added the test case.








[GitHub] httfighter commented on a change in pull request #23649: [SPARK-26726] The amount of memory used by the broadcast variable is …

2019-01-27 Thread GitBox
httfighter commented on a change in pull request #23649: [SPARK-26726] The 
amount of memory used by the broadcast variable is …
URL: https://github.com/apache/spark/pull/23649#discussion_r251307769
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala
 ##
 @@ -995,6 +996,30 @@ private[spark] class AppStatusListener(
 }
   }
 
+  def updateBroadcastBlock(event: SparkListenerBlockUpdated, broadcast: BroadcastBlockId): Unit = {
 
 Review comment:
   Thank you for your review! I have submitted a new change.





[GitHub] felixcheung commented on issue #23514: [SPARK-24902][K8s] Add PV integration tests

2019-01-27 Thread GitBox
felixcheung commented on issue #23514: [SPARK-24902][K8s] Add PV integration 
tests
URL: https://github.com/apache/spark/pull/23514#issuecomment-458027865
 
 
   @shaneknapp ^^





[GitHub] felixcheung commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-01-27 Thread GitBox
felixcheung commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458026691
 
 
   probably a good idea - arrow moves quickly; 0.10 is kinda "dated"





[GitHub] AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes 
GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023244
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7479/
   Test PASSed.





[GitHub] AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue 
nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023240
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23669: [SPARK-26747][SQL] Makes 
GetMapValue nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023240
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue 
nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458023244
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7479/
   Test PASSed.





[GitHub] srowen commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
srowen commented on a change in pull request #23580: [SPARK-26660]Add warning 
logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#discussion_r251302189
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
+  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {
 
 Review comment:
   Yeah, I should have looked at the test logs; my fault.
   1000 = KB, 1024 = KiB. This should be correct unless the constant is misnamed.





[GitHub] SparkQA commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
SparkQA commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue 
nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458022899
 
 
   **[Test build #101748 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101748/testReport)**
 for PR 23669 at commit 
[`ddd3b56`](https://github.com/apache/spark/commit/ddd3b56421d170bef6f5de67f4049e6a28bba6a1).





[GitHub] maropu commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
maropu commented on issue #23669: [SPARK-26747][SQL] Makes GetMapValue 
nullability more precise
URL: https://github.com/apache/spark/pull/23669#issuecomment-458022193
 
 
   It seems `ElementAt` could have the same fix. This change is trivial, so I 
think it might be better to include an `ElementAt` fix in this PR, too.





[GitHub] maropu opened a new pull request #23669: [SPARK-26747][SQL] Makes GetMapValue nullability more precise

2019-01-27 Thread GitBox
maropu opened a new pull request #23669: [SPARK-26747][SQL] Makes GetMapValue 
nullability more precise
URL: https://github.com/apache/spark/pull/23669
 
 
   ## What changes were proposed in this pull request?
   In master, `GetMapValue` nullability is always true;
   
https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371
   
   But if the input key is foldable, we could make its nullability more precise.
   This fix is the same as SPARK-26637 (#23566).
   
   ## How was this patch tested?
   Added tests in `ComplexTypeSuite`.
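   The foldable-key idea can be illustrated with a standalone sketch 
(hypothetical names, not Spark's actual `GetMapValue` code): when the lookup 
key is a constant and known to be present, nullability can follow the map's 
value nullability instead of defaulting to true.

```scala
// Hypothetical sketch of precise nullability for a constant map lookup.
// LiteralMap and mapValueNullable are illustrative, not Catalyst classes.
case class LiteralMap(entries: Map[String, Int], valueContainsNull: Boolean)

def mapValueNullable(map: LiteralMap, foldableKey: Option[String]): Boolean =
  foldableKey match {
    // Constant key known to be present: null only if map values can be null.
    case Some(k) if map.entries.contains(k) => map.valueContainsNull
    // Non-foldable key, or key not provably present: stay conservative.
    case _ => true
  }
```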





[GitHub] HyukjinKwon edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-01-27 Thread GitBox
HyukjinKwon edited a comment on issue #23657: [SPARK-26566][PYTHON][SQL] 
Upgrade Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458019236
 
 
   For PyArrow 0.12.0, I don't think so, but it will run with the Arrow 0.12.0 
+ PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the 
version is 0.8.0 to test the minimum supported PyArrow version.
   
   I was thinking about bumping up the minimal PyArrow version at Spark 3.0 so 
that we can reduce such if-else branches, and reduce such overhead.





[GitHub] HyukjinKwon commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-01-27 Thread GitBox
HyukjinKwon commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade 
Apache Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458019236
 
 
   For PyArrow 0.12.0, I don't think so, but it will run with the Arrow 0.12.0 
+ PyArrow 0.8.0 combination. It needs a manual upgrade. Currently, IIRC, the 
version is 0.8.0 to test the minimum version.
   
   I was thinking about bumping up the minimal PyArrow version at Spark 3.0 so 
that we can reduce such if-else branches, and reduce such overhead.





[GitHub] eatoncys commented on issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dynamic partition failure of string types

2019-01-27 Thread GitBox
eatoncys commented on issue #23010: [SPARK-26012][SQL]Null and '' values should 
not cause dynamic partition failure of string types
URL: https://github.com/apache/spark/pull/23010#issuecomment-458018613
 
 
   @cloud-fan Would you like to review it again? Thanks.





[GitHub] maropu commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
maropu commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] 
Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251298706
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
-  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {
+  if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) {
 
 Review comment:
   Yeah, I think 1024 is better.





[GitHub] eatoncys removed a comment on issue #23010: [SPARK-26012][SQL]Null and '' values should not cause dynamic partition failure of string types

2019-01-27 Thread GitBox
eatoncys removed a comment on issue #23010: [SPARK-26012][SQL]Null and '' 
values should not cause dynamic partition failure of string types
URL: https://github.com/apache/spark/pull/23010#issuecomment-457778864
 
 
   ping @cloud-fan





[GitHub] beliefer commented on a change in pull request #23650: [SPARK-26728]Make rdd.unpersist blocking configurable

2019-01-27 Thread GitBox
beliefer commented on a change in pull request #23650: [SPARK-26728]Make 
rdd.unpersist blocking configurable
URL: https://github.com/apache/spark/pull/23650#discussion_r251297717
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/rdd/RDD.scala
 ##
 @@ -209,13 +210,14 @@ abstract class RDD[T: ClassTag](
*/
   def cache(): this.type = persist()
 
+  private val unpersistBlocking = conf.get(RDD_UNPERSIST_BLOCKING)
   /**
* Mark the RDD as non-persistent, and remove all blocks for it from memory 
and disk.
*
* @param blocking Whether to block until all blocks are deleted.
* @return This RDD.
*/
-  def unpersist(blocking: Boolean = true): this.type = {
+  def unpersist(blocking: Boolean = unpersistBlocking): this.type = {
 
 Review comment:
   Maybe this PR is a temporary solution for avoiding the RPC issue. We are 
unable to prove the measure is effective before a more detailed discussion 
about the blocking behavior of unpersist.
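   The pattern in the diff above — reading the default value of `blocking` from 
configuration — can be sketched standalone. The config key and classes below 
are hypothetical, not Spark's:

```scala
// Sketch of a config-driven default parameter, mirroring the diff's
// RDD.unpersist change; names are illustrative, not Spark's.
final class SimpleConf(settings: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    settings.get(key).map(_.toBoolean).getOrElse(default)
}

final class CachedThing(conf: SimpleConf) {
  // Evaluated once at construction; callers may still override per call.
  private val unpersistBlocking =
    conf.getBoolean("example.rdd.unpersist.blocking", default = true)

  // Returns the blocking flag actually in effect for this call.
  def unpersist(blocking: Boolean = unpersistBlocking): Boolean = blocking
}
```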





[GitHub] viirya commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
viirya commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] 
Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251297474
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
-  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) {
+  if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) {
 
 Review comment:
   We use 1024 at L497 in TaskSetManager:
   
   ```scala
   if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
 !emittedTaskSizeWarning) {
 emittedTaskSizeWarning = true
 logWarning(s"Stage ${task.stageId} contains a task of very large size " +
   s"(${serializedTask.limit() / 1024} KB). The maximum recommended task 
size is " +
   s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
   }
   ```
   
   Shall we use a consistent value?
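   A minimal sketch of a consistent check, assuming `TASK_SIZE_TO_WARN_KB` is 
1000 as in `TaskSetManager` and using the same 1024 byte multiplier as the 
`serializedTask` check quoted above:

```scala
// Sketch only: mirrors the threshold logic, not the DAGScheduler itself.
object TaskSizeWarn {
  // Assumed to mirror TaskSetManager.TASK_SIZE_TO_WARN_KB (1000 in Spark).
  val TASK_SIZE_TO_WARN_KB = 1000

  // Scale the KB threshold to bytes with 1024, consistent with the
  // serializedTask.limit() comparison, and compare the raw byte length.
  def shouldWarn(taskBinaryBytesLength: Long): Boolean =
    taskBinaryBytesLength > TASK_SIZE_TO_WARN_KB.toLong * 1024
}
```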








[GitHub] gatorsmile commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0

2019-01-27 Thread GitBox
gatorsmile commented on issue #23657: [SPARK-26566][PYTHON][SQL] Upgrade Apache 
Arrow to version 0.12.0
URL: https://github.com/apache/spark/pull/23657#issuecomment-458015915
 
 
   Will our Jenkins run Arrow 0.12?





[GitHub] SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy 
TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013714
 
 
   **[Test build #101747 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101747/testReport)**
 for PR 23656 at commit 
[`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a).





[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013641
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] 
Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013646
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7478/
   Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] 
Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013641
 
 
   Merged build finished. Test PASSed.





[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable

2019-01-27 Thread GitBox
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make 
broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251292265
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala
 ##
 @@ -132,4 +135,33 @@ class ExchangeSuite extends SparkPlanTest with 
SharedSQLContext {
 val projection2 = cached.select("_1", "_3").queryExecution.executedPlan
 assert(!projection1.sameResult(projection2))
   }
+
+  test("SPARK-26601: Make broadcast-exchange thread pool configurable") {
+val previousNumber = SparkSession.getActiveSession.get.sparkContext.conf
+  .get(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER)
+
+SparkSession.getActiveSession.get.sparkContext.conf.
+  set(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER, 1)
+    assert(SQLConf.get.getConf(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER) === 1)
+
+Future {
+  Thread.sleep(5*1000)
+} (BroadcastExchangeExec.executionContext)
+
+val f = Future {} (BroadcastExchangeExec.executionContext)
 
 Review comment:
   Actually, I have thought about this method, but I could not call the 
`getMaximumPoolSize` API, so I changed the solution.
   I will look into this idea more.
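   For reference, `java.util.concurrent.ThreadPoolExecutor` does expose 
`getMaximumPoolSize`; whether it is reachable depends on how the execution 
context wraps its executor. A standalone sketch (not the 
`BroadcastExchangeExec` pool itself), relying on the reference implementation 
of `Executors.newFixedThreadPool` returning a `ThreadPoolExecutor`:

```scala
import java.util.concurrent.{Executors, ThreadPoolExecutor}

// In the JDK reference implementation, newFixedThreadPool constructs a
// ThreadPoolExecutor, so the cast below succeeds there.
val pool = Executors.newFixedThreadPool(1).asInstanceOf[ThreadPoolExecutor]
val maxThreads = pool.getMaximumPoolSize  // the configured maximum pool size
pool.shutdown()
```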





[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458013646
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7478/
   Test PASSed.





[GitHub] HyukjinKwon commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable

2019-01-27 Thread GitBox
HyukjinKwon commented on a change in pull request #23519: [SPARK-26601][SQL] 
Make broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251294071
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
 ##
 @@ -126,4 +126,13 @@ object StaticSQLConf {
   .intConf
   .createWithDefault(1000)
 
+  val MAX_BROADCAST_EXCHANGE_THREADNUMBER =
+buildStaticConf("spark.sql.broadcastExchange.maxThreadNumber")
+  .doc("The maximum degree of parallelism to fetch and broadcast the table.If we " +
 
 Review comment:
   If that's around this code and it's the only one, yeah, let's do that. If 
there are multiple across these files, let's not include it.





[GitHub] dongjoon-hyun commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
dongjoon-hyun commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012910
 
 
   Retest this please.





[GitHub] HyukjinKwon commented on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables

2019-01-27 Thread GitBox
HyukjinKwon commented on issue #23336: [SPARK-26378][SQL] Restore performance 
of queries against wide CSV/JSON tables
URL: https://github.com/apache/spark/pull/23336#issuecomment-458012967
 
 
   I think it's clear that it saves the time. Let's just push the results file 
here.





[GitHub] SparkQA removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
SparkQA removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-457993454
 
 
   **[Test build #101743 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101743/testReport)**
 for PR 23656 at commit 
[`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a).





[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012248
 
 
   Merged build finished. Test FAILed.





[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] 
Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012250
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101743/
   Test FAILed.





[GitHub] AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23656: [SPARK-26379][SS][BRANCH-2.3] 
Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012248
 
 
   Merged build finished. Test FAILed.





[GitHub] AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use 
dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012250
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101743/
   Test FAILed.





[GitHub] SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread GitBox
SparkQA commented on issue #23656: [SPARK-26379][SS][BRANCH-2.3] Use dummy 
TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
URL: https://github.com/apache/spark/pull/23656#issuecomment-458012077
 
 
   **[Test build #101743 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101743/testReport)**
 for PR 23656 at commit 
[`52ac627`](https://github.com/apache/spark/commit/52ac627589db4ad63bf959406ad70b9d8790f40a).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI

2019-01-27 Thread GitBox
deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 
partition job in WebUI
URL: https://github.com/apache/spark/pull/23637#issuecomment-458003956
 
 
   Yes, it takes no time at all and it always succeeds. Maybe using the same 
time in `SparkListenerJobStart` and `SparkListenerJobEnd` would be better. In 
addition, the method `submitJob` is invoked in two places, and I don't want to 
handle it twice. Hence, I think the best place for the code is the method 
`submitJob` itself.
   
   
   
   





[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable

2019-01-27 Thread GitBox
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make 
broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251292265
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/ExchangeSuite.scala
 ##
 @@ -132,4 +135,33 @@ class ExchangeSuite extends SparkPlanTest with 
SharedSQLContext {
 val projection2 = cached.select("_1", "_3").queryExecution.executedPlan
 assert(!projection1.sameResult(projection2))
   }
+
+  test("SPARK-26601: Make broadcast-exchange thread pool configurable") {
+val previousNumber = SparkSession.getActiveSession.get.sparkContext.conf
+  .get(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER)
+
+SparkSession.getActiveSession.get.sparkContext.conf.
+  set(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER, 1)
+
assert(SQLConf.get.getConf(StaticSQLConf.MAX_BROADCAST_EXCHANGE_THREADNUMBER) 
=== 1)
+
+Future {
+  Thread.sleep(5*1000)
+} (BroadcastExchangeExec.executionContext)
+
+val f = Future {} (BroadcastExchangeExec.executionContext)
 
 Review comment:
   Actually, I have thought about this method, but I could not call the 
`getMaximumPoolSize` API, so I changed the solution.
   I will look further into this idea.
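   For what it's worth, the reviewer's idea of asserting on the pool bound can 
be sketched with plain `java.util.concurrent`. This is a minimal stand-alone 
sketch, not Spark's actual `BroadcastExchangeExec` pool; the thread count of 1 
stands in for the `spark.sql.broadcastExchange.maxThreadNumber` setting 
discussed above:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeSketch {
    public static void main(String[] args) {
        // Stand-in for the configured maximum thread number (assumed value).
        int maxThreads = 1;

        // A bounded pool with a hard cap on worker threads.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());

        // The configured bound is directly observable, so a test could
        // assert on it without submitting sleeping futures.
        System.out.println(pool.getMaximumPoolSize()); // 1
        pool.shutdown();
    }
}
```

   If the underlying executor were exposed as a `ThreadPoolExecutor`, a test 
could assert on `getMaximumPoolSize()` directly instead of racing futures.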





[GitHub] HyukjinKwon commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
HyukjinKwon commented on a change in pull request #23668: 
[SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251292205
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
-  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) 
{
+  if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) 
{
 
 Review comment:
   it's KB not KiB, isn't it?





[GitHub] liupc commented on issue #23614: [SPARK-26689]Support blacklisting bad disk directory and retry in DiskBlockManager

2019-01-27 Thread GitBox
liupc commented on issue #23614: [SPARK-26689]Support blacklisting bad disk 
directory and retry in DiskBlockManager
URL: https://github.com/apache/spark/pull/23614#issuecomment-458010952
 
 
   cc @srowen @vanzin @HyukjinKwon @squito @dongjoon-hyun 
   Could anybody have a look at this PR and give some suggestions?





[GitHub] liupc commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
liupc commented on issue #23580: [SPARK-26660]Add warning logs when 
broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#issuecomment-458010241
 
 
   @maropu @srowen @HyukjinKwon Really sorry for that, I will be more careful 
next time!





[GitHub] bersprockets edited a comment on issue #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV/JSON tables

2019-01-27 Thread GitBox
bersprockets edited a comment on issue #23336: [SPARK-26378][SQL] Restore 
performance of queries against wide CSV/JSON tables
URL: https://github.com/apache/spark/pull/23336#issuecomment-458008758
 
 
   @HyukjinKwon 
   
   The last time CSVBenchmark-results.txt or JSONBenchmark-results.txt was 
committed to master was roughly 200 commits before the partial results change.
   
   Also, this PR contains some changes to CSVBenchmark.scala and 
JsonBenchmark.scala. Until this PR, JsonBenchmark.scala had no test for wide 
rows (which is where the issue really shows up).
   
   Thought 1:
   
   I am thinking of creating a separate PR with only the CSVBenchmark.scala and 
JsonBenchmark.scala changes, plus the two results files. This would be the 
baseline, and it would include wide-row tests.
   
   After that PR is merged, I would run the benchmarks again, but against this 
PR. I would include the new results files as part of this PR.
   
   Or, thought 2:
   
   I could just update CSVBenchmark.scala and JsonBenchmark.scala in a local 
working copy of the baseline, run benchmarks for both my local baseline and the 
PR, and simply verify that the PR fixes the issue. Then, I would commit the two 
results file to this PR.
   
   Lastly, thought 3:
   
   Maybe we don't need the changes to CSVBenchmark.scala and 
JsonBenchmark.scala. I would just commit the results files without the new 
cases.
   
   








[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable

2019-01-27 Thread GitBox
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make 
broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251289944
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
 ##
 @@ -126,4 +126,13 @@ object StaticSQLConf {
   .intConf
   .createWithDefault(1000)
 
+  val MAX_BROADCAST_EXCHANGE_THREADNUMBER =
+buildStaticConf("spark.sql.broadcastExchange.maxThreadNumber")
+  .doc("The maximum degree of parallelism to fetch and broadcast the 
table.If we " +
 
 Review comment:
   Sorry for this style, @HyukjinKwon. Actually, I found some other code with 
the same problem; can I open a PR to fix that?





[GitHub] caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable

2019-01-27 Thread GitBox
caneGuy commented on a change in pull request #23519: [SPARK-26601][SQL] Make 
broadcast-exchange thread pool configurable
URL: https://github.com/apache/spark/pull/23519#discussion_r251289872
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala
 ##
 @@ -126,4 +126,13 @@ object StaticSQLConf {
   .intConf
   .createWithDefault(1000)
 
+  val MAX_BROADCAST_EXCHANGE_THREADNUMBER =
 
 Review comment:
   Ok @maropu Thanks





[GitHub] liupc commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
liupc commented on a change in pull request #23580: [SPARK-26660]Add warning 
logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#discussion_r251289438
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
+  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) 
{
 
 Review comment:
   @HyukjinKwon Sorry for that. Also, in the new PR we can change 1000 to 1024; 
that sounds more reasonable.








[GitHub] deshanxiao commented on a change in pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
deshanxiao commented on a change in pull request #23668: 
[SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#discussion_r251289378
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,7 +1162,7 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
-  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) 
{
+  if (taskBinaryBytes.length > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1000) 
{
 
 Review comment:
   I think 1024 would be more precise, right?
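
   For context, the check under discussion can be sketched outside Spark. The 
constant name mirrors `TaskSetManager.TASK_SIZE_TO_WARN_KB`, but its value 
here (100) and the helper methods are assumptions for illustration, not 
Spark's actual code:

```java
// Sketch of the threshold check discussed above. TASK_SIZE_TO_WARN_KB = 100
// is an assumed value; the helpers are hypothetical, not Spark's code.
public class TaskSizeWarnSketch {
    static final long TASK_SIZE_TO_WARN_KB = 100;

    // Original (buggy) form: scales the byte count, so even a tiny task
    // binary triggers the warning (e.g. 1 byte: 1 * 1000 > 100 is true).
    static boolean shouldWarnBuggy(long lengthBytes) {
        return lengthBytes * 1000 > TASK_SIZE_TO_WARN_KB;
    }

    // Corrected form: scale the threshold instead; using 1024 makes it a
    // true kibibyte threshold, as suggested in the review.
    static boolean shouldWarnFixed(long lengthBytes) {
        return lengthBytes > TASK_SIZE_TO_WARN_KB * 1024;
    }

    public static void main(String[] args) {
        System.out.println(shouldWarnBuggy(1));       // true: spurious warning
        System.out.println(shouldWarnFixed(1));       // false
        System.out.println(shouldWarnFixed(200_000)); // true: over ~100 KiB
    }
}
```

   The bug is that multiplying the byte count by 1000 compares "bytes x 1000" 
against a KB limit, so the warning fires for nearly every task.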





[GitHub] HyukjinKwon commented on a change in pull request #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
HyukjinKwon commented on a change in pull request #23580: [SPARK-26660]Add 
warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#discussion_r251287698
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
 ##
 @@ -1162,6 +1162,10 @@ private[spark] class DAGScheduler(
 partitions = stage.rdd.partitions
   }
 
+  if (taskBinaryBytes.length * 1000 > TaskSetManager.TASK_SIZE_TO_WARN_KB) 
{
 
 Review comment:
   @liupc, even if it was a minor log change, a manual test should be done 
to avoid such a mistake.





[GitHub] deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 partition job in WebUI

2019-01-27 Thread GitBox
deshanxiao edited a comment on issue #23637: [SPARK-26714][CORE][WEBUI] Show 0 
partition job in WebUI
URL: https://github.com/apache/spark/pull/23637#issuecomment-458003956
 
 
   Yes, it takes no time at all and it always succeeds. I think in this case 
we need not check it twice, and could even avoid creating a JobWaiter if necessary.








[GitHub] AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in 
JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-458003268
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101739/
   Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty 
lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-458003268
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/101739/
   Test PASSed.





[GitHub] SparkQA removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
SparkQA removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines 
in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-457973075
 
 
   **[Test build #101739 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101739/testReport)**
 for PR 23665 at commit 
[`cd2f30c`](https://github.com/apache/spark/commit/cd2f30cbec303356e276ab83a83f092b8a2c36e6).





[GitHub] AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23665: [SPARK-26745][SQL] Skip empty 
lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-458003266
 
 
   Merged build finished. Test PASSed.





[GitHub] SparkQA commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
SparkQA commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in 
JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-458003043
 
 
   **[Test build #101739 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101739/testReport)**
 for PR 23665 at commit 
[`cd2f30c`](https://github.com/apache/spark/commit/cd2f30cbec303356e276ab83a83f092b8a2c36e6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in 
JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-458003266
 
 
   Merged build finished. Test PASSed.





[GitHub] zjf2012 commented on issue #23560: [SPARK-26632][Spark Core] Separate Thread Configurations of Driver and Executor

2019-01-27 Thread GitBox
zjf2012 commented on issue #23560: [SPARK-26632][Spark Core] Separate Thread 
Configurations of Driver and Executor
URL: https://github.com/apache/spark/pull/23560#issuecomment-458002943
 
 
   @jerryshao , I checked configuration.md as well as the other related md files 
under the docs folder, and I cannot find a proper place to put my documentation. 
Actually, the Spark documentation doesn't even mention the IO threads-related 
configurations. Would you please give some suggestions?
   
   thanks.





[GitHub] AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add 
warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001541
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning 
logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001546
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7477/
   Test PASSed.





[GitHub] maropu commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
maropu commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when 
broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001665
 
 
   LGTM





[GitHub] viirya commented on a change in pull request #22193: [SPARK-25186][SQL] Remove v2 save mode.

2019-01-27 Thread GitBox
viirya commented on a change in pull request #22193: [SPARK-25186][SQL] Remove 
v2 save mode.
URL: https://github.com/apache/spark/pull/22193#discussion_r251284321
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
 ##
 @@ -143,14 +145,17 @@ object DataSourceV2Strategy extends Strategy {
 DataSourceV2ScanExec(
   r.output, r.source, r.options, r.pushedFilters, r.readSupport, 
scanConfig)) :: Nil
 
-case WriteToDataSourceV2(writer, query) =>
-  WriteToDataSourceV2Exec(writer, planLater(query)) :: Nil
+case WriteToDataSourceV2(writer, writeConfig, query) =>
+  WriteToDataSourceV2Exec(writer, writeConfig, planLater(query)) :: Nil
 
 case AppendData(r: DataSourceV2Relation, query, _) =>
-  WriteToDataSourceV2Exec(r.newWriteSupport(), planLater(query)) :: Nil
+  val writeSupport = r.newWriteSupport()
+  val writeConfig = writeSupport.createWriteConfig(
+query.schema, new DataSourceOptions(r.options.asJava))
+  WriteToDataSourceV2Exec(r.newWriteSupport(), writeConfig, 
planLater(query)) :: Nil
 
 Review comment:
   Why do `r.newWriteSupport()` again? We can reuse `writeSupport`.
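   A sketch of the change this suggests (a fragment of `DataSourceV2Strategy`, 
not compilable on its own): the second `r.newWriteSupport()` call is replaced 
with the `writeSupport` value bound just above.

```scala
case AppendData(r: DataSourceV2Relation, query, _) =>
  val writeSupport = r.newWriteSupport()
  val writeConfig = writeSupport.createWriteConfig(
    query.schema, new DataSourceOptions(r.options.asJava))
  // Reuse writeSupport instead of calling r.newWriteSupport() again:
  WriteToDataSourceV2Exec(writeSupport, writeConfig, planLater(query)) :: Nil
```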





[GitHub] AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23668: [SPARK-26660][FOLLOWUP] Add 
warning logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001546
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7477/
   Test PASSed.





[GitHub] AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning 
logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001541
 
 
   Merged build finished. Test PASSed.





[GitHub] srowen commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees

2019-01-27 Thread GitBox
srowen commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights 
to decision trees
URL: https://github.com/apache/spark/pull/21632#issuecomment-458001245
 
 
   Is there a good reason to scale it by the square of the number of samples? If 
not, yeah, it's worth a follow-up. If there is a good reason, then is there a 
case in the tests here where epsilon becomes really large, like of the same 
order of magnitude as the expected values? I don't think the tests have ~1e8 
samples. Up to your judgment.





[GitHub] SparkQA commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
SparkQA commented on issue #23668: [SPARK-26660][FOLLOWUP] Add warning logs 
when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668#issuecomment-458001181
 
 
   **[Test build #101746 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101746/testReport)**
 for PR 23668 at commit 
[`e53831a`](https://github.com/apache/spark/commit/e53831a20a91fb7aed88ac57dfac9dd9671b05f7).





[GitHub] HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows

2019-01-27 Thread GitBox
HeartSaVioR edited a comment on issue #23634: [SPARK-26154][SS] Streaming 
left/right outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-458000808
 
 
   > The rows in one of the inputs are immediately discarded while I think both 
inputs should be retained till the watermark of the operator advances so that 
the correct results can be produced.
   
   The rows were stored to the states on both sides in batch 1; they were just 
evicted from the states in different batches (batch 3 for the left side, batch 2 
for the right side). The global watermark was advanced in batch 2, so the 
right-side eviction in batch 2 was valid under the watermark condition. The 
left-side "late" eviction is just due to the join condition.





[GitHub] HeartSaVioR commented on issue #23634: [SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows

2019-01-27 Thread GitBox
HeartSaVioR commented on issue #23634: [SPARK-26154][SS] Streaming left/right 
outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634#issuecomment-458000808
 
 
   > The rows in one of the inputs are immediately discarded while I think both 
inputs should be retained till the watermark of the operator advances so that 
the correct results can be produced.
   
   The rows were stored to the states on both sides in batch 1; they were just 
evicted from the states in different batches (batch 3 for the left side, batch 2 
for the right side). The global watermark was advanced in batch 2, so the 
right-side eviction in batch 2 was valid under the watermark condition. The 
left-side late eviction is just due to the join condition.





[GitHub] srowen opened a new pull request #23668: [SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
srowen opened a new pull request #23668: [SPARK-26660][FOLLOWUP] Add warning 
logs when broadcasting large task binary
URL: https://github.com/apache/spark/pull/23668
 
 
   ## What changes were proposed in this pull request?
   
   The warning introduced in https://github.com/apache/spark/pull/23580 has a 
bug: https://github.com/apache/spark/pull/23580#issuecomment-458000380 This 
just fixes the logic.
   
   ## How was this patch tested?
   
   N/A





[GitHub] maropu commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
maropu commented on issue #23580: [SPARK-26660]Add warning logs when 
broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#issuecomment-458000585
 
 
   oh...thanks!





[GitHub] srowen commented on issue #23580: [SPARK-26660]Add warning logs when broadcasting large task binary

2019-01-27 Thread GitBox
srowen commented on issue #23580: [SPARK-26660]Add warning logs when 
broadcasting large task binary
URL: https://github.com/apache/spark/pull/23580#issuecomment-458000380
 
 
   Oh, shoot, there's a bug here: `if (taskBinaryBytes.length * 1000 > 
TaskSetManager.TASK_SIZE_TO_WARN_KB)`
The `* 1000` should be on the other side. I read right past that. I'll make 
a follow up. That should eliminate a lot of the warnings!
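   The fix described here can be sketched as follows. This is illustrative only: 
`shouldWarn` is a made-up name, and `TASK_SIZE_TO_WARN_KB = 100` is assumed to 
match the constant in `TaskSetManager`.

```scala
// Buggy form:  taskBinaryBytes.length * 1000 > TASK_SIZE_TO_WARN_KB
// fires for essentially any task binary, since a byte count times 1000
// is compared against a small KB threshold. The multiplier belongs on
// the threshold side, so bytes are compared against kilobytes:
val TASK_SIZE_TO_WARN_KB = 100 // assumed value of the TaskSetManager constant

def shouldWarn(taskBinaryLengthBytes: Long): Boolean =
  taskBinaryLengthBytes > TASK_SIZE_TO_WARN_KB * 1000

println(shouldWarn(50 * 1000L))  // 50 KB  -> false
println(shouldWarn(500 * 1000L)) // 500 KB -> true
```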





[GitHub] sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line

2019-01-27 Thread GitBox
sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve 
the behavior of sql text splitting for the spark-sql command line 
URL: https://github.com/apache/spark/pull/23276#discussion_r251283220
 
 

 ##
 File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ##
 @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver 
with Logging {
 console.printInfo(s"Spark master: $master, Application Id: $appId")
   }
 
+  // method body imported from Hive and translated from Java to Scala
+  override def processLine(line: String, allowInterrupting: Boolean): Int = {
 
 Review comment:
   Yes, and I had not checked the impl on Hive master. We may judge which impl 
is better.





[GitHub] srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.

2019-01-27 Thread GitBox
srowen commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix 
inconsistency between toJSON Method in Python and Scala.
URL: https://github.com/apache/spark/pull/23534#discussion_r251283106
 
 

 ##
 File path: docs/sql-migration-guide-upgrade.md
 ##
 @@ -45,6 +45,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - In Spark version 2.4 and earlier, if 
`org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with 
primitive-type argument, the returned UDF will return null if the input values 
is null. Since Spark 3.0, the UDF will return the default value of the Java 
type if the input value is null. For example, `val f = udf((x: Int) => x, 
IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column 
`x` is null, and return 0 in Spark 3.0. This behavior change is introduced 
because Spark 3.0 is built with Scala 2.12 by default.
 
+  - Since Spark 3.0, `DataFrame.toJSON()` in PySpark returns `DataFrame` of 
JSON string instead of `RDD`. The method in Scala/Java was changed to return 
`DataFrame` before, but the one in PySpark was not changed at that time. If you 
still want to return `RDD`, you can restore the previous behavior by setting 
`spark.sql.legacy.pyspark.toJsonShouldReturnDataFrame` to `false`.
 
 Review comment:
   The current Scala method does _not_ return a DataFrame. That's where I'm 
confused about this argument. Yes, either that has to change too, or this 
change actually makes it less consistent.





[GitHub] AmplabJenkins removed a comment on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23667: [SPARK-26745][SQL] Revert 
count optimization in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667#issuecomment-457999378
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23667: [SPARK-26745][SQL] Revert 
count optimization in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667#issuecomment-457999380
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7476/
   Test PASSed.





[GitHub] AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count 
optimization in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667#issuecomment-457999378
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23667: [SPARK-26745][SQL] Revert count 
optimization in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667#issuecomment-457999380
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/7476/
   Test PASSed.





[GitHub] SparkQA commented on issue #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
SparkQA commented on issue #23667: [SPARK-26745][SQL] Revert count optimization 
in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667#issuecomment-457999116
 
 
   **[Test build #101745 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101745/testReport)**
 for PR 23667 at commit 
[`dd5b177`](https://github.com/apache/spark/commit/dd5b1771a95f139d3773f4d796fc29cd61de0ff1).





[GitHub] HyukjinKwon commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
HyukjinKwon commented on issue #23665: [SPARK-26745][SQL] Skip empty lines in 
JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-457998561
 
 
   @sumitsu, if we agree upon reverting it (at 23667), let's convert this PR 
into a test-only PR. We can add the test.





[GitHub] HyukjinKwon edited a comment on issue #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
HyukjinKwon edited a comment on issue #23665: [SPARK-26745][SQL] Skip empty 
lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#issuecomment-457998561
 
 
   @sumitsu, if we agree upon reverting it (at #23667), let's convert this PR 
into a test-only PR. We can add the test.





[GitHub] HyukjinKwon opened a new pull request #23667: [SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959

2019-01-27 Thread GitBox
HyukjinKwon opened a new pull request #23667: [SPARK-26745][SQL] Revert count 
optimization in JSON datasource by SPARK-24959
URL: https://github.com/apache/spark/pull/23667
 
 
   ## What changes were proposed in this pull request?
   
   This PR reverts the JSON count optimization part of #21909.
   
   We cannot distinguish the cases below without parsing:
   
   ```
   [{...}, {...}]
   ```
   
   ```
   []
   ```
   
   ```
   {...}
   ```
   
   ```bash
   # empty string
   ```
   
   when we `count()`. One line (input: IN) can hold 0 records, 1 record, or 
multiple records, and this depends on each input.
   
   See also https://github.com/apache/spark/pull/23665#discussion_r251276720.
   
   ## How was this patch tested?
   
   Manually tested.
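   The ambiguity can be seen with a toy line-counter (illustrative only, not 
Spark code): the number of records a physical line contributes to `count()` 
cannot be known without looking inside the line.

```scala
// Toy record counter: crudely inspects a single JSON line to decide how
// many records it holds. The point is that the answer requires parsing;
// an unparsed line cannot simply be counted as one record.
def recordCount(line: String): Int = line.trim match {
  case ""                     => 0                  // empty string: no record
  case "[]"                   => 0                  // empty array: no record
  case t if t.startsWith("[") => t.count(_ == '}')  // crude: one per object
  case _                      => 1                  // single object: one record
}

println(recordCount(""))                        // 0
println(recordCount("[]"))                      // 0
println(recordCount("""{"a": 1}"""))            // 1
println(recordCount("""[{"a": 1}, {"a": 2}]"""))// 2
```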
   





[GitHub] maropu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect

2019-01-27 Thread GitBox
maropu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip 
empty lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251281725
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ##
 @@ -55,11 +56,15 @@ class FailureSafeParser[IN](
 
   def parse(input: IN): Iterator[InternalRow] = {
 try {
- if (skipParsing) {
-   Iterator.single(InternalRow.empty)
- } else {
-   rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () 
=> null))
- }
+  if (skipParsing) {
+if (unparsedRecordIsNonEmpty(input)) {
+  Iterator.single(InternalRow.empty)
+} else {
+  Iterator.empty
+}
+  } else {
+rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () 
=> null))
+  }
 
 Review comment:
   I thought this: 
https://github.com/maropu/spark/commit/f4df9074619e6cafff679c5f22fa153727be1079
   But you should follow the other folks who are familiar with this part, 
@HyukjinKwon and @MaxGekk.





[GitHub] cloud-fan commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line

2019-01-27 Thread GitBox
cloud-fan commented on a change in pull request #23276: [SPARK-26321][SQL] 
Improve the behavior of sql text splitting for the spark-sql command line 
URL: https://github.com/apache/spark/pull/23276#discussion_r251281683
 
 

 ##
 File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ##
 @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver 
with Logging {
 console.printInfo(s"Spark master: $master, Application Id: $appId")
   }
 
+  // method body imported from Hive and translated from Java to Scala
+  override def processLine(line: String, allowInterrupting: Boolean): Int = {
 
 Review comment:
   I checked the hive code and seems the `;` is well handled at least in the 
master branch: 
https://github.com/apache/hive/blob/master/cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java#L395
   
   Do you mean only Hive 1.2 has this bug? Maybe we should upgrade hive.





[GitHub] imatiach-msft commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample weights to decision trees

2019-01-27 Thread GitBox
imatiach-msft commented on issue #21632: [SPARK-19591][ML][MLlib] Add sample 
weights to decision trees
URL: https://github.com/apache/spark/pull/21632#issuecomment-457997552
 
 
   @srowen thank you for the merge and the thorough review.  I have some doubts 
about the tolerance we decided for zero values:
   val tolerance = Utils.EPSILON * unweightedNumSamples * unweightedNumSamples
   
   
https://github.com/apache/spark/pull/21632/files#diff-1fd1bc8d3fc9306c83cd65fbf3ca4bbeR1054
   
   For a large number of unweighted samples I am worried that it might be too 
high.  Note EPSILON=2.2E-16.  I am wondering if I should change the tolerance 
to be:
   val tolerance = Utils.EPSILON * unweightedNumSamples * (some constant)
   What are your thoughts?
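   For scale, a quick numeric check of the quadratic tolerance, using 
EPSILON ≈ 2.22e-16 as quoted above (the sample counts are illustrative):

```scala
val eps = 2.22e-16 // Utils.EPSILON, as quoted in the comment

// tolerance = EPSILON * n * n, the form currently used in the test
def tolQuadratic(n: Double): Double = eps * n * n

// At ~1e6 samples the quadratic tolerance is still tiny,
// but at ~1e8 samples it reaches order 1, i.e. comparable to the
// statistics being compared:
println(tolQuadratic(1e6)) // ~2.2e-4
println(tolQuadratic(1e8)) // ~2.2
```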





[GitHub] ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix inconsistency between toJSON Method in Python and Scala.

2019-01-27 Thread GitBox
ueshin commented on a change in pull request #23534: [SPARK-26610][PYTHON] Fix 
inconsistency between toJSON Method in Python and Scala.
URL: https://github.com/apache/spark/pull/23534#discussion_r251281383
 
 

 ##
 File path: docs/sql-migration-guide-upgrade.md
 ##
 @@ -45,6 +45,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - In Spark version 2.4 and earlier, if 
`org.apache.spark.sql.functions.udf(Any, DataType)` gets a Scala closure with 
primitive-type argument, the returned UDF will return null if the input values 
is null. Since Spark 3.0, the UDF will return the default value of the Java 
type if the input value is null. For example, `val f = udf((x: Int) => x, 
IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if column 
`x` is null, and return 0 in Spark 3.0. This behavior change is introduced 
because Spark 3.0 is built with Scala 2.12 by default.
 
+  - Since Spark 3.0, `DataFrame.toJSON()` in PySpark returns `DataFrame` of 
JSON string instead of `RDD`. The method in Scala/Java was changed to return 
`DataFrame` before, but the one in PySpark was not changed at that time. If you 
still want to return `RDD`, you can restore the previous behavior by setting 
`spark.sql.legacy.pyspark.toJsonShouldReturnDataFrame` to `false`.
 
 Review comment:
   Actually, I still feel it's inconsistent because the abstraction layer 
differs between RDD and DataFrame/Dataset. I guess users expect it to return 
something at the same abstraction level, i.e., an RDD method returns an RDD 
and a DataFrame method returns a DataFrame, so I'd rather handle Dataset as 
DataFrame in Python than as RDD.
   We might still need to discuss the function `map` or the behavior of 
`DataFrameReader.csv`/`.json` as @HyukjinKwon mentioned.





[GitHub] cloud-fan commented on issue #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-27 Thread GitBox
cloud-fan commented on issue #23645: [SPARK-26725][TEST] Fix the input values 
of UnifiedMemoryManager constructor in test suites
URL: https://github.com/apache/spark/pull/23645#issuecomment-457997323
 
 
   thanks, merging to master!





[GitHub] cloud-fan closed pull request #23645: [SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-27 Thread GitBox
cloud-fan closed pull request #23645: [SPARK-26725][TEST] Fix the input values 
of UnifiedMemoryManager constructor in test suites
URL: https://github.com/apache/spark/pull/23645
 
 
   





[GitHub] HeartSaVioR commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation

2019-01-27 Thread GitBox
HeartSaVioR commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer 
overflow in SS kafka rateLimit calculation 
URL: https://github.com/apache/spark/pull/23666#issuecomment-457997277
 
 
   You might not be familiar with the policy yet: we only add `[BRANCH-NAME]` 
to PRs against non-master branches. You can remove `[Master]` from the PR's 
title.





[GitHub] cloud-fan commented on a change in pull request #23644: [SPARK-26708][SQL] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2019-01-27 Thread GitBox
cloud-fan commented on a change in pull request #23644: [SPARK-26708][SQL] 
Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
its physical plan
URL: https://github.com/apache/spark/pull/23644#discussion_r251281049
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
 ##
 @@ -180,7 +180,26 @@ class CacheManager extends Logging {
   val it = cachedData.iterator()
   while (it.hasNext) {
 val cd = it.next()
-if (condition(cd.plan)) {
+// If `clearCache` is false (which means the recache request comes 
from a non-cascading
+// cache invalidation) and the cache buffer has already been loaded, 
we do not need to
+// re-compile a physical plan because the old plan will not be used 
any more by the
+// CacheManager although it still lives in compiled `Dataset`s and it 
could still work.
 
 Review comment:
   > the old plan will not be used any more by the CacheManager
   
   How do we guarantee it?





[GitHub] HeartSaVioR commented on a change in pull request #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation

2019-01-27 Thread GitBox
HeartSaVioR commented on a change in pull request #23666: 
[SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit 
calculation 
URL: https://github.com/apache/spark/pull/23666#discussion_r251280936
 
 

 ##
 File path: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala
 ##
 @@ -231,7 +231,11 @@ private[kafka010] class KafkaMicroBatchStream(
 val begin = from.get(tp).getOrElse(fromNew(tp))
 val prorate = limit * (size / total)
 // Don't completely starve small topicpartitions
-val off = begin + (if (prorate < 1) Math.ceil(prorate) else 
Math.floor(prorate)).toLong
+val prorateLong = (if (prorate < 1) Math.ceil(prorate) else 
Math.floor(prorate)).toLong
+// need to be careful of integer overflow
+// therefore added canary checks where to see if off variable 
could be overflowed
+// refer to [https://issues.apache.org/jira/browse/SPARK-26718]
+val off = if (prorateLong > Long.MaxValue - begin) Long.MaxValue 
else begin + prorateLong
 
 Review comment:
   Line length is limited to 100 chars and this line exceeds it.
   
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101744/console
   
   You are encouraged to run `dev/scalastyle` and fix all style issues before 
submitting additional commits.
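
   The overflow guard in the quoted diff can be exercised in isolation. Below 
is a minimal Python sketch that simulates 64-bit `Long` bounds (the helper 
name is illustrative, not Spark's actual API):

```python
LONG_MAX = 2**63 - 1  # Scala/Java Long.MaxValue

def saturating_add(begin, prorate_long):
    """Add two non-negative offsets, clamping at Long.MaxValue instead of
    overflowing, mirroring the guard added in the diff above."""
    if prorate_long > LONG_MAX - begin:
        # begin + prorate_long would exceed Long.MaxValue: clamp
        return LONG_MAX
    return begin + prorate_long

# Ordinary case: no clamping needed
assert saturating_add(100, 5) == 105
# Near the boundary: clamps instead of wrapping to a negative value
assert saturating_add(LONG_MAX - 1, 10) == LONG_MAX
```

   The comparison is rearranged as `prorate_long > LONG_MAX - begin` precisely 
so that the check itself cannot overflow.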





[GitHub] liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle serivce

2019-01-27 Thread GitBox
liupc edited a comment on issue #23647: [SPARK-26712]Support multi directories 
for executor shuffle info recovery in yarn shuffle serivce
URL: https://github.com/apache/spark/pull/23647#issuecomment-457995661
 
 
   @vanzin @HyukjinKwon 
   We once ran into a similar problem on Spark 2.0.1, before 
https://github.com/apache/spark/pull/14162 was introduced. A broken disk on 
the recovery path caused the NM to start without `YarnShuffleService`, so 
executors scheduled on that node were unable to register with 
`YarnShuffleService`, which ultimately caused the application to fail.
   
   Even though we now have https://github.com/apache/spark/pull/14162 and the 
application-level blacklist, I think this PR still makes sense for 
long-running applications (for instance, Spark ThriftServer or Spark 
Streaming applications).
   For these types of applications, this case might not be uncommon since 
they run for a long time. Even if we assume Spark would recover with the 
application-level blacklist enabled, it would still waste resources, since 
shuffles would keep failing on that node, not to mention that there is a 
chance the node is never blacklisted, causing job failure. Such resource 
waste or job failure is unacceptable for ThriftServer or streaming 
applications.
   
   I hope this explanation is convincing.





[GitHub] sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve the behavior of sql text splitting for the spark-sql command line

2019-01-27 Thread GitBox
sadhen commented on a change in pull request #23276: [SPARK-26321][SQL] Improve 
the behavior of sql text splitting for the spark-sql command line 
URL: https://github.com/apache/spark/pull/23276#discussion_r251280587
 
 

 ##
 File path: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ##
 @@ -331,6 +337,65 @@ private[hive] class SparkSQLCLIDriver extends CliDriver 
with Logging {
 console.printInfo(s"Spark master: $master, Application Id: $appId")
   }
 
+  // method body imported from Hive and translated from Java to Scala
+  override def processLine(line: String, allowInterrupting: Boolean): Int = {
 
 Review comment:
   Yes, there is a buggy implementation in Hive.





[GitHub] HyukjinKwon commented on a change in pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2019-01-27 Thread GitBox
HyukjinKwon commented on a change in pull request #21909: [SPARK-24959][SQL] 
Speed up count() for JSON and CSV
URL: https://github.com/apache/spark/pull/21909#discussion_r251280295
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 ##
 @@ -402,7 +403,7 @@ class JacksonParser(
 }
   }
 } catch {
-  case e @ (_: RuntimeException | _: JsonProcessingException) =>
+  case e @ (_: RuntimeException | _: JsonProcessingException | _: 
MalformedInputException) =>
 
 Review comment:
   Is this change related, @MaxGekk? Let's not add unrelated changes next 
time.







[GitHub] AmplabJenkins commented on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation

2019-01-27 Thread GitBox
AmplabJenkins commented on issue #23666: [SPARK-26718][SS][Master] Fixed 
integer overflow in SS kafka rateLimit calculation 
URL: https://github.com/apache/spark/pull/23666#issuecomment-457994835
 
 
   Merged build finished. Test PASSed.





[GitHub] AmplabJenkins removed a comment on issue #23666: [SPARK-26718][SS][Master] Fixed integer overflow in SS kafka rateLimit calculation

2019-01-27 Thread GitBox
AmplabJenkins removed a comment on issue #23666: [SPARK-26718][SS][Master] 
Fixed integer overflow in SS kafka rateLimit calculation 
URL: https://github.com/apache/spark/pull/23666#issuecomment-457994835
 
 
   Merged build finished. Test PASSed.




