[GitHub] [spark] dilipbiswal commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


dilipbiswal commented on a change in pull request #28953:
URL: https://github.com/apache/spark/pull/28953#discussion_r447427847



##
File path: docs/sql-data-sources-jdbc.md
##
@@ -156,6 +156,20 @@ the following case-insensitive options:
  
   
 
+  
+ preActions
+ 
+   You can specify custom queries which you want to run before reading 
data from JDBC or writing data to JDBC. It is called per DataFrame, not per 
session.
+ 
+  
+
+  
+ postActions
+ 
+This is a JDBC writer related option. You can specify custom queries 
which you want to run after writing data to JDBC. It is called per DataFrame, 
not per session.

Review comment:
   @moomindani Same question for post action.
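
   For context, a minimal usage sketch of the options under discussion (the `preActions`/`postActions` option names come from this PR and may change before merge; the JDBC URL, table, and credentials below are placeholders):

   ```scala
   // Sketch only: option names are taken from this PR; connection details are illustrative.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("jdbc-actions-example").getOrCreate()
   val df = spark.range(10).toDF("id")

   df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://dbhost:5432/testdb")
     .option("dbtable", "target_table")
     .option("user", "sa")
     .option("password", "secret")
     .option("preActions", "TRUNCATE TABLE target_table")   // runs once before the write, per DataFrame
     .option("postActions", "ANALYZE target_table")         // runs once after the write, per DataFrame
     .mode("append")
     .save()
   ```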





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dilipbiswal commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


dilipbiswal commented on a change in pull request #28953:
URL: https://github.com/apache/spark/pull/28953#discussion_r447427697



##
File path: docs/sql-data-sources-jdbc.md
##
@@ -156,6 +156,20 @@ the following case-insensitive options:
  
   
 
+  
+ preActions
+ 
+   You can specify custom queries which you want to run before reading 
data from JDBC or writing data to JDBC. It is called per DataFrame, not per 
session.

Review comment:
   @moomindani Do we allow multiple pre actions ? Or we allow just one ?








[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception information

2020-06-29 Thread GitBox


dilipbiswal commented on a change in pull request #28951:
URL: https://github.com/apache/spark/pull/28951#discussion_r447426571



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
##
@@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers {
   }
 }
   }
+
+  test("SPARK-32131 Fix wrong column index when we have more than two columns" 
+
+" during union and set operations" ) {
+val firstTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val secondTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", TimestampType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val thirdTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", TimestampType)(),
+  AttributeReference("d", FloatType)())
+
+val fourthTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", TimestampType)())
+
+val a1 = firstTable.output(0)
+val b1 = firstTable.output(1)
+val c1 = firstTable.output(2)
+val d1 = firstTable.output(3)
+
+val a2 = secondTable.output(0)
+val b2 = secondTable.output(1)
+val c2 = secondTable.output(2)
+val d2 = secondTable.output(3)
+
+val a3 = thirdTable.output(0)

Review comment:
   ditto

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
##
@@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers {
   }
 }
   }
+
+  test("SPARK-32131 Fix wrong column index when we have more than two columns" 
+
+" during union and set operations" ) {
+val firstTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val secondTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", TimestampType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val thirdTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", TimestampType)(),
+  AttributeReference("d", FloatType)())
+
+val fourthTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", TimestampType)())
+
+val a1 = firstTable.output(0)
+val b1 = firstTable.output(1)
+val c1 = firstTable.output(2)
+val d1 = firstTable.output(3)
+
+val a2 = secondTable.output(0)
+val b2 = secondTable.output(1)
+val c2 = secondTable.output(2)
+val d2 = secondTable.output(3)
+
+val a3 = thirdTable.output(0)
+val b3 = thirdTable.output(1)
+val c3 = thirdTable.output(2)
+val d3 = thirdTable.output(3)
+
+val a4 = fourthTable.output(0)

Review comment:
   ditto








[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception information

2020-06-29 Thread GitBox


dilipbiswal commented on a change in pull request #28951:
URL: https://github.com/apache/spark/pull/28951#discussion_r447426485



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
##
@@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers {
   }
 }
   }
+
+  test("SPARK-32131 Fix wrong column index when we have more than two columns" 
+
+" during union and set operations" ) {
+val firstTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val secondTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", TimestampType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val thirdTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", TimestampType)(),
+  AttributeReference("d", FloatType)())
+
+val fourthTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", TimestampType)())
+
+val a1 = firstTable.output(0)

Review comment:
   @GuoPhilipse Variables a1, b1, c1, d1 not used ? Were you planning to 
use it in the test later ?








[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception information

2020-06-29 Thread GitBox


dilipbiswal commented on a change in pull request #28951:
URL: https://github.com/apache/spark/pull/28951#discussion_r447426527



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala
##
@@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers {
   }
 }
   }
+
+  test("SPARK-32131 Fix wrong column index when we have more than two columns" 
+
+" during union and set operations" ) {
+val firstTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val secondTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", TimestampType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", FloatType)())
+
+val thirdTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", TimestampType)(),
+  AttributeReference("d", FloatType)())
+
+val fourthTable = LocalRelation(
+  AttributeReference("a", StringType)(),
+  AttributeReference("b", DoubleType)(),
+  AttributeReference("c", IntegerType)(),
+  AttributeReference("d", TimestampType)())
+
+val a1 = firstTable.output(0)
+val b1 = firstTable.output(1)
+val c1 = firstTable.output(2)
+val d1 = firstTable.output(3)
+
+val a2 = secondTable.output(0)

Review comment:
   ditto








[GitHub] [spark] liancheng commented on a change in pull request #28948: [SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options

2020-06-29 Thread GitBox


liancheng commented on a change in pull request #28948:
URL: https://github.com/apache/spark/pull/28948#discussion_r447426398



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##
@@ -248,12 +248,17 @@ class CacheManager extends Logging with 
AdaptiveSparkPlanHelper {
* `HadoopFsRelation` node(s) as part of its logical plan.
*/
   def recacheByPath(spark: SparkSession, resourcePath: String): Unit = {

Review comment:
   That also works, but it feels a little bit weird to couple the data 
source options concept with the cache manager...








[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning

2020-06-29 Thread GitBox


imback82 commented on a change in pull request #28676:
URL: https://github.com/apache/spark/pull/28676#discussion_r447418040



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##
@@ -554,7 +554,7 @@ class AdaptiveQueryExecSuite
   val smj = findTopLevelSortMergeJoin(plan)
   assert(smj.size == 2)
   val smj2 = findTopLevelSortMergeJoin(adaptivePlan)
-  assert(smj2.size == 2, origPlan.toString)
+  assert(smj2.size == 1, origPlan.toString)

Review comment:
   @cloud-fan (the initial tagging didn't work), do you have a suggestion 
on an example where a plan change could introduce a greater cost?








[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning

2020-06-29 Thread GitBox


imback82 commented on a change in pull request #28676:
URL: https://github.com/apache/spark/pull/28676#discussion_r447411442



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##
@@ -60,6 +62,92 @@ case class BroadcastHashJoinExec(
 }
   }
 
+  override def outputPartitioning: Partitioning = {
+val (buildKeys, streamedKeys) = buildSide match {
+  case BuildLeft => (leftKeys, rightKeys)
+  case BuildRight => (rightKeys, leftKeys)
+}
+
+joinType match {
+  case _: InnerLike =>
+streamedPlan.outputPartitioning match {
+  case h: HashPartitioning =>
+getBuildSidePartitioning(h, streamedKeys, buildKeys) match {
+  case Some(p) => PartitioningCollection(Seq(h, p))
+  case None => h
+}
+  case c: PartitioningCollection =>
+c.partitionings.foreach {
+  case h: HashPartitioning =>
+getBuildSidePartitioning(h, streamedKeys, buildKeys) match {
+  case Some(p) => return 
PartitioningCollection(c.partitionings :+ p)
+  case None => ()
+}
+  case _ => ()
+}
+c
+  case other => other
+}
+  case _ => streamedPlan.outputPartitioning
+}
+  }
+
+  /**

Review comment:
   In that case, I can revert back to this 
[commit](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2).
 Do you see anything missing from that commit? 
[This](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2#r434999159)
 seems fine right (meaning we can rely on the streamed side's output 
partitioning)?








[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning

2020-06-29 Thread GitBox


imback82 commented on a change in pull request #28676:
URL: https://github.com/apache/spark/pull/28676#discussion_r447411442



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##
@@ -60,6 +62,92 @@ case class BroadcastHashJoinExec(
 }
   }
 
+  override def outputPartitioning: Partitioning = {
+val (buildKeys, streamedKeys) = buildSide match {
+  case BuildLeft => (leftKeys, rightKeys)
+  case BuildRight => (rightKeys, leftKeys)
+}
+
+joinType match {
+  case _: InnerLike =>
+streamedPlan.outputPartitioning match {
+  case h: HashPartitioning =>
+getBuildSidePartitioning(h, streamedKeys, buildKeys) match {
+  case Some(p) => PartitioningCollection(Seq(h, p))
+  case None => h
+}
+  case c: PartitioningCollection =>
+c.partitionings.foreach {
+  case h: HashPartitioning =>
+getBuildSidePartitioning(h, streamedKeys, buildKeys) match {
+  case Some(p) => return 
PartitioningCollection(c.partitionings :+ p)
+  case None => ()
+}
+  case _ => ()
+}
+c
+  case other => other
+}
+  case _ => streamedPlan.outputPartitioning
+}
+  }
+
+  /**

Review comment:
   In that case, I can revert back to this 
[commit](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2).
 Do you see anything missing from that commit? 
[This](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2#r434999159)
 seems fine right?








[GitHub] [spark] cloud-fan commented on pull request #28916: [SPARK-32083][SQL] Coalesce to one partition when all partitions are empty in AQE

2020-06-29 Thread GitBox


cloud-fan commented on pull request #28916:
URL: https://github.com/apache/spark/pull/28916#issuecomment-651527077


   I think the key problem is we skip `CoalesceShufflePartitions` when 
`ShuffleQueryStageExec#mapStats` is None. This can happen when the input RDD of 
the shuffle has 0 partitions. I think we should still apply 
`CoalesceShufflePartitions` in this case and wrap `ShuffleQueryStageExec` with 
`CustomShuffleReaderExec` with `partitionSpecs` as Nil.
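
   A minimal sketch of how that situation can arise, assuming Spark 3.0 with AQE enabled (the config keys below exist in 3.0; the rest is illustrative):

   ```scala
   // Sketch: a shuffle whose input RDD has 0 partitions, so ShuffleQueryStageExec#mapStats
   // is None and CoalesceShufflePartitions is currently skipped.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .master("local[2]")
     .appName("aqe-empty-shuffle")
     .config("spark.sql.adaptive.enabled", "true")
     .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
     .getOrCreate()
   import spark.implicits._

   // An empty local relation maps to an RDD with no partitions, and the aggregation
   // below still plans a shuffle on top of it.
   val empty = Seq.empty[(Int, Int)].toDF("key", "value")
   empty.groupBy("key").count().collect()
   // With the suggested fix, the stage would be wrapped in a CustomShuffleReaderExec
   // whose partitionSpecs is Nil instead of keeping the un-coalesced shuffle partitions.
   ```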






[GitHub] [spark] HyukjinKwon commented on pull request #28940: [SPARK-32121][SHUFFLE][TEST] Fix ExternalShuffleBlockResolverSuite failed on Windows

2020-06-29 Thread GitBox


HyukjinKwon commented on pull request #28940:
URL: https://github.com/apache/spark/pull/28940#issuecomment-651517591


   Build started: [CORE] 
`org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite` 
[![PR-28940](https://ci.appveyor.com/api/projects/status/github/HyukjinKwon/spark?branch=92C9A950-6909-4EB4-93E1-18523B43DF46=true)](https://ci.appveyor.com/project/HyukjinKwon/spark/branch/92C9A950-6909-4EB4-93E1-18523B43DF46)






[GitHub] [spark] viirya commented on pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled

2020-06-29 Thread GitBox


viirya commented on pull request #28952:
URL: https://github.com/apache/spark/pull/28952#issuecomment-651516656


   ok to test






[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.

2020-06-29 Thread GitBox


beliefer commented on a change in pull request #28917:
URL: https://github.com/apache/spark/pull/28917#discussion_r447395194



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 
   override def beforeEach(): Unit = {
 super.beforeEach()
-init(new SparkConf())
+  }
+
+  override protected def test(testName: String, testTags: Tag*)(testFun: => 
Any)
+  (implicit pos: Position): Unit = {
+testWithSparkConf(testName, testTags: _*)()(testFun)(pos)
+  }
+
+  private def testWithSparkConf(testName: String, testTags: Tag*)
+  (pairs: (String, String)*)(testFun: => Any)(implicit pos: Position): 
Unit = {
+super.test(testName, testTags: _*) {
+  withSparkConf(pairs: _*)(testFun)
+}
+  }
+
+  /** Sets all configurations specified in `pairs`, calls `init`, and then 
calls `testFun` */
+  private def withSparkConf(pairs: (String, String)*)(testFun: => Any): Unit = 
{
+val conf = new SparkConf()
+pairs.foreach(kv => conf.set(kv._1, kv._2))
+init(conf)

Review comment:
   OK.








[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.

2020-06-29 Thread GitBox


beliefer commented on a change in pull request #28917:
URL: https://github.com/apache/spark/pull/28917#discussion_r447395077



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 
   override def beforeEach(): Unit = {
 super.beforeEach()
-init(new SparkConf())
+  }
+
+  override protected def test(testName: String, testTags: Tag*)(testFun: => 
Any)
+  (implicit pos: Position): Unit = {
+testWithSparkConf(testName, testTags: _*)()(testFun)(pos)
+  }
+
+  private def testWithSparkConf(testName: String, testTags: Tag*)

Review comment:
   Most SQL-related configuration parameters can be changed dynamically, 
but most of Core's parameters are static.
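
   A small illustration of that distinction (sketch; the local master and the specific config keys are just examples):

   ```scala
   // SQL configs can be changed on a live session, while core configs are fixed
   // once the SparkContext has started.
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .master("local[1]")
     .config("spark.executor.memory", "1g")               // static core config, set at startup
     .getOrCreate()

   spark.conf.set("spark.sql.shuffle.partitions", "50")   // dynamic SQL config, takes effect immediately
   println(spark.conf.get("spark.sql.shuffle.partitions"))
   ```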








[GitHub] [spark] HyukjinKwon commented on pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception information

2020-06-29 Thread GitBox


HyukjinKwon commented on pull request #28951:
URL: https://github.com/apache/spark/pull/28951#issuecomment-651512938


   ok to test






[GitHub] [spark] HyukjinKwon commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1

2020-06-29 Thread GitBox


HyukjinKwon commented on pull request #28950:
URL: https://github.com/apache/spark/pull/28950#issuecomment-651512488


   ok to test






[GitHub] [spark] HyukjinKwon commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1

2020-06-29 Thread GitBox


HyukjinKwon commented on pull request #28950:
URL: https://github.com/apache/spark/pull/28950#issuecomment-651511487


   Yeah, to upgrade we should drop Python 2. I am targeting Spark 3.1 to drop it, and I will make a PR to officially drop it first.






[GitHub] [spark] kiszk commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


kiszk commented on a change in pull request #28953:
URL: https://github.com/apache/spark/pull/28953#discussion_r447389935



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
##
@@ -156,9 +157,16 @@ object JDBCRDD extends Logging {
 val url = options.url
 val dialect = JdbcDialects.get(url)
 val quotedColumns = requiredColumns.map(colName => 
dialect.quoteIdentifier(colName))
+
+val conn = JdbcUtils.createConnectionFactory(options)
+options.preActions match {
+  case Some(i) =>
+runQuery(conn(), i, options)

Review comment:
   nit: `conn()` -> `conn` ?
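
   For readers of the thread, this nit hinges on whether `createConnectionFactory` returns an open `Connection` or a `() => Connection` factory. A standalone sketch of the factory variant (an assumption here, based on existing `JdbcUtils` usage; the H2 URL is just an example and needs an H2 driver on the classpath):

   ```scala
   import java.sql.{Connection, DriverManager}

   // `conn` is the factory; `conn()` actually opens a connection.
   def createConnectionFactory(url: String): () => Connection =
     () => DriverManager.getConnection(url)

   val conn: () => Connection = createConnectionFactory("jdbc:h2:mem:test")
   val connection: Connection = conn()   // invoke the factory to open the connection
   try {
     connection.createStatement().execute("SELECT 1")
   } finally {
     connection.close()
   }
   ```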








[GitHub] [spark] erenavsarogullari edited a comment on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


erenavsarogullari edited a comment on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167


   Thanks @dongjoon-hyun for the review. All comments are addressed. I think it 
is ready to go.
   Also, we plan to use Prometheus + Grafana with the proposed format change (through a custom release). It may still be useful in the long term if we revisit the current format in the future.






[GitHub] [spark] erenavsarogullari edited a comment on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


erenavsarogullari edited a comment on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167


   Thanks @dongjoon-hyun for the review. All comments are addressed. I think it 
is ready to go.
   Also, we plan to use Prometheus + Grafana with the proposed format change. It may still be useful in the long term if we revisit the current format in the future.






[GitHub] [spark] turboFei commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsExceptio

2020-06-29 Thread GitBox


turboFei commented on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-651498450


   Gentle ping @dongjoon-hyun  @dbtsai 






[GitHub] [spark] erenavsarogullari commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


erenavsarogullari commented on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167


   Thanks @dongjoon-hyun for the review. All comments are addressed. I think it 
is ready to go.






[GitHub] [spark] erenavsarogullari commented on a change in pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


erenavsarogullari commented on a change in pull request #28865:
URL: https://github.com/apache/spark/pull/28865#discussion_r447380991



##
File path: 
core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry}
+
+import org.scalatest.PrivateMethodTester
+import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite}
+
+
+class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester {
+
+  test("PrometheusServletSuite registered metrics test") {
+val sink = createPrometheusServlet()
+
+val gauge = new Gauge[Double] {
+  override def getValue: Double = 5.0
+}
+
+val counter = new Counter
+counter.inc(10)
+
+sink.registry.register("gauge1", gauge)
+sink.registry.register("gauge2", gauge)
+sink.registry.register("counter1", counter)
+
+val metricGaugeKeys = sink.registry.getGauges.keySet.asScala
+assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")),
+  "Should contain 2 gauges metrics registered")
+
+val metricCounterKeys = sink.registry.getCounters.keySet.asScala
+assert(metricCounterKeys.equals(Set("counter1")),
+  "Should contain 1 counter metric registered")
+
+val gaugeValues = sink.registry.getGauges.values.asScala
+assert(gaugeValues.size == 2)
+gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0))
+
+val counterValues = sink.registry.getCounters.values.asScala
+assert(counterValues.size == 1)
+counterValues.foreach(counter => assert(counter.getCount == 10))
+  }
+
+  test("PrometheusServletSuite private normalizeKey function test") {
+val key = "local-1592132938718.driver.LiveListenerBus." +
+  "listenerProcessingTime.org.apache.spark.HeartbeatReceiver"
+val sink = createPrometheusServlet()
+val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key)
+assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" +
+  "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_")
+  }
+
+  private def createPrometheusServlet(): PrometheusServlet = {
+val props = new Properties
+props.put("host", "127.0.0.1")
+props.put("port", "12340")

Review comment:
   Actually, both host and port seem not to be required for the current UTs, so I removed them.








[GitHub] [spark] LuciferYang commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsExcep

2020-06-29 Thread GitBox


LuciferYang commented on pull request #26339:
URL: https://github.com/apache/spark/pull/26339#issuecomment-651497369


   @dongjoon-hyun @turboFei Is this PR still being worked on? We are having similar issues in our production environment, and I found similar PRs trying to solve this problem, such as #26090 and #26971.






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627


   (IMHO this might still be a good chance to leverage this PR to establish proper versioning - so that version 2 can serve as an interim with versioning best practices, and version 3 can add such major features without versioning headaches.)






[GitHub] [spark] dongjoon-hyun commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28708:
URL: https://github.com/apache/spark/pull/28708#issuecomment-651487745


   Retest this please.






[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651487054


   Retest this please.






[GitHub] [spark] dongjoon-hyun commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28950:
URL: https://github.com/apache/spark/pull/28950#issuecomment-651485642


   +1 for @holdenk 's advice.






[GitHub] [spark] sarutak commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-29 Thread GitBox


sarutak commented on pull request #28942:
URL: https://github.com/apache/spark/pull/28942#issuecomment-651485037


   ok to test.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-29 Thread GitBox


AmplabJenkins removed a comment on pull request #28942:
URL: https://github.com/apache/spark/pull/28942#issuecomment-650836727


   Can one of the admins verify this patch?






[GitHub] [spark] sarutak commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-29 Thread GitBox


sarutak commented on pull request #28942:
URL: https://github.com/apache/spark/pull/28942#issuecomment-651485158


   cc: @squito 






[GitHub] [spark] HeartSaVioR commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627


   (IMHO it might still be good to use this PR as a chance to establish proper versioning - so that version 2 can serve as an interim with versioning best practices, and version 3 can add such major features without versioning headaches.)






[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.

2020-06-29 Thread GitBox


beliefer commented on a change in pull request #28917:
URL: https://github.com/apache/spark/pull/28917#discussion_r447372063



##
File path: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
##
@@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 
   override def beforeEach(): Unit = {
 super.beforeEach()
-init(new SparkConf())
+  }
+
+  override protected def test(testName: String, testTags: Tag*)(testFun: => 
Any)
+  (implicit pos: Position): Unit = {
+testWithSparkConf(testName, testTags: _*)()(testFun)(pos)
+  }
+
+  private def testWithSparkConf(testName: String, testTags: Tag*)
+  (pairs: (String, String)*)(testFun: => Any)(implicit pos: Position): 
Unit = {
+super.test(testName, testTags: _*) {
+  withSparkConf(pairs: _*)(testFun)
+}
+  }
+
+  /** Sets all configurations specified in `pairs`, calls `init`, and then 
calls `testFun` */
+  private def withSparkConf(pairs: (String, String)*)(testFun: => Any): Unit = 
{

Review comment:
   OK. I will simulate `SQLHelper`.
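
   For reference, a self-contained sketch of the pattern from the quoted diff (the helper name and shape come from this PR; the real version also calls `init` and hooks into ScalaTest's `test`):

   ```scala
   import org.apache.spark.SparkConf

   // Build a fresh SparkConf from (key, value) pairs, then run the test body against it.
   def withSparkConf(pairs: (String, String)*)(testFun: SparkConf => Unit): Unit = {
     val conf = new SparkConf(loadDefaults = false)
     pairs.foreach { case (k, v) => conf.set(k, v) }
     testFun(conf)
   }

   // Example: the body sees only the configuration supplied for this test.
   withSparkConf("spark.test.dummy" -> "value") { conf =>
     assert(conf.get("spark.test.dummy") == "value")
   }
   ```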








[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign metadata log as 
well as file stream source itself, which wasn't a goal actually. As I commented 
previously, building a holistic solution was not a goal, because I already 
indicated it should take considerable time, and someone might claim that it's 
reinventing the wheel (I know these functionalities are extracted from 
alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade, 
**before proceeding 2**.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll try to persuade my employer 
to allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case 
instead of thinking too general.
   
   WDYT?






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign metadata log as 
well as file stream source itself, which wasn't a goal actually. As I commented 
previously, building a holistic solution was not a goal, because I already 
indicated it should take considerable time, and someone might claim that it's 
reinventing the wheel (I know these functionalities are extracted from 
alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll try to persuade my employer 
to allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case 
instead of thinking too general.
   
   WDYT?






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign, which wasn't a 
goal actually. As I commented previously, building a holistic solution was not 
a goal, because I already indicated it should take considerable time, and 
someone might claim that it's reinventing the wheel (I know these 
functionalities are extracted from alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll try to persuade my employer 
to allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case 
instead of thinking too general.
   
   WDYT?






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign, which wasn't a 
goal actually. As I commented previously, building a holistic solution was not 
a goal, because I already indicated it should take considerable time, and 
someone might claim that it's reinventing the wheel (I know these 
functionalities are extracted from alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirement altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll persuade my employer to 
allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case.
   
   WDYT?






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign, which wasn't a 
goal actually. As I commented previously, building a holistic solution was not 
a goal, because I already indicated it should take considerable time, and 
someone might claim that it's reinventing the wheel (I know these 
functionalities are extracted from alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll persuade my employer to 
allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case.
   
   WDYT?






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign, which wasn't a 
goal actually. As I commented previously, building a holistic solution was not 
a goal, because I already indicated it should take considerable time, and 
someone might claim that it's reinventing the wheel (I know these 
functionalities are extracted from alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll try to persuade my employer 
to allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case.
   
   WDYT?






[GitHub] [spark] HeartSaVioR commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think considering all of these would take me to redesign, which wasn't a 
goal actually. As I commented previously, building a holistic solution was not 
a goal, because I already indicated it should take considerable time, and 
someone might claim that it's reinventing the wheel.
   
   That said, shall we do the following?
   
   1. Review and merge other PRs which don't incur metadata version upgrade.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirement altogether, with one of PMC 
member being shepherd. (It would be amazing if you can volunteer.) Once someone 
volunteers and promises to make thing forward, I'll persuade my employer to 
allow me to take the work.
   
   If it's not desired to go with 2, I'd ask to make the file source/sink 
metadata log class be pluggable at least (I'll craft a PR soon if that's the 
way to go), so that someone can move forward with restricting the use case.
   
   WDYT?






[GitHub] [spark] maropu commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector

2020-06-29 Thread GitBox


maropu commented on pull request #28863:
URL: https://github.com/apache/spark/pull/28863#issuecomment-651478897










[GitHub] [spark] holdenk commented on a change in pull request #28864: [SPARK-32004][ALL] Drop references to slave

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28864:
URL: https://github.com/apache/spark/pull/28864#discussion_r447366211



##
File path: 
resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala
##
@@ -48,7 +46,7 @@ private[spark] class MesosExecutorBackend
 val mesosTaskId = TaskID.newBuilder().setValue(taskId.toString).build()
 driver.sendStatusUpdate(MesosTaskStatus.newBuilder()
   .setTaskId(mesosTaskId)
-  .setState(taskStateToMesos(state))
+  .setState(MesosSchedulerBackendUtil.taskStateToMesos(state))

Review comment:
   Yeah, I was in the area and noticed a TODO I could resolve quickly.








[GitHub] [spark] maropu commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


maropu commented on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-651469890


   ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


maropu commented on a change in pull request #28953:
URL: https://github.com/apache/spark/pull/28953#discussion_r447359906



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala
##
@@ -574,6 +576,41 @@ class JDBCWriteSuite extends SharedSparkSession with 
BeforeAndAfter {
 }
   }
 
+  test("SPARK-32013: option preActions/postActions, run SQL before writing 
data.") {

Review comment:
   nit: Basically, we don't need the prefix, e.g., SPARK-32013, when adding 
new features.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moomindani edited a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


moomindani edited a comment on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-651467865


   @dilipbiswal Sure I added it in this PR description.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LantaoJin commented on pull request #28935: [SPARK-20680][SQL] Adding HiveVoidType in Spark to be compatible with Hive

2020-06-29 Thread GitBox


LantaoJin commented on pull request #28935:
URL: https://github.com/apache/spark/pull/28935#issuecomment-651469114


   @cloud-fan I refactored some code; now I think this PR has no dependencies.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moomindani commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


moomindani commented on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-651467865


   @dilipbiswal Sure I will add it in this PR description.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dilipbiswal commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


dilipbiswal commented on pull request #28953:
URL: https://github.com/apache/spark/pull/28953#issuecomment-651466295


   @moomindani Thanks. Could we illustrate the usage of these two options with 
examples in the PR description? I think it will help the reviewers.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xianyinxin commented on pull request #28943: [SPARK-32127][SQL]: Check rules for MERGE INTO should use MergeAction.conditition other than MergeAction.children

2020-06-29 Thread GitBox


xianyinxin commented on pull request #28943:
URL: https://github.com/apache/spark/pull/28943#issuecomment-651465152


   Thanks @cloud-fan !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xianyinxin commented on pull request #28875: [SPARK-32030][SPARK-32127][SQL] Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO

2020-06-29 Thread GitBox


xianyinxin commented on pull request #28875:
URL: https://github.com/apache/spark/pull/28875#issuecomment-651465114


   Thanks @cloud-fan !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moomindani opened a new pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC

2020-06-29 Thread GitBox


moomindani opened a new pull request #28953:
URL: https://github.com/apache/spark/pull/28953


   ### What changes were proposed in this pull request?
   
   
   This pull request is to support query execution before/after reading/writing 
over JDBC.
   There are two new options; `preActions` and `postActions` in DataFrame's 
JDBC configuration.
   
   SQL statements specified in `preActions` will be executed before 
reading/writing DataFrame via JDBC.
   SQL statements specified in `postActions` will be executed after writing 
DataFrame via JDBC.
   
   Note: `postActions` is only supported in the JDBC writer, not in the JDBC 
reader. This is because it won't be needed as often, and I was not able to 
find a good place to implement it.
   
   ### Why are the changes needed?
   
   
   For ETL workloads, there is a common requirement to run SQL statements 
before/after reading/writing over JDBC.
   Here are some examples:
   - Create a view with specific conditions
   - Delete/Update some records
   - Truncate a table (already possible via the `truncate` option)
   - Execute a stored procedure (also requested in SPARK-32014)
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   With this feature, users can run any SQL statements before/after 
reading/writing DataFrame over JDBC.
   
   It does not affect any existing behavior. It just adds new options to use 
this new feature.
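
   To make the usage concrete, here is a minimal sketch (not taken from the 
patch itself). It assumes `df` is an existing DataFrame, the JDBC URL, table 
names and SQL statements are made up, and each option carries a single SQL 
statement:
   
   ```scala
   // Hypothetical example of the proposed preActions/postActions options.
   // The URL, table names and SQL statements below are illustrative only.
   val jdbcUrl = "jdbc:postgresql://db.example.com:5432/sales"
   
   df.write
     .format("jdbc")
     .option("url", jdbcUrl)
     .option("dbtable", "staging.orders")
     // Runs before the write: clear out today's staging rows.
     .option("preActions", "DELETE FROM staging.orders WHERE load_date = CURRENT_DATE")
     // Runs after the write: record the load in an audit table.
     .option("postActions", "INSERT INTO audit.load_log VALUES (CURRENT_TIMESTAMP, 'orders')")
     .save()
   ```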
   
   ### How was this patch tested?
   
   
   I added test cases to `JDBCSuite.scala` and `JDBCWriteSuite.scala`, and 
confirmed that all the tests passed.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tharradine commented on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected

2020-06-29 Thread GitBox


tharradine commented on pull request #28946:
URL: https://github.com/apache/spark/pull/28946#issuecomment-651455471


   This isn't exactly the SPARK-32123 fix I was expecting; I was expecting the 
behaviour mentioned in the 
[docs](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#timestamp-with-time-zone-semantics)
 to be adhered to, specifically:
   
   > When timestamp data is exported or displayed in Spark, the session time 
zone is used to localize the timestamp values.
   
   Is it not preferable to perform this localization whilst converting from/to 
internal timestamp datatypes, and preserve the documented behaviour?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


zsxwing commented on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651454246


   The numbers are pretty impressive. Thanks a lot for your work.
   
   My high level comments regarding the PR:
   
   - The compression codec should not be hardcoded. It's better to allow the 
user to configure the writer codec like other codec configs in Spark. The 
reader should be able to identify the codec and load files correctly (a rough 
sketch follows at the end of this comment). Otherwise, we would need to bump 
the version again when adding support for more codecs.
   - The cost of bumping the file sink version is much higher than for the 
checkpoint. The file sink metadata should be compatible with old versions if 
possible, like other storage formats such as Parquet and ORC.
     - For example, in some companies, the team that generates the output may 
not be able to convince the downstream teams to upgrade their Spark version.
     - The file sink metadata can be read by an external system. Bumping the 
version would require the ecosystem to build a new reader.
     - If a user hits a critical issue in a new Spark version, they may want 
to roll back. Bumping the version would break this workflow.
   - It's better to make the default behavior like this:
 - Use v1 if an existing directory is using v1, so that we can still  
support rollback. The user can set a flag to enable v2 explicitly.
 - Use v2 if it's a new directory.
   
   I'm wondering if we can step back and think about whether it is possible to 
solve the issue without bumping the version. IMO, the major issue is that the 
number of files increases without bound over time. This PR raises the upper 
bound but doesn't change the complexity. Maybe we should try to see how to 
reduce the number of files, such as by supporting data compaction? You may 
notice that `FileStreamSinkLog.DELETE_ACTION` is never used. This is because 
it was added to support data compaction, but we haven't implemented that yet.
   
   I totally understand that we may not be able to solve the metadata issue 
without bumping the file metadata version. In that case, it's better to design 
a file sink log format that solves all of the following issues, so that we 
don't need to bump the version again in the future:
   
   - Data compaction support.
   - Open source format. It would be great if the new file sink log format 
could be read by Spark and other systems. An extra benefit is that we could 
use Spark itself to process the metadata, so that we don't have to keep all of 
it in driver memory.
   - Better file stream source support. Since we always append files to the 
file sink, when reading a file sink with streaming queries it would be great 
if we could locate newly appended files quickly. For example, we could use 
`log file name + line number` as the streaming source offset instead; then we 
could jump to the log file directly.
   - Filter push down support when reading a file sink directory. For example, 
if a user queries only one partition, we don't need to load metadata of other 
partitions into the driver memory.
   - Decouple the file sink from the streaming query. Currently, one file sink 
output always maps to one streaming query. The user may want to use another 
streaming query, or even a batch query, to write to the same file sink, but 
because we use the batch id as the file name, this is not supported.
   - Use relative paths. Copying the file sink directory breaks it today. It 
would be great if we remembered relative paths so that people could move or 
copy the directory.
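
   To illustrate the codec point above, here is a rough, purely hypothetical 
sketch of a reader-identified codec header; the layout and codec strings are 
made up and are not the actual FileStreamSinkLog format:
   
   ```scala
   import java.io.{DataInputStream, DataOutputStream}
   
   // Hypothetical metadata log header: the writer records the version and codec,
   // and any reader can detect them before decompressing the entries.
   object MetadataLogHeader {
     def write(out: DataOutputStream, version: Int, codec: String): Unit = {
       out.writeUTF(s"v$version") // e.g. "v2"
       out.writeUTF(codec)        // e.g. "lz4", "zstd", or "none"
     }
   
     def read(in: DataInputStream): (Int, String) = {
       val version = in.readUTF().stripPrefix("v").toInt
       val codec = in.readUTF() // the caller resolves the matching codec or fails clearly
       (version, codec)
     }
   }
   ```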



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28863:
URL: https://github.com/apache/spark/pull/28863#issuecomment-651449133


   Hi, @gaborgsomogyi . Is `OracleKrbIntegrationSuite` missing here?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector

2020-06-29 Thread GitBox


AmplabJenkins removed a comment on pull request #28863:
URL: https://github.com/apache/spark/pull/28863#issuecomment-651138861


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124640/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28863:
URL: https://github.com/apache/spark/pull/28863#issuecomment-651442483


   Retest this please.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rdblue commented on a change in pull request #28864: [SPARK-32004][ALL] Drop references to slave

2020-06-29 Thread GitBox


rdblue commented on a change in pull request #28864:
URL: https://github.com/apache/spark/pull/28864#discussion_r447336956



##
File path: 
resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala
##
@@ -48,7 +46,7 @@ private[spark] class MesosExecutorBackend
 val mesosTaskId = TaskID.newBuilder().setValue(taskId.toString).build()
 driver.sendStatusUpdate(MesosTaskStatus.newBuilder()
   .setTaskId(mesosTaskId)
-  .setState(taskStateToMesos(state))
+  .setState(MesosSchedulerBackendUtil.taskStateToMesos(state))

Review comment:
   Was this method moved to implement the TODO item? This looks unrelated 
to the rename changes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rdblue commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave

2020-06-29 Thread GitBox


rdblue commented on pull request #28864:
URL: https://github.com/apache/spark/pull/28864#issuecomment-651442680


   The updates look good. +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #27963: [SPARK-31199]separate shuffle io connect timeout from idle timeout

2020-06-29 Thread GitBox


github-actions[bot] commented on pull request #27963:
URL: https://github.com/apache/spark/pull/27963#issuecomment-651439774


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #27971: [SPARK-31206][SQL] AQE should not use the same SubqueryExec when reuse is off

2020-06-29 Thread GitBox


github-actions[bot] closed pull request #27971:
URL: https://github.com/apache/spark/pull/27971


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions

2020-06-29 Thread GitBox


github-actions[bot] closed pull request #24939:
URL: https://github.com/apache/spark/pull/24939


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651438818


   Also, please update the PR description to keep it consistent. Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage

2020-06-29 Thread GitBox


dongjoon-hyun commented on a change in pull request #28865:
URL: https://github.com/apache/spark/pull/28865#discussion_r447332799



##
File path: 
core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry}
+
+import org.scalatest.PrivateMethodTester
+import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite}
+
+
+class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester {
+
+  test("PrometheusServletSuite registered metrics test") {
+val sink = createPrometheusServlet()
+
+val gauge = new Gauge[Double] {
+  override def getValue: Double = 5.0
+}
+
+val counter = new Counter
+counter.inc(10)
+
+sink.registry.register("gauge1", gauge)
+sink.registry.register("gauge2", gauge)
+sink.registry.register("counter1", counter)
+
+val metricGaugeKeys = sink.registry.getGauges.keySet.asScala
+assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")),
+  "Should contain 2 gauges metrics registered")
+
+val metricCounterKeys = sink.registry.getCounters.keySet.asScala
+assert(metricCounterKeys.equals(Set("counter1")),
+  "Should contain 1 counter metric registered")
+
+val gaugeValues = sink.registry.getGauges.values.asScala
+assert(gaugeValues.size == 2)
+gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0))
+
+val counterValues = sink.registry.getCounters.values.asScala
+assert(counterValues.size == 1)
+counterValues.foreach(counter => assert(counter.getCount == 10))
+  }
+
+  test("PrometheusServletSuite private normalizeKey function test") {

Review comment:
   `normalizeKey` might be enough for the test name in this context.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage

2020-06-29 Thread GitBox


dongjoon-hyun commented on a change in pull request #28865:
URL: https://github.com/apache/spark/pull/28865#discussion_r447332927



##
File path: 
core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry}
+
+import org.scalatest.PrivateMethodTester
+import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite}
+
+
+class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester {
+
+  test("PrometheusServletSuite registered metrics test") {
+val sink = createPrometheusServlet()
+
+val gauge = new Gauge[Double] {
+  override def getValue: Double = 5.0
+}
+
+val counter = new Counter
+counter.inc(10)
+
+sink.registry.register("gauge1", gauge)
+sink.registry.register("gauge2", gauge)
+sink.registry.register("counter1", counter)
+
+val metricGaugeKeys = sink.registry.getGauges.keySet.asScala
+assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")),
+  "Should contain 2 gauges metrics registered")
+
+val metricCounterKeys = sink.registry.getCounters.keySet.asScala
+assert(metricCounterKeys.equals(Set("counter1")),
+  "Should contain 1 counter metric registered")
+
+val gaugeValues = sink.registry.getGauges.values.asScala
+assert(gaugeValues.size == 2)
+gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0))
+
+val counterValues = sink.registry.getCounters.values.asScala
+assert(counterValues.size == 1)
+counterValues.foreach(counter => assert(counter.getCount == 10))
+  }
+
+  test("PrometheusServletSuite private normalizeKey function test") {
+val key = "local-1592132938718.driver.LiveListenerBus." +
+  "listenerProcessingTime.org.apache.spark.HeartbeatReceiver"
+val sink = createPrometheusServlet()
+val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key)
+assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" +
+  "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_")
+  }
+
+  private def createPrometheusServlet(): PrometheusServlet = {
+val props = new Properties
+props.put("host", "127.0.0.1")
+props.put("port", "12340")
+val registry = new MetricRegistry
+val securityMgr = new SecurityManager(new SparkConf(false))
+new PrometheusServlet(props, registry, securityMgr)
+  }
+

Review comment:
   Shall we remove the empty line?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage

2020-06-29 Thread GitBox


dongjoon-hyun commented on a change in pull request #28865:
URL: https://github.com/apache/spark/pull/28865#discussion_r447332338



##
File path: 
core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry}
+
+import org.scalatest.PrivateMethodTester
+import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite}
+
+
+class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester {
+
+  test("PrometheusServletSuite registered metrics test") {

Review comment:
   Instead of using the test suite name, `PrometheusServletSuite`, shall we 
simply put "registered metrics" in the test name?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage

2020-06-29 Thread GitBox


dongjoon-hyun commented on a change in pull request #28865:
URL: https://github.com/apache/spark/pull/28865#discussion_r447331955



##
File path: 
core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics.sink
+
+import java.util.Properties
+
+import scala.collection.JavaConverters._
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry}
+
+import org.scalatest.PrivateMethodTester
+import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite}
+
+
+class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester {
+
+  test("PrometheusServletSuite registered metrics test") {
+val sink = createPrometheusServlet()
+
+val gauge = new Gauge[Double] {
+  override def getValue: Double = 5.0
+}
+
+val counter = new Counter
+counter.inc(10)
+
+sink.registry.register("gauge1", gauge)
+sink.registry.register("gauge2", gauge)
+sink.registry.register("counter1", counter)
+
+val metricGaugeKeys = sink.registry.getGauges.keySet.asScala
+assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")),
+  "Should contain 2 gauges metrics registered")
+
+val metricCounterKeys = sink.registry.getCounters.keySet.asScala
+assert(metricCounterKeys.equals(Set("counter1")),
+  "Should contain 1 counter metric registered")
+
+val gaugeValues = sink.registry.getGauges.values.asScala
+assert(gaugeValues.size == 2)
+gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0))
+
+val counterValues = sink.registry.getCounters.values.asScala
+assert(counterValues.size == 1)
+counterValues.foreach(counter => assert(counter.getCount == 10))
+  }
+
+  test("PrometheusServletSuite private normalizeKey function test") {
+val key = "local-1592132938718.driver.LiveListenerBus." +
+  "listenerProcessingTime.org.apache.spark.HeartbeatReceiver"
+val sink = createPrometheusServlet()
+val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key)
+assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" +
+  "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_")
+  }
+
+  private def createPrometheusServlet(): PrometheusServlet = {
+val props = new Properties
+props.put("host", "127.0.0.1")
+props.put("port", "12340")

Review comment:
   Is the fixed port number safe under concurrent testing? In the Apache Spark 
Jenkins farm, multiple test jobs run concurrently.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage

2020-06-29 Thread GitBox


dongjoon-hyun commented on pull request #28865:
URL: https://github.com/apache/spark/pull/28865#issuecomment-651435667


   Retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LantaoJin commented on pull request #28947: [SPARK-32129][SQL] Support AQE skew join with Union

2020-06-29 Thread GitBox


LantaoJin commented on pull request #28947:
URL: https://github.com/apache/spark/pull/28947#issuecomment-651435088


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] TJX2014 commented on a change in pull request #28882: [SPARK-31751][SQL]Serde property `path` overwrites hive table property location

2020-06-29 Thread GitBox


TJX2014 commented on a change in pull request #28882:
URL: https://github.com/apache/spark/pull/28882#discussion_r447323119



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
##
@@ -218,4 +219,26 @@ class HiveExternalCatalogSuite extends 
ExternalCatalogSuite {
 val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
 assert(alteredTable.provider === Some("foo"))
   }
+
+  test("SPARK-31751: serde property `path` overwrites hive table property 
location") {
+val catalog = newBasicCatalog()
+val hiveTable = CatalogTable(
+  identifier = TableIdentifier("parq_alter", Some("db1")),
+  tableType = CatalogTableType.MANAGED,
+  storage = storageFormat,
+  schema = new StructType().add("col1", "int"),
+  provider = Some("parquet"))
+catalog.createTable(hiveTable, ignoreIfExists = false)
+val beforeAlterTable = externalCatalog.getTable("db1", "parq_alter")
+
assert(beforeAlterTable.storage.locationUri.toString.contains("parq_alter"))
+
+externalCatalog.client.runSqlHive(
+  "alter table db1.parq_alter rename to db1.parq_alter2")
+
+val e = intercept[AnalysisException](
+  externalCatalog.getTable("db1", "parq_alter2")
+)
+assert(e.getMessage.contains("not equal to table prop path")
+  && e.getMessage.contains("parq_alter2"))
+  }

Review comment:
   We will get an exception when the path property is not consistent with the 
storage location.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled

2020-06-29 Thread GitBox


viirya commented on pull request #28952:
URL: https://github.com/apache/spark/pull/28952#issuecomment-651432458


   I also think this might be worth a new jira ticket, but since we initially 
discussed it as a follow-up, I put it as a follow-up first.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya opened a new pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled

2020-06-29 Thread GitBox


viirya opened a new pull request #28952:
URL: https://github.com/apache/spark/pull/28952


   
   
   ### What changes were proposed in this pull request?
   
   
   As a follow-up to #28900, this patch extends partition coalescing to 
repartitioning via hints and SQL syntax without a specified number of 
partitions, when AQE is enabled.
   
   ### Why are the changes needed?
   
   
   When repartitioning using hints and SQL syntax, we should follow the 
shuffling behavior of repartition by expression/range and coalesce partitions 
when AQE is enabled.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes. After this change, if users don't specify the number of partitions when 
repartitioning using `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE 
BY`/`CLUSTER BY`, AQE will coalesce partitions.
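
   For illustration (not from the patch), a minimal sketch of the kinds of 
queries affected; `events` and `user_id` are hypothetical names, and the 
sketch assumes an existing SparkSession `spark` with the table registered:
   
   ```scala
   // With AQE enabled and no explicit partition number, the shuffle produced by
   // these repartitionings would now have its partitions coalesced.
   spark.conf.set("spark.sql.adaptive.enabled", "true")
   
   // REPARTITION hint without a partition number.
   val hinted = spark.sql("SELECT /*+ REPARTITION(user_id) */ * FROM events")
   
   // DISTRIBUTE BY without a partition number.
   val distributed = spark.sql("SELECT * FROM events DISTRIBUTE BY user_id")
   ```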
   
   ### How was this patch tested?
   
   
   Unit tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rajatahujaatinmobi commented on a change in pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly

2020-06-29 Thread GitBox


rajatahujaatinmobi commented on a change in pull request #28880:
URL: https://github.com/apache/spark/pull/28880#discussion_r446745968



##
File path: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
##
@@ -211,9 +211,11 @@ private[spark] class ApplicationMaster(
   final def run(): Int = {
 try {
   val attemptID = if (isClusterMode) {
-// Set the web ui port to be ephemeral for yarn so we don't conflict 
with
-// other spark processes running on the same box
-System.setProperty(UI_PORT.key, "0")
+// Set the web ui port to be ephemeral for yarn if not set explicitly
+// so we don't conflict with other spark processes running on the same 
box
+if (System.getProperty(UI_PORT.key) != null) {
+  System.setProperty(UI_PORT.key, "0")
+}

Review comment:
   @tgravescs I have done that. Can we proceed further?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] TJX2014 commented on a change in pull request #28882: [SPARK-31751][SQL]Serde property `path` overwrites hive table property location

2020-06-29 Thread GitBox


TJX2014 commented on a change in pull request #28882:
URL: https://github.com/apache/spark/pull/28882#discussion_r447323119



##
File path: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
##
@@ -218,4 +219,26 @@ class HiveExternalCatalogSuite extends 
ExternalCatalogSuite {
 val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
 assert(alteredTable.provider === Some("foo"))
   }
+
+  test("SPARK-31751: serde property `path` overwrites hive table property 
location") {
+val catalog = newBasicCatalog()
+val hiveTable = CatalogTable(
+  identifier = TableIdentifier("parq_alter", Some("db1")),
+  tableType = CatalogTableType.MANAGED,
+  storage = storageFormat,
+  schema = new StructType().add("col1", "int"),
+  provider = Some("parquet"))
+catalog.createTable(hiveTable, ignoreIfExists = false)
+val beforeAlterTable = externalCatalog.getTable("db1", "parq_alter")
+
assert(beforeAlterTable.storage.locationUri.toString.contains("parq_alter"))
+
+externalCatalog.client.runSqlHive(
+  "alter table db1.parq_alter rename to db1.parq_alter2")
+
+val e = intercept[AnalysisException](
+  externalCatalog.getTable("db1", "parq_alter2")
+)
+assert(e.getMessage.contains("not equal to table prop path")
+  && e.getMessage.contains("parq_alter2"))
+  }

Review comment:
   We will get an exception when the path property is not consistent with the 
storage location.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] TJX2014 commented on pull request #28918: [SPARK-32068][WEBUI] Correct task lauchtime show issue due to timezone in stage tab

2020-06-29 Thread GitBox


TJX2014 commented on pull request #28918:
URL: https://github.com/apache/spark/pull/28918#issuecomment-651425597


   Thanks to all of you for your suggestions and attention :-)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence

2020-06-29 Thread GitBox


TJX2014 commented on a change in pull request #28926:
URL: https://github.com/apache/spark/pull/28926#discussion_r447317750



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##
@@ -2623,8 +2628,16 @@ object Sequence {
 // about a month length in days and a day length in microseconds
 val intervalStepInMicros =
   stepMicros + stepMonths * microsPerMonth + stepDays * microsPerDay
-val startMicros: Long = num.toLong(start) * scale
-val stopMicros: Long = num.toLong(stop) * scale
+
+// Date to timestamp is not equal from GMT and Chicago timezones

Review comment:
   @cloud-fan Thanks, I have followed the suggestion and made a new jira 
ticket for this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence

2020-06-29 Thread GitBox


TJX2014 commented on a change in pull request #28926:
URL: https://github.com/apache/spark/pull/28926#discussion_r447316906



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
##
@@ -2612,6 +2614,9 @@ object Sequence {
   val stepDays = step.days
   val stepMicros = step.microseconds
 
+  require(scale != MICROS_PER_DAY || stepMonths != 0 || stepDays != 0,
+"sequence step must be a day interval if start and end values are 
dates")
+
   if (stepMonths == 0 && stepMicros == 0 && scale == MICROS_PER_DAY) {

Review comment:
   It seems we need the `require` check in eval, and I removed the 
`SPARK-32198` branch code from this PR; is that ok?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28850: [SPARK-32015][Core]Remote inheritable thread local variables after spark context is stopped

2020-06-29 Thread GitBox


holdenk commented on pull request #28850:
URL: https://github.com/apache/spark/pull/28850#issuecomment-651421291


   I could see this being useful for testing with something like 
`spark-testing-base`; you often want a fresh Spark context but not a whole 
fresh JVM.
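
   As a minimal sketch of that pattern (illustrative only, not from this 
patch), a test might recreate a SparkContext inside a single JVM like this:
   
   ```scala
   import org.apache.spark.{SparkConf, SparkContext}
   
   // A local test configuration; the app name is arbitrary.
   val conf = new SparkConf().setMaster("local[2]").setAppName("fresh-context-demo")
   
   val sc1 = new SparkContext(conf)
   sc1.stop() // with this change, the inheritable thread locals are cleaned up here
   
   val sc2 = new SparkContext(conf) // a fresh context without restarting the JVM
   sc2.stop()
   ```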



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave

2020-06-29 Thread GitBox


holdenk commented on pull request #28864:
URL: https://github.com/apache/spark/pull/28864#issuecomment-651420282


   Let me know when you've had a chance @tgravescs :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave

2020-06-29 Thread GitBox


holdenk commented on pull request #28864:
URL: https://github.com/apache/spark/pull/28864#issuecomment-651420495


   > > The only other thing is that the use of the Mesos API stands out. We 
could address that as well. Types could be renamed when imported, or we could 
create subclasses and use those. Similarly, we could create traits that 
implement the Mesos method name and call `agentLost` instead.
   > 
   > Thanks for taking a look :) I think the type renames make sense, thanks 
for suggesting it. I think for the `agentLost` function we could just add it to 
`MesosSchedulerUtils` instead of adding a new trait.
   
   
   It turns out this didn't work, since we use MesosSchedulerUtils in some 
non-scheduler places. Added the new trait instead :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28911: [SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28911:
URL: https://github.com/apache/spark/pull/28911#discussion_r447311636



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1391,10 +1391,11 @@ package object config {
 
   private[spark] val SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED =
 ConfigBuilder("spark.shuffle.readHostLocalDisk")
-  .doc(s"If enabled (and `${SHUFFLE_USE_OLD_FETCH_PROTOCOL.key}` is 
disabled and external " +
-s"shuffle `${SHUFFLE_SERVICE_ENABLED.key}` is enabled), shuffle " +
-"blocks requested from those block managers which are running on the 
same host are read " +
-"from the disk directly instead of being fetched as remote blocks over 
the network.")
+  .doc(s"If enabled (and `${SHUFFLE_USE_OLD_FETCH_PROTOCOL.key}` is 
disabled and 1) external " +
+s"shuffle `${SHUFFLE_SERVICE_ENABLED.key}` is enabled or 2) 
${DYN_ALLOCATION_ENABLED.key}" +
+s" is disabled), shuffle blocks requested from those block managers 
which are running on " +
+s"the same host are read from the disk directly instead of being 
fetched as remote blocks" +
+s" over the network.")

Review comment:
   Maybe add some extra words noting that this depends on having non-isolated 
container storage (e.g. a shared hostPath in the k8s world).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhli1142015 commented on pull request #28949: [SPARK-32028][WEBUI][FOLLOWUP] fix app id link for multi attempts app in history summary page

2020-06-29 Thread GitBox


zhli1142015 commented on pull request #28949:
URL: https://github.com/apache/spark/pull/28949#issuecomment-651418623


   @srowen, thanks for taking care of this. It looks good to me.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation

2020-06-29 Thread GitBox


holdenk commented on pull request #28951:
URL: https://github.com/apache/spark/pull/28951#issuecomment-651415959


   Good catch. LGTM, but I'll leave it open for a bit in case a SQL committer 
has any thoughts.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1

2020-06-29 Thread GitBox


holdenk commented on pull request #28950:
URL: https://github.com/apache/spark/pull/28950#issuecomment-651414728


   Thanks for the ping @dongjoon-hyun, and thanks for working on this PR 
@codesue; I've been meaning to take a look at cloudpickle's updates. @viirya I 
think backporting cloudpickle changes is a bit risky given how core it is. I'd 
rather target the latest cloudpickle at Spark 3.1, if that sounds good to 
folks?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected

2020-06-29 Thread GitBox


AmplabJenkins removed a comment on pull request #28946:
URL: https://github.com/apache/spark/pull/28946#issuecomment-651096568


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected

2020-06-29 Thread GitBox


holdenk commented on pull request #28946:
URL: https://github.com/apache/spark/pull/28946#issuecomment-651413402


   Jenkins ok to test
   cc @BryanCutler 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28944: [SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config

2020-06-29 Thread GitBox


AmplabJenkins removed a comment on pull request #28944:
URL: https://github.com/apache/spark/pull/28944#issuecomment-650986142


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28944: [SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config

2020-06-29 Thread GitBox


holdenk commented on pull request #28944:
URL: https://github.com/apache/spark/pull/28944#issuecomment-651413175


   Jenkins ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28933: [SPARK-32104][SQL]Avoid full outer join OOM on skewed dataset

2020-06-29 Thread GitBox


AmplabJenkins removed a comment on pull request #28933:
URL: https://github.com/apache/spark/pull/28933#issuecomment-650117903


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28933: [SPARK-32104][SQL]Avoid full outer join OOM on skewed dataset

2020-06-29 Thread GitBox


holdenk commented on pull request #28933:
URL: https://github.com/apache/spark/pull/28933#issuecomment-651412836


   Jenkins ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28924: [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor

2020-06-29 Thread GitBox


holdenk commented on pull request #28924:
URL: https://github.com/apache/spark/pull/28924#issuecomment-651412453


   Also, for the `user facing` change: maybe note "fewer failures", which is good and 
which we should call out here so we can mention it in the release notes and encourage 
folks to upgrade.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28924: [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28924:
URL: https://github.com/apache/spark/pull/28924#discussion_r447303794



##
File path: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -95,6 +97,13 @@ class BlockManagerMasterEndpoint(
   private val externalShuffleServiceRddFetchEnabled: Boolean = 
externalBlockStoreClient.isDefined
   private val externalShuffleServicePort: Int = 
StorageUtils.externalShuffleServicePort(conf)
 
+  private lazy val driverEndpoint = {

Review comment:
   Would `makeDriverRef` in `RpcUtils` be appropriate here?
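
   A minimal sketch of what that could look like, assuming `conf` and `rpcEnv` are 
already in scope on the endpoint and that the driver-side endpoint being looked up is 
the one registered as "CoarseGrainedScheduler" (both are assumptions, not taken from 
this PR):
   ```
   import org.apache.spark.rpc.RpcEndpointRef
   import org.apache.spark.util.RpcUtils

   // Reuse the existing helper instead of hand-building the driver endpoint ref.
   private lazy val driverEndpoint: RpcEndpointRef =
     RpcUtils.makeDriverRef("CoarseGrainedScheduler", conf, rpcEnv)
   ```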

##
File path: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -168,6 +177,37 @@ class BlockManagerMasterEndpoint(
   stop()
   }
 
+  private def handleFailure[T](

Review comment:
   I think this function could use a docstring.
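
   For instance, something along these lines; the parameter names, return type, and 
body below are reconstructed from the call sites further down in this diff, so treat 
them as a sketch rather than the PR's actual code:
   ```
   /**
    * Builds the `recover` handler used after asking a slave endpoint to remove a
    * block, shuffle, or broadcast: logs the failure for the given block manager and
    * returns `defaultValue` so the enclosing Future completes instead of failing
    * the whole removal.
    */
   private def handleFailure[T](
       failureType: String,
       resourceId: String,
       blockManagerId: BlockManagerId,
       defaultValue: T): PartialFunction[Throwable, T] = {
     case e: Throwable =>
       logWarning(s"Error trying to remove $failureType $resourceId " +
         s"from block manager $blockManagerId", e)
       defaultValue
   }
   ```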

##
File path: core/src/main/scala/org/apache/spark/util/RpcUtils.scala
##
@@ -54,6 +56,12 @@ private[spark] object RpcUtils {
 RpcTimeout(conf, Seq(RPC_LOOKUP_TIMEOUT.key, NETWORK_TIMEOUT.key), "120s")
   }
 
+  /**
+   * Infinite timeout is used internally, so there's no actual timeout 
property controls it.
+   * And timeout property should never be accessed since infinite means we 
never timeout.

Review comment:
   I'm not sure I follow this sentence; could you try to reword it?

##
File path: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -350,11 +388,13 @@ class BlockManagerMasterEndpoint(
 if (locations != null) {
   locations.foreach { blockManagerId: BlockManagerId =>
 val blockManager = blockManagerInfo.get(blockManagerId)
-if (blockManager.isDefined) {
+blockManager.foreach { bm =>
   // Remove the block from the slave's BlockManager.
   // Doesn't actually wait for a confirmation and the message might 
get lost.
   // If message loss becomes frequent, we should add retry logic here.
-  blockManager.get.slaveEndpoint.ask[Boolean](RemoveBlock(blockId))
+  bm.slaveEndpoint.ask[Boolean](RemoveBlock(blockId)).recover {
+handleFailure("block", blockId.toString, bm.blockManagerId, false)

Review comment:
   Same comment as before.

##
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
##
@@ -177,6 +180,95 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 blockManager.stop()
   }
 
+  private def setupBlockManagerMasterWithBlocks(withLost: Boolean): Unit = {
+// set up a simple DriverEndpoint which simply adds executorIds and
+// check whether a certain executorId has been added before.

Review comment:
   nit: s/check/checks/

##
File path: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##
@@ -235,7 +273,9 @@ class BlockManagerMasterEndpoint(
 val removeMsg = RemoveShuffle(shuffleId)
 Future.sequence(
   blockManagerInfo.values.map { bm =>
-bm.slaveEndpoint.ask[Boolean](removeMsg)
+bm.slaveEndpoint.ask[Boolean](removeMsg).recover {
+  handleFailure("shuffle", shuffleId.toString, bm.blockManagerId, 
false)

Review comment:
   Either add the comment as in the previous call, or pass the flag by name here for 
clarity.
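
   For reference, a sketch of the pass-by-name variant (the parameter name 
`defaultValue` is a guess based on the call shape, not the name used in the PR):
   ```
   bm.slaveEndpoint.ask[Boolean](removeMsg).recover {
     // Spell out at the call site what the trailing boolean controls.
     handleFailure("shuffle", shuffleId.toString, bm.blockManagerId, defaultValue = false)
   }
   ```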

##
File path: core/src/main/scala/org/apache/spark/util/RpcUtils.scala
##
@@ -54,6 +56,12 @@ private[spark] object RpcUtils {
 RpcTimeout(conf, Seq(RPC_LOOKUP_TIMEOUT.key, NETWORK_TIMEOUT.key), "120s")
   }
 
+  /**
+   * Infinite timeout is used internally, so there's no actual timeout 
property controls it.
+   * And timeout property should never be accessed since infinite means we 
never timeout.
+   * */

Review comment:
   nit: `* */`; we use `*/` more commonly in Spark.

##
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
##
@@ -93,6 +94,7 @@ class BlockManagerSuite extends SparkFunSuite with Matchers 
with BeforeAndAfterE
   .set(MEMORY_STORAGE_FRACTION, 0.999)
   .set(Kryo.KRYO_SERIALIZER_BUFFER_SIZE.key, "1m")
   .set(STORAGE_UNROLL_MEMORY_THRESHOLD, 512L)
+  .set(Network.RPC_ASK_TIMEOUT, "5s")

Review comment:
   Any particular reason why 5?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] manuzhang commented on pull request #28916: [SPARK-32083][SQL] Coalesce to one partition when all partitions are empty in AQE

2020-06-29 Thread GitBox


manuzhang commented on pull request #28916:
URL: https://github.com/apache/spark/pull/28916#issuecomment-651407512


   @viirya @cloud-fan 
   I've updated the PR description with an example. This is more of an 
improvement I propose for certain cases. Please let me know whether it makes 
sense.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] warrenzhu25 commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-29 Thread GitBox


warrenzhu25 commented on pull request #28942:
URL: https://github.com/apache/spark/pull/28942#issuecomment-651399906


   > Hi @warrenzhu25 , thank you for your contribution.
   > This PR seems to add a new feature so could you add a testcase for it?
   > You can find tests for the status API in `UISeleniumSuite` and `HistoryServerSuite`.
   
   Added a UT, but it seems the doc build failed; that looks unrelated to my change.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors

2020-06-29 Thread GitBox


holdenk commented on pull request #28619:
URL: https://github.com/apache/spark/pull/28619#issuecomment-651388307


   Took a quick look, thanks for working on this. I think having a timeout to 
kill the executors regardless (e.g. a max decommissioning time) and the 
speculation are both useful. I'll follow up more once we've decided on the 
design in OSS.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #28948: [SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options

2020-06-29 Thread GitBox


gengliangwang commented on a change in pull request #28948:
URL: https://github.com/apache/spark/pull/28948#discussion_r447272447



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##
@@ -248,12 +248,17 @@ class CacheManager extends Logging with 
AdaptiveSparkPlanHelper {
* `HadoopFsRelation` node(s) as part of its logical plan.
*/
   def recacheByPath(spark: SparkSession, resourcePath: String): Unit = {

Review comment:
   How about we change the method to
   ```
   def recacheByPath(spark: SparkSession, resourcePath: String, options: Map[String, String] = Map.empty)
   ```
   so that we can avoid the new method below?
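
   A hypothetical illustration of the resulting call sites, assuming a `CacheManager` 
instance named `cacheManager` and data source options held in `dsOptions` (both names 
are made up for the example); the default argument lets existing callers stay untouched 
while new callers pass options explicitly:
   ```
   // Existing callers keep working without changes:
   cacheManager.recacheByPath(spark, path)
   // Callers that need Hadoop file system options pass them explicitly:
   cacheManager.recacheByPath(spark, path, dsOptions)
   ```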





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r447249716



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##
@@ -242,8 +244,7 @@ private[spark] class BlockManager(
 
   private var blockReplicationPolicy: BlockReplicationPolicy = _
 
-  private var blockManagerDecommissioning: Boolean = false
-  private var decommissionManager: Option[BlockManagerDecommissionManager] = 
None
+  @volatile private var decommissioner: Option[BlockManagerDecommissioner] = 
None

Review comment:
   That's true. If we drop it, we might also accept remove block puts after 
we've started decommissioning, though. It depends on how much we want to avoid that.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r447247802



##
File path: core/src/main/scala/org/apache/spark/storage/BlockId.scala
##
@@ -40,6 +40,9 @@ sealed abstract class BlockId {
   def isRDD: Boolean = isInstanceOf[RDDBlockId]
   def isShuffle: Boolean = isInstanceOf[ShuffleBlockId] || 
isInstanceOf[ShuffleBlockBatchId]
   def isBroadcast: Boolean = isInstanceOf[BroadcastBlockId]
+  def isInternalShuffle: Boolean = {

Review comment:
   Looking at it, it's not widely used; I'll audit each use case and then decide.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r447247346



##
File path: 
core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
##
@@ -148,6 +170,82 @@ private[spark] class IndexShuffleBlockResolver(
 }
   }
 
+  /**
+   * Write a provided shuffle block as a stream. Used for block migrations.
+   * ShuffleBlockBatchIds must contain the full range represented in the 
ShuffleIndexBlock.
+   * Requires the caller to delete any shuffle index blocks where the shuffle 
block fails to
+   * put.
+   */
+  override def putShuffleBlockAsStream(blockId: BlockId, serializerManager: 
SerializerManager):
+  StreamCallbackWithID = {
+val file = blockId match {
+  case ShuffleIndexBlockId(shuffleId, mapId, _) =>
+getIndexFile(shuffleId, mapId)
+  case ShuffleDataBlockId(shuffleId, mapId, _) =>
+getDataFile(shuffleId, mapId)
+  case _ =>
+throw new Exception(s"Unexpected shuffle block transfer ${blockId} as 
" +

Review comment:
   sgtm





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown

2020-06-29 Thread GitBox


holdenk commented on a change in pull request #28708:
URL: https://github.com/apache/spark/pull/28708#discussion_r447247000



##
File path: 
core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
##
@@ -55,6 +58,25 @@ private[spark] class IndexShuffleBlockResolver(
 
   def getDataFile(shuffleId: Int, mapId: Long): File = getDataFile(shuffleId, 
mapId, None)
 
+  /**
+   * Get the shuffle files that are stored locally. Used for block migrations.
+   */
+  override def getStoredShuffles(): Set[ShuffleBlockInfo] = {
+// Matches ShuffleIndexBlockId name
+val pattern = "shuffle_(\\d+)_(\\d+)_.+\\.index".r
+val rootDirs = blockManager.diskBlockManager.localDirs
+// ExecutorDiskUtil puts things inside one level hashed sub directories
+val searchDirs = rootDirs.flatMap(_.listFiles()).filter(_.isDirectory()) 
++ rootDirs
+val filenames = searchDirs.flatMap(_.list())
+logDebug(s"Got block files ${filenames.toList}")
+filenames.flatMap { fname =>
+  pattern.findAllIn(fname).matchData.map {
+matched => ShuffleBlockInfo(matched.group(1).toInt, 
matched.group(2).toLong)
+  }
+}.toSet

Review comment:
   There shouldn't be any duplicates, but if there are, we only need to 
transfer one anyway.
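
   A tiny self-contained illustration of that point (the file names below are made up): 
the `.index`-anchored pattern never matches data files, and the final `.toSet` collapses 
any accidental duplicates into a single `(shuffleId, mapId)` entry.
   ```
   val pattern = "shuffle_(\\d+)_(\\d+)_.+\\.index".r
   val filenames = Seq("shuffle_0_12_0.index", "shuffle_0_12_0.data", "shuffle_0_12_0.index")
   val infos = filenames.flatMap { fname =>
     pattern.findAllIn(fname).matchData.map(m => (m.group(1).toInt, m.group(2).toLong))
   }.toSet
   // infos == Set((0, 12L)): only one entry survives, so each map output is migrated once.
   ```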





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #27620: [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond maxFilesPerTrigger as unread files

2020-06-29 Thread GitBox


HeartSaVioR commented on pull request #27620:
URL: https://github.com/apache/spark/pull/27620#issuecomment-651313422


   retest this, please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


