[GitHub] [spark] dilipbiswal commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
dilipbiswal commented on a change in pull request #28953: URL: https://github.com/apache/spark/pull/28953#discussion_r447427847 ## File path: docs/sql-data-sources-jdbc.md ## @@ -156,6 +156,20 @@ the following case-insensitive options: + + preActions + + You can specify custom queries which you want to run before reading data from JDBC or writing data to JDBC. It is called per DataFrame, not per session. + + + + + postActions + +This is a JDBC writer related option. You can specify custom queries which you want to run after writing data to JDBC. It is called per DataFrame, not per session. Review comment: @moomindani Same question for post action. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
dilipbiswal commented on a change in pull request #28953: URL: https://github.com/apache/spark/pull/28953#discussion_r447427697 ## File path: docs/sql-data-sources-jdbc.md ## @@ -156,6 +156,20 @@ the following case-insensitive options: + + preActions + + You can specify custom queries which you want to run before reading data from JDBC or writing data to JDBC. It is called per DataFrame, not per session. Review comment: @moomindani Do we allow multiple pre actions ? Or we allow just one ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation
dilipbiswal commented on a change in pull request #28951: URL: https://github.com/apache/spark/pull/28951#discussion_r447426571 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ## @@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers { } } } + + test("SPARK-32131 Fix wrong column index when we have more than two columns" + +" during union and set operations" ) { +val firstTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val secondTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", TimestampType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val thirdTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", TimestampType)(), + AttributeReference("d", FloatType)()) + +val fourthTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", TimestampType)()) + +val a1 = firstTable.output(0) +val b1 = firstTable.output(1) +val c1 = firstTable.output(2) +val d1 = firstTable.output(3) + +val a2 = secondTable.output(0) +val b2 = secondTable.output(1) +val c2 = secondTable.output(2) +val d2 = secondTable.output(3) + +val a3 = thirdTable.output(0) Review comment: ditto ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ## @@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers { } } } + + test("SPARK-32131 Fix wrong column index when we have more than two columns" + +" during union and set operations" ) { +val firstTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val secondTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", TimestampType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val thirdTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", TimestampType)(), + AttributeReference("d", FloatType)()) + +val fourthTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", TimestampType)()) + +val a1 = firstTable.output(0) +val b1 = firstTable.output(1) +val c1 = firstTable.output(2) +val d1 = firstTable.output(3) + +val a2 = secondTable.output(0) +val b2 = secondTable.output(1) +val c2 = secondTable.output(2) +val d2 = secondTable.output(3) + +val a3 = thirdTable.output(0) +val b3 = thirdTable.output(1) +val c3 = thirdTable.output(2) +val d3 = thirdTable.output(3) + +val a4 = fourthTable.output(0) Review comment: ditto This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation
dilipbiswal commented on a change in pull request #28951: URL: https://github.com/apache/spark/pull/28951#discussion_r447426485 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ## @@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers { } } } + + test("SPARK-32131 Fix wrong column index when we have more than two columns" + +" during union and set operations" ) { +val firstTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val secondTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", TimestampType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val thirdTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", TimestampType)(), + AttributeReference("d", FloatType)()) + +val fourthTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", TimestampType)()) + +val a1 = firstTable.output(0) Review comment: @GuoPhilipse Variables a1, b1, c1, d1 not used ? Were you planning to use it in the test later ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on a change in pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation
dilipbiswal commented on a change in pull request #28951: URL: https://github.com/apache/spark/pull/28951#discussion_r447426527 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ## @@ -831,4 +831,77 @@ class AnalysisSuite extends AnalysisTest with Matchers { } } } + + test("SPARK-32131 Fix wrong column index when we have more than two columns" + +" during union and set operations" ) { +val firstTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val secondTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", TimestampType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", FloatType)()) + +val thirdTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", TimestampType)(), + AttributeReference("d", FloatType)()) + +val fourthTable = LocalRelation( + AttributeReference("a", StringType)(), + AttributeReference("b", DoubleType)(), + AttributeReference("c", IntegerType)(), + AttributeReference("d", TimestampType)()) + +val a1 = firstTable.output(0) +val b1 = firstTable.output(1) +val c1 = firstTable.output(2) +val d1 = firstTable.output(3) + +val a2 = secondTable.output(0) Review comment: ditto This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liancheng commented on a change in pull request #28948: [SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options
liancheng commented on a change in pull request #28948: URL: https://github.com/apache/spark/pull/28948#discussion_r447426398 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ## @@ -248,12 +248,17 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper { * `HadoopFsRelation` node(s) as part of its logical plan. */ def recacheByPath(spark: SparkSession, resourcePath: String): Unit = { Review comment: That also works, but it feels a little bit weird to couple the data source options concept with the cache manager... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
imback82 commented on a change in pull request #28676: URL: https://github.com/apache/spark/pull/28676#discussion_r447418040 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala ## @@ -554,7 +554,7 @@ class AdaptiveQueryExecSuite val smj = findTopLevelSortMergeJoin(plan) assert(smj.size == 2) val smj2 = findTopLevelSortMergeJoin(adaptivePlan) - assert(smj2.size == 2, origPlan.toString) + assert(smj2.size == 1, origPlan.toString) Review comment: @cloud-fan (the initial tagging didn't work), do you have a suggestion on an example where a plan change could introduce a greater cost? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
imback82 commented on a change in pull request #28676: URL: https://github.com/apache/spark/pull/28676#discussion_r447411442 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala ## @@ -60,6 +62,92 @@ case class BroadcastHashJoinExec( } } + override def outputPartitioning: Partitioning = { +val (buildKeys, streamedKeys) = buildSide match { + case BuildLeft => (leftKeys, rightKeys) + case BuildRight => (rightKeys, leftKeys) +} + +joinType match { + case _: InnerLike => +streamedPlan.outputPartitioning match { + case h: HashPartitioning => +getBuildSidePartitioning(h, streamedKeys, buildKeys) match { + case Some(p) => PartitioningCollection(Seq(h, p)) + case None => h +} + case c: PartitioningCollection => +c.partitionings.foreach { + case h: HashPartitioning => +getBuildSidePartitioning(h, streamedKeys, buildKeys) match { + case Some(p) => return PartitioningCollection(c.partitionings :+ p) + case None => () +} + case _ => () +} +c + case other => other +} + case _ => streamedPlan.outputPartitioning +} + } + + /** Review comment: In that case, I can revert back to this [commit](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2). Do you see anything missing from that commit? [This](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2#r434999159) seems fine right (meaning we can rely on the streamed side's output partitioning)? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] imback82 commented on a change in pull request #28676: [WIP][SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
imback82 commented on a change in pull request #28676: URL: https://github.com/apache/spark/pull/28676#discussion_r447411442 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala ## @@ -60,6 +62,92 @@ case class BroadcastHashJoinExec( } } + override def outputPartitioning: Partitioning = { +val (buildKeys, streamedKeys) = buildSide match { + case BuildLeft => (leftKeys, rightKeys) + case BuildRight => (rightKeys, leftKeys) +} + +joinType match { + case _: InnerLike => +streamedPlan.outputPartitioning match { + case h: HashPartitioning => +getBuildSidePartitioning(h, streamedKeys, buildKeys) match { + case Some(p) => PartitioningCollection(Seq(h, p)) + case None => h +} + case c: PartitioningCollection => +c.partitionings.foreach { + case h: HashPartitioning => +getBuildSidePartitioning(h, streamedKeys, buildKeys) match { + case Some(p) => return PartitioningCollection(c.partitionings :+ p) + case None => () +} + case _ => () +} +c + case other => other +} + case _ => streamedPlan.outputPartitioning +} + } + + /** Review comment: In that case, I can revert back to this [commit](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2). Do you see anything missing from that commit? [This](https://github.com/apache/spark/pull/28676/commits/683a70528e4cfdf79e4ef9596e728c768c86afa2#r434999159) seems fine right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #28916: [SPARK-32083][SQL] Coalesce to one partition when all partitions are empty in AQE
cloud-fan commented on pull request #28916: URL: https://github.com/apache/spark/pull/28916#issuecomment-651527077 I think the key problem is we skip `CoalesceShufflePartitions` when `ShuffleQueryStageExec#mapStats` is None. This can happen when the input RDD of the shuffle has 0 partitions. I think we should still apply `CoalesceShufflePartitions` in this case and wrap `ShuffleQueryStageExec` with `CustomShuffleReaderExec` with `partitionSpecs` as Nil. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28940: [SPARK-32121][SHUFFLE][TEST] Fix ExternalShuffleBlockResolverSuite failed on Windows
HyukjinKwon commented on pull request #28940: URL: https://github.com/apache/spark/pull/28940#issuecomment-651517591 Build started: [CORE] `org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite` [![PR-28940](https://ci.appveyor.com/api/projects/status/github/HyukjinKwon/spark?branch=92C9A950-6909-4EB4-93E1-18523B43DF46=true)](https://ci.appveyor.com/project/HyukjinKwon/spark/branch/92C9A950-6909-4EB4-93E1-18523B43DF46) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled
viirya commented on pull request #28952: URL: https://github.com/apache/spark/pull/28952#issuecomment-651516656 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
beliefer commented on a change in pull request #28917: URL: https://github.com/apache/spark/pull/28917#discussion_r447395194 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi override def beforeEach(): Unit = { super.beforeEach() -init(new SparkConf()) + } + + override protected def test(testName: String, testTags: Tag*)(testFun: => Any) + (implicit pos: Position): Unit = { +testWithSparkConf(testName, testTags: _*)()(testFun)(pos) + } + + private def testWithSparkConf(testName: String, testTags: Tag*) + (pairs: (String, String)*)(testFun: => Any)(implicit pos: Position): Unit = { +super.test(testName, testTags: _*) { + withSparkConf(pairs: _*)(testFun) +} + } + + /** Sets all configurations specified in `pairs`, calls `init`, and then calls `testFun` */ + private def withSparkConf(pairs: (String, String)*)(testFun: => Any): Unit = { +val conf = new SparkConf() +pairs.foreach(kv => conf.set(kv._1, kv._2)) +init(conf) Review comment: OK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
beliefer commented on a change in pull request #28917: URL: https://github.com/apache/spark/pull/28917#discussion_r447395077 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi override def beforeEach(): Unit = { super.beforeEach() -init(new SparkConf()) + } + + override protected def test(testName: String, testTags: Tag*)(testFun: => Any) + (implicit pos: Position): Unit = { +testWithSparkConf(testName, testTags: _*)()(testFun)(pos) + } + + private def testWithSparkConf(testName: String, testTags: Tag*) Review comment: Most SQL-related configuration parameters can be changed dynamically, but most of Core's parameters are static. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation
HyukjinKwon commented on pull request #28951: URL: https://github.com/apache/spark/pull/28951#issuecomment-651512938 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1
HyukjinKwon commented on pull request #28950: URL: https://github.com/apache/spark/pull/28950#issuecomment-651512488 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1
HyukjinKwon commented on pull request #28950: URL: https://github.com/apache/spark/pull/28950#issuecomment-651511487 Yeah, to upgrade we should drop Python 2. I target to drop it in Spark 3.1. I will make a PR to officially drop first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kiszk commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
kiszk commented on a change in pull request #28953: URL: https://github.com/apache/spark/pull/28953#discussion_r447389935 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala ## @@ -156,9 +157,16 @@ object JDBCRDD extends Logging { val url = options.url val dialect = JdbcDialects.get(url) val quotedColumns = requiredColumns.map(colName => dialect.quoteIdentifier(colName)) + +val conn = JdbcUtils.createConnectionFactory(options) +options.preActions match { + case Some(i) => +runQuery(conn(), i, options) Review comment: nit: `conn()` -> `conn` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari edited a comment on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
erenavsarogullari edited a comment on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167 Thanks @dongjoon-hyun for the review. All comments are addressed. I think it is ready to go. Also, we plan to use Prometheus + Grafana with proposed format change (through custom release). It may still be useful for the long-term if we plan to revisit the current format in the future. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari edited a comment on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
erenavsarogullari edited a comment on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167 Thanks @dongjoon-hyun for the review. All comments are addressed. I think it is ready to go. Also, we plan to use Prometheus + Grafana with proposed format change. It may still be useful for the long-term if we plan to revisit the current format in the future. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] turboFei commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsExceptio
turboFei commented on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-651498450 Gentle ping @dongjoon-hyun @dbtsai This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
erenavsarogullari commented on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651498167 Thanks @dongjoon-hyun for the review. All comments are addressed. I think it is ready to go. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari commented on a change in pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
erenavsarogullari commented on a change in pull request #28865: URL: https://github.com/apache/spark/pull/28865#discussion_r447380991 ## File path: core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import com.codahale.metrics.{Counter, Gauge, MetricRegistry} + +import org.scalatest.PrivateMethodTester +import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite} + + +class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester { + + test("PrometheusServletSuite registered metrics test") { +val sink = createPrometheusServlet() + +val gauge = new Gauge[Double] { + override def getValue: Double = 5.0 +} + +val counter = new Counter +counter.inc(10) + +sink.registry.register("gauge1", gauge) +sink.registry.register("gauge2", gauge) +sink.registry.register("counter1", counter) + +val metricGaugeKeys = sink.registry.getGauges.keySet.asScala +assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")), + "Should contain 2 gauges metrics registered") + +val metricCounterKeys = sink.registry.getCounters.keySet.asScala +assert(metricCounterKeys.equals(Set("counter1")), + "Should contain 1 counter metric registered") + +val gaugeValues = sink.registry.getGauges.values.asScala +assert(gaugeValues.size == 2) +gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0)) + +val counterValues = sink.registry.getCounters.values.asScala +assert(counterValues.size == 1) +counterValues.foreach(counter => assert(counter.getCount == 10)) + } + + test("PrometheusServletSuite private normalizeKey function test") { +val key = "local-1592132938718.driver.LiveListenerBus." + + "listenerProcessingTime.org.apache.spark.HeartbeatReceiver" +val sink = createPrometheusServlet() +val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key) +assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" + + "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_") + } + + private def createPrometheusServlet(): PrometheusServlet = { +val props = new Properties +props.put("host", "127.0.0.1") +props.put("port", "12340") Review comment: Actually, both host and port seems not required for the current UTs so removed them. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #26339: [SPARK-27194][SPARK-29302][SQL] For dynamic partition overwrite operation, fix speculation task conflict issue and FileAlreadyExistsExcep
LuciferYang commented on pull request #26339: URL: https://github.com/apache/spark/pull/26339#issuecomment-651497369 @dongjoon-hyun @turboFei Is this PR still being worked on? We are having similar issues in our production environment, and I found there are similar PRs try to solve this problem, such as #26090, #26971 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627 (IMHO it might be still good chance to leverage this PR to construct a good way for versioning properly - so that version 2 can be used as an interim with best practice on versioning, and we get version 3 with such major features included without headache on versioning.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
dongjoon-hyun commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-651487745 Retest this please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
dongjoon-hyun commented on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651487054 Retest this please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1
dongjoon-hyun commented on pull request #28950: URL: https://github.com/apache/spark/pull/28950#issuecomment-651485642 +1 for @holdenk 's advice. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sarutak commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API
sarutak commented on pull request #28942: URL: https://github.com/apache/spark/pull/28942#issuecomment-651485037 ok to test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API
AmplabJenkins removed a comment on pull request #28942: URL: https://github.com/apache/spark/pull/28942#issuecomment-650836727 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sarutak commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API
sarutak commented on pull request #28942: URL: https://github.com/apache/spark/pull/28942#issuecomment-651485158 cc: @squito This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627 (IMHO it might be still good to leverage this PR to be a chance to construct a good way for versioning properly - so that version 2 can be used as an interim with best practice on versioning, and we get version 3 with such major features included without headache on versioning.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
beliefer commented on a change in pull request #28917: URL: https://github.com/apache/spark/pull/28917#discussion_r447372063 ## File path: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ## @@ -278,7 +280,26 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi override def beforeEach(): Unit = { super.beforeEach() -init(new SparkConf()) + } + + override protected def test(testName: String, testTags: Tag*)(testFun: => Any) + (implicit pos: Position): Unit = { +testWithSparkConf(testName, testTags: _*)()(testFun)(pos) + } + + private def testWithSparkConf(testName: String, testTags: Tag*) + (pairs: (String, String)*)(testFun: => Any)(implicit pos: Position): Unit = { +super.test(testName, testTags: _*) { + withSparkConf(pairs: _*)(testFun) +} + } + + /** Sets all configurations specified in `pairs`, calls `init`, and then calls `testFun` */ + private def withSparkConf(pairs: (String, String)*)(testFun: => Any): Unit = { Review comment: OK. I will simulate `SQLHelper`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign metadata log as well as file stream source itself, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade, **before proceeding 2**. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirements altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll try to persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case instead of thinking too general. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign metadata log as well as file stream source itself, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirements altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll try to persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case instead of thinking too general. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirements altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll try to persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case instead of thinking too general. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirement altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirements altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirements altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll try to persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578 @zsxwing Thanks a lot for your detailed comment! I think considering all of these would take me to redesign, which wasn't a goal actually. As I commented previously, building a holistic solution was not a goal, because I already indicated it should take considerable time, and someone might claim that it's reinventing the wheel. That said, shall we do the following? 1. Review and merge other PRs which don't incur metadata version upgrade. * #28904 * #28422 * #28363 * #27649 * #27620 2. File an issue to address these requirement altogether, with one of PMC member being shepherd. (It would be amazing if you can volunteer.) Once someone volunteers and promises to make thing forward, I'll persuade my employer to allow me to take the work. If it's not desired to go with 2, I'd ask to make the file source/sink metadata log class be pluggable at least (I'll craft a PR soon if that's the way to go), so that someone can move forward with restricting the use case. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector
maropu commented on pull request #28863: URL: https://github.com/apache/spark/pull/28863#issuecomment-651478897 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28864: [SPARK-32004][ALL] Drop references to slave
holdenk commented on a change in pull request #28864: URL: https://github.com/apache/spark/pull/28864#discussion_r447366211 ## File path: resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala ## @@ -48,7 +46,7 @@ private[spark] class MesosExecutorBackend val mesosTaskId = TaskID.newBuilder().setValue(taskId.toString).build() driver.sendStatusUpdate(MesosTaskStatus.newBuilder() .setTaskId(mesosTaskId) - .setState(taskStateToMesos(state)) + .setState(MesosSchedulerBackendUtil.taskStateToMesos(state)) Review comment: Yeah, I was in the area and noticed a TODO I could resolve quickly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
maropu commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-651469890 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
maropu commented on a change in pull request #28953: URL: https://github.com/apache/spark/pull/28953#discussion_r447359906 ## File path: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ## @@ -574,6 +576,41 @@ class JDBCWriteSuite extends SharedSparkSession with BeforeAndAfter { } } + test("SPARK-32013: option preActions/postActions, run SQL before writing data.") { Review comment: nit: Basically, we don't need the prefix, e.g., SPARK-32013, when adding new features. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] moomindani edited a comment on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
moomindani edited a comment on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-651467865 @dilipbiswal Sure I added it in this PR description. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LantaoJin commented on pull request #28935: [SPARK-20680][SQL] Adding HiveVoidType in Spark to be compatible with Hive
LantaoJin commented on pull request #28935: URL: https://github.com/apache/spark/pull/28935#issuecomment-651469114 @cloud-fan I refactor some codes, now I think this PR could be no dependency. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] moomindani commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
moomindani commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-651467865 @dilipbiswal Sure I will add it in this PR description. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dilipbiswal commented on pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
dilipbiswal commented on pull request #28953: URL: https://github.com/apache/spark/pull/28953#issuecomment-651466295 @moomindani Thanks. Could we illustrate the usage of these two options via examples in the PR description ? I think, it will help the reviewers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xianyinxin commented on pull request #28943: [SPARK-32127][SQL]: Check rules for MERGE INTO should use MergeAction.conditition other than MergeAction.children
xianyinxin commented on pull request #28943: URL: https://github.com/apache/spark/pull/28943#issuecomment-651465152 Thanks @cloud-fan ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xianyinxin commented on pull request #28875: [SPARK-32030][SPARK-32127][SQL] Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO
xianyinxin commented on pull request #28875: URL: https://github.com/apache/spark/pull/28875#issuecomment-651465114 Thanks @cloud-fan ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] moomindani opened a new pull request #28953: [SPARK-32013][SQL] Support query execution before/after reading/writing DataFrame over JDBC
moomindani opened a new pull request #28953: URL: https://github.com/apache/spark/pull/28953 ### What changes were proposed in this pull request? This pull request is to support query execution before/after reading/writing over JDBC. There are two new options; `preActions` and `postActions` in DataFrame's JDBC configuration. SQL statements specified in `preActions` will be executed before reading/writing DataFrame via JDBC. SQL statements specified in `postActions` will be executed after writing DataFrame via JDBC. Note: `postActions` is only supported in JDBC writer, not in JDBC reader. It is because it won't be needed so often and I was not able to find good places to implement this. ### Why are the changes needed? For ETL workload, there is a common requirement to perform SQL statement before/after reading/writing over JDBC. Here's examples; - Create a view with specific conditions - Delete/Update some records - Truncate a table (it is already possible in `truncate` option) - Execute stored procedure (it is also requested in SPARK-32014) ### Does this PR introduce _any_ user-facing change? Yes. With this feature, users can run any SQL statements before/after reading/writing DataFrame over JDBC. It does not affect any existing behavior. It just adds new options to use this new feature. ### How was this patch tested? I added test cases into `JDBCSuite.scala` and `JDBCWriteSuite.scala`, and confirmed all the tests have been passed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tharradine commented on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected
tharradine commented on pull request #28946: URL: https://github.com/apache/spark/pull/28946#issuecomment-651455471 This isn't exactly the SPARK-32123 fix I was expecting, I was expecting the behaviour mentioned in the [docs](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#timestamp-with-time-zone-semantics) to be adhered to, specifically: > When timestamp data is exported or displayed in Spark, the session time zone is used to localize the timestamp values. Is it not preferable to perform this localization whilst converting from/to internal timestamp datatypes, and preserve the documented behaviour? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
zsxwing commented on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651454246 The numbers are pretty impressive. Thanks a lot for your work. My high level comments regarding the PR: - The compression codec should not be hardcoded. It's better to allow the user to config the writer codec like other codec configs in Spark. The reader should be able to identify the codec and load files correctly. Otherwise, we would need to bump the version again when adding more codec support. - The cost of bumping file sink version is much higher than the checkpoint. The file sink metadata should be compatible with old versions if possible like other storage formats such as parquet, orc. - For example, in some company, the team who generates the output may not be able to convince the downstream to upgrade their Spark version. - The file sink metadata can be read by an external system. Bumping the version will need the ecosystem to build a new reader. - If a user hits a critical issue in new Spark version, they may want to rollback. Bumping the version will break this workflow. - It's better to make the default behavior like this: - Use v1 if an existing directory is using v1, so that we can still support rollback. The user can set a flag to enable v2 explicitly. - Use v2 if it's a new directory. I'm wondering if we can step back to think about whether it is possible solve the issue without bumping the version. IMO, the major issue is the number of files increases infinitely over time. This PR increase the upper bound but doesn't change the complexity. Maybe we should try to see how to reduce the number of files, such as supporting data compaction? You may notice that `FileStreamSinkLog.DELETE_ACTION` is never used. This is because it was added to support data compaction but we didn't implement it yet. I totally understand that we may not be able to solve the metadata issue without bumping the file metadata version. Then it's better to design a better file sink log format to solve all of the following issues so that we don't need to bump the version again in future: - Data compaction support. - Open source format. It would be great that the new file sink log format can be read by Spark and other systems. An extra benefit is we can use Spark itself to process the metadata so that we don't have to put all metadata in the driver memory. - Better file stream source support. Since we always append files to the file sink, when reading a file sink using streaming queries, it would be great we can locate the new appended files quickly. For example, we can use `log file name + line number` as the streaming source offset instead, then we can jump to the log file directly. - Filter push down support when reading a file sink directory. For example, if a user queries only one partition, we don't need to load metadata of other partitions into the driver memory. - Decouple the file sink from the streaming query. Currently, one file sink output always maps to one streaming query. The user may want to use another streaming query or even batch query to write to the same file sink. But as we use the batch id as the file name, this is not supported. - Use relative path. Copying the file sink directory will break it today. It would be great that we remember the relative paths so that people can move or copy the directory. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector
dongjoon-hyun commented on pull request #28863: URL: https://github.com/apache/spark/pull/28863#issuecomment-651449133 Hi, @gaborgsomogyi . Is `OracleKrbIntegrationSuite` missing here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector
AmplabJenkins removed a comment on pull request #28863: URL: https://github.com/apache/spark/pull/28863#issuecomment-651138861 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124640/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28863: [SPARK-31336][SQL] Support Oracle Kerberos login in JDBC connector
dongjoon-hyun commented on pull request #28863: URL: https://github.com/apache/spark/pull/28863#issuecomment-651442483 Retest this please. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #28864: [SPARK-32004][ALL] Drop references to slave
rdblue commented on a change in pull request #28864: URL: https://github.com/apache/spark/pull/28864#discussion_r447336956 ## File path: resource-managers/mesos/src/main/scala/org/apache/spark/executor/MesosExecutorBackend.scala ## @@ -48,7 +46,7 @@ private[spark] class MesosExecutorBackend val mesosTaskId = TaskID.newBuilder().setValue(taskId.toString).build() driver.sendStatusUpdate(MesosTaskStatus.newBuilder() .setTaskId(mesosTaskId) - .setState(taskStateToMesos(state)) + .setState(MesosSchedulerBackendUtil.taskStateToMesos(state)) Review comment: Was this method moved to implement the TODO item? This looks unrelated to the rename changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
rdblue commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-651442680 The updates look good. +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #27963: [SPARK-31199]separate shuffle io connect timeout from idle timeout
github-actions[bot] commented on pull request #27963: URL: https://github.com/apache/spark/pull/27963#issuecomment-651439774 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #27971: [SPARK-31206][SQL] AQE should not use the same SubqueryExec when reuse is off
github-actions[bot] closed pull request #27971: URL: https://github.com/apache/spark/pull/27971 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
github-actions[bot] closed pull request #24939: URL: https://github.com/apache/spark/pull/24939 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][CORE][TEST] Add PrometheusServletSuite
dongjoon-hyun commented on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651438818 Also, please update the PR description consistently together. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage
dongjoon-hyun commented on a change in pull request #28865: URL: https://github.com/apache/spark/pull/28865#discussion_r447332799 ## File path: core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import com.codahale.metrics.{Counter, Gauge, MetricRegistry} + +import org.scalatest.PrivateMethodTester +import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite} + + +class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester { + + test("PrometheusServletSuite registered metrics test") { +val sink = createPrometheusServlet() + +val gauge = new Gauge[Double] { + override def getValue: Double = 5.0 +} + +val counter = new Counter +counter.inc(10) + +sink.registry.register("gauge1", gauge) +sink.registry.register("gauge2", gauge) +sink.registry.register("counter1", counter) + +val metricGaugeKeys = sink.registry.getGauges.keySet.asScala +assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")), + "Should contain 2 gauges metrics registered") + +val metricCounterKeys = sink.registry.getCounters.keySet.asScala +assert(metricCounterKeys.equals(Set("counter1")), + "Should contain 1 counter metric registered") + +val gaugeValues = sink.registry.getGauges.values.asScala +assert(gaugeValues.size == 2) +gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0)) + +val counterValues = sink.registry.getCounters.values.asScala +assert(counterValues.size == 1) +counterValues.foreach(counter => assert(counter.getCount == 10)) + } + + test("PrometheusServletSuite private normalizeKey function test") { Review comment: `normalizeKey` might be enough for the test name in this context. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage
dongjoon-hyun commented on a change in pull request #28865: URL: https://github.com/apache/spark/pull/28865#discussion_r447332927 ## File path: core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import com.codahale.metrics.{Counter, Gauge, MetricRegistry} + +import org.scalatest.PrivateMethodTester +import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite} + + +class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester { + + test("PrometheusServletSuite registered metrics test") { +val sink = createPrometheusServlet() + +val gauge = new Gauge[Double] { + override def getValue: Double = 5.0 +} + +val counter = new Counter +counter.inc(10) + +sink.registry.register("gauge1", gauge) +sink.registry.register("gauge2", gauge) +sink.registry.register("counter1", counter) + +val metricGaugeKeys = sink.registry.getGauges.keySet.asScala +assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")), + "Should contain 2 gauges metrics registered") + +val metricCounterKeys = sink.registry.getCounters.keySet.asScala +assert(metricCounterKeys.equals(Set("counter1")), + "Should contain 1 counter metric registered") + +val gaugeValues = sink.registry.getGauges.values.asScala +assert(gaugeValues.size == 2) +gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0)) + +val counterValues = sink.registry.getCounters.values.asScala +assert(counterValues.size == 1) +counterValues.foreach(counter => assert(counter.getCount == 10)) + } + + test("PrometheusServletSuite private normalizeKey function test") { +val key = "local-1592132938718.driver.LiveListenerBus." + + "listenerProcessingTime.org.apache.spark.HeartbeatReceiver" +val sink = createPrometheusServlet() +val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key) +assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" + + "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_") + } + + private def createPrometheusServlet(): PrometheusServlet = { +val props = new Properties +props.put("host", "127.0.0.1") +props.put("port", "12340") +val registry = new MetricRegistry +val securityMgr = new SecurityManager(new SparkConf(false)) +new PrometheusServlet(props, registry, securityMgr) + } + Review comment: Shall we remove empty line? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage
dongjoon-hyun commented on a change in pull request #28865: URL: https://github.com/apache/spark/pull/28865#discussion_r447332338 ## File path: core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import com.codahale.metrics.{Counter, Gauge, MetricRegistry} + +import org.scalatest.PrivateMethodTester +import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite} + + +class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester { + + test("PrometheusServletSuite registered metrics test") { Review comment: Instead of using test suite name, `PrometheusServletSuite`, shall we simple put "registered metrics" in the test name? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage
dongjoon-hyun commented on a change in pull request #28865: URL: https://github.com/apache/spark/pull/28865#discussion_r447331955 ## File path: core/src/test/scala/org/apache/spark/metrics/sink/PrometheusServletSuite.scala ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.sink + +import java.util.Properties + +import scala.collection.JavaConverters._ + +import com.codahale.metrics.{Counter, Gauge, MetricRegistry} + +import org.scalatest.PrivateMethodTester +import org.apache.spark.{SecurityManager, SparkConf, SparkFunSuite} + + +class PrometheusServletSuite extends SparkFunSuite with PrivateMethodTester { + + test("PrometheusServletSuite registered metrics test") { +val sink = createPrometheusServlet() + +val gauge = new Gauge[Double] { + override def getValue: Double = 5.0 +} + +val counter = new Counter +counter.inc(10) + +sink.registry.register("gauge1", gauge) +sink.registry.register("gauge2", gauge) +sink.registry.register("counter1", counter) + +val metricGaugeKeys = sink.registry.getGauges.keySet.asScala +assert(metricGaugeKeys.equals(Set("gauge1", "gauge2")), + "Should contain 2 gauges metrics registered") + +val metricCounterKeys = sink.registry.getCounters.keySet.asScala +assert(metricCounterKeys.equals(Set("counter1")), + "Should contain 1 counter metric registered") + +val gaugeValues = sink.registry.getGauges.values.asScala +assert(gaugeValues.size == 2) +gaugeValues.foreach(gauge => assert(gauge.getValue == 5.0)) + +val counterValues = sink.registry.getCounters.values.asScala +assert(counterValues.size == 1) +counterValues.foreach(counter => assert(counter.getCount == 10)) + } + + test("PrometheusServletSuite private normalizeKey function test") { +val key = "local-1592132938718.driver.LiveListenerBus." + + "listenerProcessingTime.org.apache.spark.HeartbeatReceiver" +val sink = createPrometheusServlet() +val suffix = sink invokePrivate PrivateMethod[String]('normalizeKey)(key) +assert(suffix == "metrics_local_1592132938718_driver_LiveListenerBus_" + + "listenerProcessingTime_org_apache_spark_HeartbeatReceiver_") + } + + private def createPrometheusServlet(): PrometheusServlet = { +val props = new Properties +props.put("host", "127.0.0.1") +props.put("port", "12340") Review comment: Is the fixed port number safe in the concurrent testing? In Apache Spark Jenkins farm, multiple testing jobs are running concurrently. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #28865: [SPARK-32026][TEST] Add PrometheusServlet Unit Test coverage
dongjoon-hyun commented on pull request #28865: URL: https://github.com/apache/spark/pull/28865#issuecomment-651435667 Retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LantaoJin commented on pull request #28947: [SPARK-32129][SQL] Support AQE skew join with Union
LantaoJin commented on pull request #28947: URL: https://github.com/apache/spark/pull/28947#issuecomment-651435088 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] TJX2014 commented on a change in pull request #28882: [SPARK-31751][SQL]Serde property `path` overwrites hive table property location
TJX2014 commented on a change in pull request #28882: URL: https://github.com/apache/spark/pull/28882#discussion_r447323119 ## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala ## @@ -218,4 +219,26 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { val alteredTable = externalCatalog.getTable("db1", "parq_tbl") assert(alteredTable.provider === Some("foo")) } + + test("SPARK-31751: serde property `path` overwrites hive table property location") { +val catalog = newBasicCatalog() +val hiveTable = CatalogTable( + identifier = TableIdentifier("parq_alter", Some("db1")), + tableType = CatalogTableType.MANAGED, + storage = storageFormat, + schema = new StructType().add("col1", "int"), + provider = Some("parquet")) +catalog.createTable(hiveTable, ignoreIfExists = false) +val beforeAlterTable = externalCatalog.getTable("db1", "parq_alter") + assert(beforeAlterTable.storage.locationUri.toString.contains("parq_alter")) + +externalCatalog.client.runSqlHive( + "alter table db1.parq_alter rename to db1.parq_alter2") + +val e = intercept[AnalysisException]( + externalCatalog.getTable("db1", "parq_alter2") +) +assert(e.getMessage.contains("not equal to table prop path") + && e.getMessage.contains("parq_alter2")) + } Review comment: We will get an exception when the path property is not consistent with storage location. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled
viirya commented on pull request #28952: URL: https://github.com/apache/spark/pull/28952#issuecomment-651432458 I also think this might be worth creating a new jira ticket, but as initially we discussed it as follow-up. So I put it as a follow-up first. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya opened a new pull request #28952: [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and sql when AQE is enabled
viirya opened a new pull request #28952: URL: https://github.com/apache/spark/pull/28952 ### What changes were proposed in this pull request? As the followup of #28900, this patch extends coalescing partitions to repartitioning using hints and SQL syntax without specifying number of partitions, when AQE is enabled. ### Why are the changes needed? When repartitionning using hints and SQL syntax, we should follow the shuffling behavior of repartition by expression/range to coalesce partitions when AQE is enabled. ### Does this PR introduce _any_ user-facing change? Yes. After this change, if users don't specify the number of partitions when repartitioning using `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE BY`/`CLUSTER BY`, AQE will coalesce partitions. ### How was this patch tested? Unit tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rajatahujaatinmobi commented on a change in pull request #28880: [SPARK-29465][YARN][WEBUI] Adding Check to not to set UI port (spark.ui.port) property if mentioned explicitly
rajatahujaatinmobi commented on a change in pull request #28880: URL: https://github.com/apache/spark/pull/28880#discussion_r446745968 ## File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ## @@ -211,9 +211,11 @@ private[spark] class ApplicationMaster( final def run(): Int = { try { val attemptID = if (isClusterMode) { -// Set the web ui port to be ephemeral for yarn so we don't conflict with -// other spark processes running on the same box -System.setProperty(UI_PORT.key, "0") +// Set the web ui port to be ephemeral for yarn if not set explicitly +// so we don't conflict with other spark processes running on the same box +if (System.getProperty(UI_PORT.key) != null) { + System.setProperty(UI_PORT.key, "0") +} Review comment: @tgravescs I have done that. Can we proceed further? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] TJX2014 commented on a change in pull request #28882: [SPARK-31751][SQL]Serde property `path` overwrites hive table property location
TJX2014 commented on a change in pull request #28882: URL: https://github.com/apache/spark/pull/28882#discussion_r447323119 ## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala ## @@ -218,4 +219,26 @@ class HiveExternalCatalogSuite extends ExternalCatalogSuite { val alteredTable = externalCatalog.getTable("db1", "parq_tbl") assert(alteredTable.provider === Some("foo")) } + + test("SPARK-31751: serde property `path` overwrites hive table property location") { +val catalog = newBasicCatalog() +val hiveTable = CatalogTable( + identifier = TableIdentifier("parq_alter", Some("db1")), + tableType = CatalogTableType.MANAGED, + storage = storageFormat, + schema = new StructType().add("col1", "int"), + provider = Some("parquet")) +catalog.createTable(hiveTable, ignoreIfExists = false) +val beforeAlterTable = externalCatalog.getTable("db1", "parq_alter") + assert(beforeAlterTable.storage.locationUri.toString.contains("parq_alter")) + +externalCatalog.client.runSqlHive( + "alter table db1.parq_alter rename to db1.parq_alter2") + +val e = intercept[AnalysisException]( + externalCatalog.getTable("db1", "parq_alter2") +) +assert(e.getMessage.contains("not equal to table prop path") + && e.getMessage.contains("parq_alter2")) + } Review comment: We will get a exception when the path property is not consistent with storage location. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] TJX2014 commented on pull request #28918: [SPARK-32068][WEBUI] Correct task lauchtime show issue due to timezone in stage tab
TJX2014 commented on pull request #28918: URL: https://github.com/apache/spark/pull/28918#issuecomment-651425597 Thanks all for your suggestion and attention very much :-) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence
TJX2014 commented on a change in pull request #28926: URL: https://github.com/apache/spark/pull/28926#discussion_r447317750 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ## @@ -2623,8 +2628,16 @@ object Sequence { // about a month length in days and a day length in microseconds val intervalStepInMicros = stepMicros + stepMonths * microsPerMonth + stepDays * microsPerDay -val startMicros: Long = num.toLong(start) * scale -val stopMicros: Long = num.toLong(stop) * scale + +// Date to timestamp is not equal from GMT and Chicago timezones Review comment: @cloud-fan Thanks, I have followed the suggestion and make a new jira ticket for this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence
TJX2014 commented on a change in pull request #28926: URL: https://github.com/apache/spark/pull/28926#discussion_r447316906 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ## @@ -2612,6 +2614,9 @@ object Sequence { val stepDays = step.days val stepMicros = step.microseconds + require(scale != MICROS_PER_DAY || stepMonths != 0 || stepDays != 0, +"sequence step must be a day interval if start and end values are dates") + if (stepMonths == 0 && stepMicros == 0 && scale == MICROS_PER_DAY) { Review comment: Seems we need the `require` check in eval, and I remove `SPARK-32198` branch code from this PR, is it ok ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28850: [SPARK-32015][Core]Remote inheritable thread local variables after spark context is stopped
holdenk commented on pull request #28850: URL: https://github.com/apache/spark/pull/28850#issuecomment-651421291 I could see this being useful in testing using something like `spark-testing-base`, you often want a fresh Spark context but not a whole fresh JVM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
holdenk commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-651420282 Let me know when you've had a chance @tgravescs :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
holdenk commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-651420495 > > The only other thing is that the use of the Mesos API stands out. We could address that as well. Types could be renamed when imported, or we could create subclasses and use those. Similarly, we could create traits that implement the Mesos method name and call `agentLost` instead. > > Thanks for taking a look :) I think the type renames make sense, thanks for suggesting it. I think for the `agentLost` function we could just add it to `MesosSchedulerUtils` instead of adding a new trait. Turns out this didn't work out, since we uses MesosSchedulerUtils in some non-scheduler places. Added the new trait instead :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28911: [SPARK-32077][CORE] Support host-local shuffle data reading when external shuffle service is disabled
holdenk commented on a change in pull request #28911: URL: https://github.com/apache/spark/pull/28911#discussion_r447311636 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1391,10 +1391,11 @@ package object config { private[spark] val SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED = ConfigBuilder("spark.shuffle.readHostLocalDisk") - .doc(s"If enabled (and `${SHUFFLE_USE_OLD_FETCH_PROTOCOL.key}` is disabled and external " + -s"shuffle `${SHUFFLE_SERVICE_ENABLED.key}` is enabled), shuffle " + -"blocks requested from those block managers which are running on the same host are read " + -"from the disk directly instead of being fetched as remote blocks over the network.") + .doc(s"If enabled (and `${SHUFFLE_USE_OLD_FETCH_PROTOCOL.key}` is disabled and 1) external " + +s"shuffle `${SHUFFLE_SERVICE_ENABLED.key}` is enabled or 2) ${DYN_ALLOCATION_ENABLED.key}" + +s" is disabled), shuffle blocks requested from those block managers which are running on " + +s"the same host are read from the disk directly instead of being fetched as remote blocks" + +s" over the network.") Review comment: Maybe add some extra words around this depends on having non-isolated container storage (e.g. a shared hostPath in the k8s world) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhli1142015 commented on pull request #28949: [SPARK-32028][WEBUI][FOLLOWUP] fix app id link for multi attempts app in history summary page
zhli1142015 commented on pull request #28949: URL: https://github.com/apache/spark/pull/28949#issuecomment-651418623 @srowen , thanks for taking care this. This looks good to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28951: [SPARK-32131][SQL] union and set operations have wrong exception infomation
holdenk commented on pull request #28951: URL: https://github.com/apache/spark/pull/28951#issuecomment-651415959 Good catch. LGTM but I'll leave it for a bit of a SQL committer has any thoughts. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28950: [SPARK-32094][PYTHON] Update cloudpickle to v1.4.1
holdenk commented on pull request #28950: URL: https://github.com/apache/spark/pull/28950#issuecomment-651414728 Thanks for the ping @dongjoon-hyun & thanks for working on this PR @codesue, I've been meaning to take a look at cloudpickle's updates. @viirya I think backporting cloudpickle changes is a bit risky given how core it is. I'd rather target the latest cloud pickle to Spark 3.1 if that sounds good to folks? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected
AmplabJenkins removed a comment on pull request #28946: URL: https://github.com/apache/spark/pull/28946#issuecomment-651096568 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28946: [SPARK-32123][PYSPARK] Setting `spark.sql.session.timeZone` only partially respected
holdenk commented on pull request #28946: URL: https://github.com/apache/spark/pull/28946#issuecomment-651413402 Jenkins ok to test cc @BryanCutler This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28944: [SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config
AmplabJenkins removed a comment on pull request #28944: URL: https://github.com/apache/spark/pull/28944#issuecomment-650986142 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28944: [SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config
holdenk commented on pull request #28944: URL: https://github.com/apache/spark/pull/28944#issuecomment-651413175 Jenkins ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28933: [SPARK-32104][SQL]Avoid full outer join OOM on skewed dataset
AmplabJenkins removed a comment on pull request #28933: URL: https://github.com/apache/spark/pull/28933#issuecomment-650117903 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28933: [SPARK-32104][SQL]Avoid full outer join OOM on skewed dataset
holdenk commented on pull request #28933: URL: https://github.com/apache/spark/pull/28933#issuecomment-651412836 Jenkins ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28924: [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor
holdenk commented on pull request #28924: URL: https://github.com/apache/spark/pull/28924#issuecomment-651412453 Also for `user facing` change maybe "less failures" which is good and we should call out here so we can mention it in the release notes and encourage folks to upgrade. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28924: [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor
holdenk commented on a change in pull request #28924: URL: https://github.com/apache/spark/pull/28924#discussion_r447303794 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -95,6 +97,13 @@ class BlockManagerMasterEndpoint( private val externalShuffleServiceRddFetchEnabled: Boolean = externalBlockStoreClient.isDefined private val externalShuffleServicePort: Int = StorageUtils.externalShuffleServicePort(conf) + private lazy val driverEndpoint = { Review comment: Would `makeDriverRef` in `RpcUtils` be appropriate here? ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -168,6 +177,37 @@ class BlockManagerMasterEndpoint( stop() } + private def handleFailure[T]( Review comment: I think this function could use a docstring. ## File path: core/src/main/scala/org/apache/spark/util/RpcUtils.scala ## @@ -54,6 +56,12 @@ private[spark] object RpcUtils { RpcTimeout(conf, Seq(RPC_LOOKUP_TIMEOUT.key, NETWORK_TIMEOUT.key), "120s") } + /** + * Infinite timeout is used internally, so there's no actual timeout property controls it. + * And timeout property should never be accessed since infinite means we never timeout. Review comment: I'm not sure I follow this sentence correctly, can you try and reword it? ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -350,11 +388,13 @@ class BlockManagerMasterEndpoint( if (locations != null) { locations.foreach { blockManagerId: BlockManagerId => val blockManager = blockManagerInfo.get(blockManagerId) -if (blockManager.isDefined) { +blockManager.foreach { bm => // Remove the block from the slave's BlockManager. // Doesn't actually wait for a confirmation and the message might get lost. // If message loss becomes frequent, we should add retry logic here. - blockManager.get.slaveEndpoint.ask[Boolean](RemoveBlock(blockId)) + bm.slaveEndpoint.ask[Boolean](RemoveBlock(blockId)).recover { +handleFailure("block", blockId.toString, bm.blockManagerId, false) Review comment: same comment as before. ## File path: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ## @@ -177,6 +180,95 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE blockManager.stop() } + private def setupBlockManagerMasterWithBlocks(withLost: Boolean): Unit = { +// set up a simple DriverEndpoint which simply adds executorIds and +// check whether a certain executorId has been added before. Review comment: nit:s/check/cheks/ ## File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala ## @@ -235,7 +273,9 @@ class BlockManagerMasterEndpoint( val removeMsg = RemoveShuffle(shuffleId) Future.sequence( blockManagerInfo.values.map { bm => -bm.slaveEndpoint.ask[Boolean](removeMsg) +bm.slaveEndpoint.ask[Boolean](removeMsg).recover { + handleFailure("shuffle", shuffleId.toString, bm.blockManagerId, false) Review comment: either add the comment as in the previous call, or pass by name here for clarity. ## File path: core/src/main/scala/org/apache/spark/util/RpcUtils.scala ## @@ -54,6 +56,12 @@ private[spark] object RpcUtils { RpcTimeout(conf, Seq(RPC_LOOKUP_TIMEOUT.key, NETWORK_TIMEOUT.key), "120s") } + /** + * Infinite timeout is used internally, so there's no actual timeout property controls it. + * And timeout property should never be accessed since infinite means we never timeout. + * */ Review comment: nit: `* */`, we use `*/` more commonly in spark. ## File path: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ## @@ -93,6 +94,7 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE .set(MEMORY_STORAGE_FRACTION, 0.999) .set(Kryo.KRYO_SERIALIZER_BUFFER_SIZE.key, "1m") .set(STORAGE_UNROLL_MEMORY_THRESHOLD, 512L) + .set(Network.RPC_ASK_TIMEOUT, "5s") Review comment: Any particular reason why 5? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] manuzhang commented on pull request #28916: [SPARK-32083][SQL] Coalesce to one partition when all partitions are empty in AQE
manuzhang commented on pull request #28916: URL: https://github.com/apache/spark/pull/28916#issuecomment-651407512 @viirya @cloud-fan I've updated the PR description with an example. This is more of an improvement I propose for certain cases. Please let me know whether it makes sense. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] warrenzhu25 commented on pull request #28942: [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API
warrenzhu25 commented on pull request #28942: URL: https://github.com/apache/spark/pull/28942#issuecomment-651399906 > Hi @warrenzhu25 , thank you for your contribution. > This PR seems to add a new feature so could you add a testcase for it? > You can find tests for the status API in `UISeleniumSuite` and `HistoryServerSuite`. Added UT, but it seems doc build failed. It seems unrelated with my change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on pull request #28619: [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
holdenk commented on pull request #28619: URL: https://github.com/apache/spark/pull/28619#issuecomment-651388307 Took a quick look, thanks for working on this. I think having a timeout to kill the executors regardless (e.g. a max decommissioning time) and the speculation are both useful. I'll follow up more once we've decided on the design in OSS. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a change in pull request #28948: [SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options
gengliangwang commented on a change in pull request #28948: URL: https://github.com/apache/spark/pull/28948#discussion_r447272447 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ## @@ -248,12 +248,17 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper { * `HadoopFsRelation` node(s) as part of its logical plan. */ def recacheByPath(spark: SparkSession, resourcePath: String): Unit = { Review comment: how about we change the method as ``` def recacheByPath(spark: SparkSession, resourcePath: String, options: Map[String, String]=Map.empty) ``` so that we can avoid the new method below? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r447249716 ## File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ## @@ -242,8 +244,7 @@ private[spark] class BlockManager( private var blockReplicationPolicy: BlockReplicationPolicy = _ - private var blockManagerDecommissioning: Boolean = false - private var decommissionManager: Option[BlockManagerDecommissionManager] = None + @volatile private var decommissioner: Option[BlockManagerDecommissioner] = None Review comment: That's true. If we drop it we might also accept remove block puts after we've started decommissioning though. Depends on how much we want to avoid that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r447247802 ## File path: core/src/main/scala/org/apache/spark/storage/BlockId.scala ## @@ -40,6 +40,9 @@ sealed abstract class BlockId { def isRDD: Boolean = isInstanceOf[RDDBlockId] def isShuffle: Boolean = isInstanceOf[ShuffleBlockId] || isInstanceOf[ShuffleBlockBatchId] def isBroadcast: Boolean = isInstanceOf[BroadcastBlockId] + def isInternalShuffle: Boolean = { Review comment: Looking at it, not widely used I'll audit each use case and then decide. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r447247346 ## File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala ## @@ -148,6 +170,82 @@ private[spark] class IndexShuffleBlockResolver( } } + /** + * Write a provided shuffle block as a stream. Used for block migrations. + * ShuffleBlockBatchIds must contain the full range represented in the ShuffleIndexBlock. + * Requires the caller to delete any shuffle index blocks where the shuffle block fails to + * put. + */ + override def putShuffleBlockAsStream(blockId: BlockId, serializerManager: SerializerManager): + StreamCallbackWithID = { +val file = blockId match { + case ShuffleIndexBlockId(shuffleId, mapId, _) => +getIndexFile(shuffleId, mapId) + case ShuffleDataBlockId(shuffleId, mapId, _) => +getDataFile(shuffleId, mapId) + case _ => +throw new Exception(s"Unexpected shuffle block transfer ${blockId} as " + Review comment: sgtm This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] holdenk commented on a change in pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
holdenk commented on a change in pull request #28708: URL: https://github.com/apache/spark/pull/28708#discussion_r447247000 ## File path: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala ## @@ -55,6 +58,25 @@ private[spark] class IndexShuffleBlockResolver( def getDataFile(shuffleId: Int, mapId: Long): File = getDataFile(shuffleId, mapId, None) + /** + * Get the shuffle files that are stored locally. Used for block migrations. + */ + override def getStoredShuffles(): Set[ShuffleBlockInfo] = { +// Matches ShuffleIndexBlockId name +val pattern = "shuffle_(\\d+)_(\\d+)_.+\\.index".r +val rootDirs = blockManager.diskBlockManager.localDirs +// ExecutorDiskUtil puts things inside one level hashed sub directories +val searchDirs = rootDirs.flatMap(_.listFiles()).filter(_.isDirectory()) ++ rootDirs +val filenames = searchDirs.flatMap(_.list()) +logDebug(s"Got block files ${filenames.toList}") +filenames.flatMap { fname => + pattern.findAllIn(fname).matchData.map { +matched => ShuffleBlockInfo(matched.group(1).toInt, matched.group(2).toLong) + } +}.toSet Review comment: There shouldn't be any duplicates, but if there are we only need to transfer one anyways. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #27620: [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond maxFilesPerTrigger as unread files
HeartSaVioR commented on pull request #27620: URL: https://github.com/apache/spark/pull/27620#issuecomment-651313422 retest this, please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org