[GitHub] spark issue #23207: [SPARK-26193][SQL] Implement shuffle write metrics in SQ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23207

```
var writer: ShuffleWriter[Any, Any] = null
try {
  val manager = SparkEnv.get.shuffleManager
  writer = manager.getWriter[Any, Any](
    dep.shuffleHandle, partitionId, context, context.taskMetrics().shuffleWriteMetrics)
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  writer.stop(success = true).get
} catch {
  case e: Exception =>
    try {
      if (writer != null) {
        writer.stop(success = false)
      }
    } catch {
      case e: Exception =>
        log.debug("Could not stop writer", e)
    }
    throw e
}
```

Can we put the above in a closure and pass it into the shuffle dependency? Then in SQL we would just pass in the same closure, but with custom metrics. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
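A rough sketch of that suggestion; the closure type, names, and signatures below are illustrative assumptions, not actual Spark APIs:

```
import org.apache.spark.{SparkEnv, TaskContext}
import org.apache.spark.shuffle.{ShuffleHandle, ShuffleWriter}

object ShuffleWriteProcess {
  // Hypothetical: factor the write block into a serializable function carried by the
  // ShuffleDependency, so the SQL Exchange operator can substitute a variant whose
  // metrics reporter also updates its SQLMetrics.
  type WriteProcess = (TaskContext, Int, ShuffleHandle, Iterator[Product2[Any, Any]]) => Unit

  val default: WriteProcess = (context, partitionId, handle, records) => {
    var writer: ShuffleWriter[Any, Any] = null
    try {
      writer = SparkEnv.get.shuffleManager.getWriter[Any, Any](
        handle, partitionId, context, context.taskMetrics().shuffleWriteMetrics)
      writer.write(records)
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) writer.stop(success = false)
        } catch {
          case _: Exception => // best-effort cleanup; rethrow the original failure below
        }
        throw e
    }
  }
}
```

The SQL exchange could then construct its ShuffleDependency with a variant of this closure whose reporter wraps both `context.taskMetrics().shuffleWriteMetrics` and the operator's own SQLMetrics.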
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239308829 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -170,13 +172,23 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared val df = testData2.groupBy().agg(collect_set('a)) // 2 partitions testSparkPlanMetrics(df, 1, Map( 2L -> (("ObjectHashAggregate", Map("number of output rows" -> 2L))), + 1L -> (("Exchange", Map( +"shuffle records written" -> 2L, +"records read" -> 2L, +"local blocks fetched" -> 2L, --- End diff -- yea i'd just change the display text here, and not change the api --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239308706 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -95,3 +96,59 @@ private[spark] object SQLShuffleMetricsReporter { FETCH_WAIT_TIME -> SQLMetrics.createTimingMetric(sc, "fetch wait time"), RECORDS_READ -> SQLMetrics.createMetric(sc, "records read")) } + +/** + * A shuffle write metrics reporter for SQL exchange operators. Different with + * [[SQLShuffleReadMetricsReporter]], we need a function of (reporter => reporter) set in + * shuffle dependency, so the local SQLMetric should transient and create on executor. + * @param metrics Shuffle write metrics in current SparkPlan. + * @param metricsReporter Other reporter need to be updated in this SQLShuffleWriteMetricsReporter. + */ +private[spark] case class SQLShuffleWriteMetricsReporter( +metrics: Map[String, SQLMetric])(metricsReporter: ShuffleWriteMetricsReporter) + extends ShuffleWriteMetricsReporter with Serializable { + @transient private[this] lazy val _bytesWritten = +metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_BYTES_WRITTEN) + @transient private[this] lazy val _recordsWritten = +metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_RECORDS_WRITTEN) + @transient private[this] lazy val _writeTime = +metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_WRITE_TIME) + + override private[spark] def incBytesWritten(v: Long): Unit = { +metricsReporter.incBytesWritten(v) +_bytesWritten.add(v) + } + override private[spark] def decRecordsWritten(v: Long): Unit = { +metricsReporter.decBytesWritten(v) +_recordsWritten.set(_recordsWritten.value - v) + } + override private[spark] def incRecordsWritten(v: Long): Unit = { +metricsReporter.incRecordsWritten(v) +_recordsWritten.add(v) + } + override private[spark] def incWriteTime(v: Long): Unit = { +metricsReporter.incWriteTime(v) +_writeTime.add(v) + } + override private[spark] def decBytesWritten(v: Long): Unit = { +metricsReporter.decBytesWritten(v) +_bytesWritten.set(_bytesWritten.value - v) + } +} + +private[spark] object SQLShuffleWriteMetricsReporter { + val SHUFFLE_BYTES_WRITTEN = "shuffleBytesWritten" + val SHUFFLE_RECORDS_WRITTEN = "shuffleRecordsWritten" + val SHUFFLE_WRITE_TIME = "shuffleWriteTime" --- End diff -- yea i think we can just report ms level granularity. no point reporting ns (although we might want to measure based on ns) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239308197 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -95,3 +96,59 @@ private[spark] object SQLShuffleMetricsReporter { FETCH_WAIT_TIME -> SQLMetrics.createTimingMetric(sc, "fetch wait time"), RECORDS_READ -> SQLMetrics.createMetric(sc, "records read")) } + +/** + * A shuffle write metrics reporter for SQL exchange operators. Different with + * [[SQLShuffleReadMetricsReporter]], we need a function of (reporter => reporter) set in + * shuffle dependency, so the local SQLMetric should transient and create on executor. + * @param metrics Shuffle write metrics in current SparkPlan. + * @param metricsReporter Other reporter need to be updated in this SQLShuffleWriteMetricsReporter. + */ +private[spark] case class SQLShuffleWriteMetricsReporter( +metrics: Map[String, SQLMetric])(metricsReporter: ShuffleWriteMetricsReporter) --- End diff -- why are there two parameter list here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
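Presumably the second parameter list keeps the delegate reporter out of the generated case-class machinery (equals/hashCode/toString/copy/unapply only cover the first list), though the thread does not confirm that; a standalone illustration of the language behavior:

```
// Only the first parameter list of a case class participates in the generated
// equals/hashCode/toString/unapply; the curried second list is ignored.
object CurriedCaseClassDemo extends App {
  case class Wrapper(metrics: Map[String, Long])(delegate: AnyRef)

  val a = Wrapper(Map("bytes" -> 1L))(new Object)
  val b = Wrapper(Map("bytes" -> 1L))(new Object)
  assert(a == b)   // true: only `metrics` is compared
  println(a)       // prints Wrapper(Map(bytes -> 1)); the delegate does not appear
}
```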
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239308082 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -38,12 +38,18 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode override def outputPartitioning: Partitioning = SinglePartition override def executeCollect(): Array[InternalRow] = child.executeTake(limit) private val serializer: Serializer = new UnsafeRowSerializer(child.output.size) - override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext) + private val writeMetrics = SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext) --- End diff -- why is metrics lazy val and this one val? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r239308007 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -38,12 +38,18 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode override def outputPartitioning: Partitioning = SinglePartition override def executeCollect(): Array[InternalRow] = child.executeTake(limit) private val serializer: Serializer = new UnsafeRowSerializer(child.output.size) - override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext) + private val writeMetrics = SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext) + override lazy val metrics = --- End diff -- this is somewhat confusing. I'd create a variable for the read metrics so you can pass just that into the ShuffledRDD. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
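A sketch of the structure being suggested for CollectLimitExec; this is a class fragment rather than runnable code, and the read-side variable name and the ShuffledRowRDD argument are assumed for illustration:

```
private val writeMetrics =
  SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
private val readMetrics =
  SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)

// The node-level metrics map is just the union of the two...
override lazy val metrics: Map[String, SQLMetric] = readMetrics ++ writeMetrics

// ...while only the read-side map is handed to the shuffled RDD this node creates,
// e.g. new ShuffledRowRDD(shuffleDependency, readMetrics)
```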
[GitHub] spark issue #23207: [SPARK-26193][SQL] Implement shuffle write metrics in SQ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23207 @xuanyuanking can you separate the prs to rename read side metric and the write side change? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238845399 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -299,12 +312,25 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value") val df2 = (1 to 10).map(i => (i, i.toString)).toSeq.toDF("key", "value") // Assume the execution plan is - // ... -> ShuffledHashJoin(nodeId = 1) -> Project(nodeId = 0) + // Project(nodeId = 0) + // +- ShuffledHashJoin(nodeId = 1) + // :- Exchange(nodeId = 2) + // : +- Project(nodeId = 3) + // : +- LocalTableScan(nodeId = 4) + // +- Exchange(nodeId = 5) + // +- Project(nodeId = 6) + // +- LocalTableScan(nodeId = 7) val df = df1.join(df2, "key") testSparkPlanMetrics(df, 1, Map( 1L -> (("ShuffledHashJoin", Map( "number of output rows" -> 2L, - "avg hash probe (min, med, max)" -> "\n(1, 1, 1)" + "avg hash probe (min, med, max)" -> "\n(1, 1, 1)"))), +2L -> (("Exchange", Map( + "shuffle records written" -> 2L, + "records read" -> 2L))), --- End diff -- is this always going to be the same as "shuffle records written" ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238845029 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -170,13 +172,23 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared val df = testData2.groupBy().agg(collect_set('a)) // 2 partitions testSparkPlanMetrics(df, 1, Map( 2L -> (("ObjectHashAggregate", Map("number of output rows" -> 2L))), + 1L -> (("Exchange", Map( +"shuffle records written" -> 2L, +"records read" -> 2L, +"local blocks fetched" -> 2L, --- End diff -- i think we should be consistent and name these "read", rather than "fetch". --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238843017 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -163,6 +171,8 @@ object SQLMetrics { Utils.bytesToString } else if (metricsType == TIMING_METRIC) { Utils.msDurationToString + } else if (metricsType == NANO_TIMING_METRIC) { +duration => Utils.msDurationToString(duration / 10) --- End diff -- is this the right conversion from nanosecs to millisecs? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
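For reference, nanoseconds to milliseconds divides by 1,000,000, so the quoted `duration / 10` would overstate the time by five orders of magnitude:

```
import java.util.concurrent.TimeUnit

val durationNs = 3500000000L                  // 3.5 seconds measured with System.nanoTime
val durationMs = durationNs / 1000000         // 3500 ms: ns -> ms divides by 1e6, not by 10
assert(durationMs == TimeUnit.NANOSECONDS.toMillis(durationNs))
```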
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238842276 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -78,6 +78,7 @@ object SQLMetrics { private val SUM_METRIC = "sum" private val SIZE_METRIC = "size" private val TIMING_METRIC = "timing" + private val NANO_TIMING_METRIC = "nanosecond" --- End diff -- ns --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238837000 --- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala --- @@ -50,3 +50,57 @@ private[spark] trait ShuffleWriteMetricsReporter { private[spark] def decBytesWritten(v: Long): Unit private[spark] def decRecordsWritten(v: Long): Unit } + + +/** + * A proxy class of ShuffleWriteMetricsReporter which proxy all metrics updating to the input + * reporters. + */ +private[spark] class GroupedShuffleWriteMetricsReporter( +reporters: Seq[ShuffleWriteMetricsReporter]) extends ShuffleWriteMetricsReporter { + override private[spark] def incBytesWritten(v: Long): Unit = { +reporters.foreach(_.incBytesWritten(v)) + } + override private[spark] def decRecordsWritten(v: Long): Unit = { +reporters.foreach(_.decRecordsWritten(v)) + } + override private[spark] def incRecordsWritten(v: Long): Unit = { +reporters.foreach(_.incRecordsWritten(v)) + } + override private[spark] def incWriteTime(v: Long): Unit = { +reporters.foreach(_.incWriteTime(v)) + } + override private[spark] def decBytesWritten(v: Long): Unit = { +reporters.foreach(_.decBytesWritten(v)) + } +} + + +/** + * A proxy class of ShuffleReadMetricsReporter which proxy all metrics updating to the input + * reporters. + */ +private[spark] class GroupedShuffleReadMetricsReporter( --- End diff -- Again - I think your old approach is much better. No point creating a general util when there is only one implementation without any known future needs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23207#discussion_r238836448 --- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala --- @@ -50,3 +50,57 @@ private[spark] trait ShuffleWriteMetricsReporter { private[spark] def decBytesWritten(v: Long): Unit private[spark] def decRecordsWritten(v: Long): Unit } + + +/** + * A proxy class of ShuffleWriteMetricsReporter which proxy all metrics updating to the input + * reporters. + */ +private[spark] class GroupedShuffleWriteMetricsReporter( --- End diff -- I'd not create a general API here. Just put one in SQL similar to the read side that also calls the default one. It can be expensive to go through a seq for each record and bytes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 Basically, logically there are only two expressions: In, which handles arbitrary expressions, and InSet, which handles expressions with literals. Both could work: (1) we provide two separate expressions for InSet, one using switch, and one using hashset, or (2) we just provide one InSet and internally in InSet have two implementations ... The downside with creating different expressions for the same logical expression is that potentially the downstream optimization rules would need to match more.

On Mon, Dec 03, 2018 at 10:52 PM, DB Tsai < notificati...@github.com > wrote:
> @ rxin ( https://github.com/rxin ) switch in Java is still significantly faster than hash set
> even without boxing / unboxing problems when the number of elements are small. We were thinking
> about to have two implementations in InSet, and pick up switch if the number of elements are
> small, or otherwise pick up hash set one. But this is the same complexity as having two
> implements in In as this PR.
>
> @ cloud-fan ( https://github.com/cloud-fan ) do you suggest to create an OptimizeIn which has
> switch and hash set implementations based on the length of the elements and remove InSet?
> Basically, what we were thinking above.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
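A standalone sketch of option (2), where one logical expression keeps the literal list but evaluates it differently by size; this is plain Scala to show the idea, not Catalyst codegen, and the threshold would of course be tunable:

```
object InSwitchSketch {
  // For a short literal list, a match on constants compiles down to a JVM switch...
  def inSwitch(v: Int): Boolean = v match {
    case 1 | 3 | 7 => true
    case _ => false
  }

  // ...while for a long list the membership test falls back to a hash set.
  val largeSet: Set[Int] = (1 to 1000).toSet
  def inHashSet(v: Int): Boolean = largeSet.contains(v)
}
```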
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I thought InSwitch logically is the same as InSet, in which all the child expressions are literals?

On Mon, Dec 03, 2018 at 8:38 PM, Wenchen Fan < notificati...@github.com > wrote:
> I think InSet is not an optimized version of In, but just a way to separate the implementation
> for different conditions (the length of the list). Maybe we should do the same thing here,
> create a InSwitch and convert In to it when meeting some conditions. One problem is, In and
> InSwitch is same in the interpreted version, maybe we should create a base class for them.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 That probably means we should just optimize InSet to have the switch version though? Rather than do it in In?

On Mon, Dec 03, 2018 at 8:20 PM, Wenchen Fan < notificati...@github.com > wrote:
> @ rxin ( https://github.com/rxin ) I proposed the same thing before, but one problem is that, we
> only convert In to InSet when the length of list reaches the threshold. If the switch way is
> faster than hash set when the list is small, it seems still worth to optimize In using switch.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23171 I'm not a big fan of making the physical implementation of an expression very different depending on the situation. Why can't we just make InSet efficient and convert these cases to that? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23192: [SPARK-26241][SQL] Add queryId to IncrementalExecution
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23192 Thanks @HyukjinKwon. Fixed it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23193: [SPARK-26226][SQL] Track optimization phase for s...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23193 [SPARK-26226][SQL] Track optimization phase for streaming queries ## What changes were proposed in this pull request? In an earlier PR, we missed measuring the optimization phase time for streaming queries. This patch adds it. ## How was this patch tested? Given this is a debugging feature, and it is very convoluted to add tests to verify the phase is set properly, I am not introducing a streaming specific test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26226-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23193.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23193 commit 70c319bdaaac4fc4b8b988a96be6f976a63b41bf Author: Reynold Xin Date: 2018-12-01T04:33:21Z SPARK-26226 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23193: [SPARK-26226][SQL] Track optimization phase for streamin...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23193 cc @gatorsmile @jose-torres --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23192: [SPARK-26221][SQL] Add queryId to IncrementalExecution
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23192 cc @zsxwing @jose-torres --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23192: [SPARK-26221][SQL] Add queryId to IncrementalExec...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23192 [SPARK-26221][SQL] Add queryId to IncrementalExecution ## What changes were proposed in this pull request? This is a small change for better debugging: to pass query uuid in IncrementalExecution, when we look at the QueryExecution in isolation to trace back the query. ## How was this patch tested? N/A - just add some field for better debugging. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26241 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23192.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23192 commit c037f4d2fa2c2844ac992d976b492e14ab9bed11 Author: Reynold Xin Date: 2018-12-01T04:27:00Z [SPARK-26221] --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23183: [SPARK-26226][SQL] Update query tracker to report...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23183#discussion_r238019351 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala --- @@ -51,6 +58,18 @@ object QueryPlanningTracker { } } + /** + * Summary of a phase, with start time and end time so we can construct a timeline. + */ + class PhaseSummary(val startTimeMs: Long, val endTimeMs: Long) { + +def durationMs: Long = endTimeMs - startTimeMs + +override def toString: String = { + s"PhaseSummary($startTimeMs, $endTimeMs)" --- End diff -- so for actual debugging this is not needed right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23183: [SPARK-26226][SQL] Update query tracker to report timeli...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23183 cc @hvanhovell @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23183: [SPARK-26226][SQL] Update query tracker to report...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23183 [SPARK-26226][SQL] Update query tracker to report timeline for phases ## What changes were proposed in this pull request? This patch changes the query plan tracker added earlier to report phase timeline, rather than just a duration for each phase. This way, we can easily find time that's unaccounted for. ## How was this patch tested? Updated test cases to reflect that. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26226 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23183.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23183 commit d200be22afd83472c03a612a22e5b1fb4d4d80ab Author: Reynold Xin Date: 2018-11-29T23:00:49Z [SPARK-26226][SQL] Update query tracker to report timeline for phases --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23175: [SPARK-26142]followup: Move sql shuffle read metrics rel...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23175 LGTM - merged in master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23178: [SPARK-26216][SQL] Do not use case class as public API (...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23178 Good idea to have it sealed!

> On Nov 29, 2018, at 7:04 AM, Sean Owen wrote:
>
> @srowen commented on this pull request.
>
> In sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
>
>     if (inputTypes.isDefined) {
>       assert(inputTypes.get.length == nullableTypes.get.length)
>     }
>
>     +val inputsNullSafe = if (nullableTypes.isEmpty) {
>
> You can use getOrElse here and even inline this into the call below, but I don't really care.
>
> In sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
>
>     @@ -38,114 +38,108 @@ import org.apache.spark.sql.types.DataType
>      * @since 1.3.0
>      */
>     @Stable
>     -case class UserDefinedFunction protected[sql] (
>     -    f: AnyRef,
>     -    dataType: DataType,
>     -    inputTypes: Option[Seq[DataType]]) {
>     -
>     -  private var _nameOption: Option[String] = None
>     -  private var _nullable: Boolean = true
>     -  private var _deterministic: Boolean = true
>     -
>     -  // This is a `var` instead of in the constructor for backward compatibility of this case class.
>     -  // TODO: revisit this case class in Spark 3.0, and narrow down the public surface.
>     -  private[sql] var nullableTypes: Option[Seq[Boolean]] = None
>     +trait UserDefinedFunction {
>
> Should we make this sealed? I'm not sure. Would any user ever extend this meaningfully? I kind
> of worry someone will start doing so; maybe they already subclass it in some cases though.
> Elsewhere it might help the compiler understand in match statements that there is only ever one
> type of UDF class to match on.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
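For background on what sealing buys here, a standalone Scala illustration (the type names are made up, not the actual UserDefinedFunction hierarchy):

```
object SealedTraitDemo {
  // A sealed trait can only be extended in the file that declares it, so user code cannot
  // add new subtypes, and the compiler can check match exhaustiveness against the known ones.
  sealed trait UdfLike
  case class BasicUdf(name: Option[String]) extends UdfLike

  def describe(udf: UdfLike): String = udf match {
    case BasicUdf(name) => name.getOrElse("<anonymous>")
    // no other case is possible: the trait is sealed, so this match is exhaustive
  }
}
```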
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23128 @xuanyuanking @cloud-fan when you think about where to put each code block, make sure you also think about future evolution of the codebase. In general put relevant things closer to each other (e.g. in one class, one file, or one method). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r237129249 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -82,6 +82,14 @@ object SQLMetrics { private val baseForAvgMetric: Int = 10 + val REMOTE_BLOCKS_FETCHED = "remoteBlocksFetched" --- End diff -- rather than putting this list and the getShuffleReadMetrics function here, we should move it into SQLShuffleMetricsReporter. Otherwise in the future when one adds another metric, he/she is likely to forget to update SQLShuffleMetricsReporter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r237128247 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. This param should not empty and + * contains all shuffle metrics defined in [[SQLMetrics.getShuffleReadMetrics]]. + */ +private[spark] class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, --- End diff -- 4 space indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r237128189 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala --- @@ -194,4 +202,16 @@ object SQLMetrics { SparkListenerDriverAccumUpdates(executionId.toLong, metrics.map(m => m.id -> m.value))) } } + + /** + * Create all shuffle read relative metrics and return the Map. + */ + def getShuffleReadMetrics(sc: SparkContext): Map[String, SQLMetric] = Map( --- End diff -- I'd prefer to name this create, rather than get, to imply we are creating a new set rather than just returning some existing sets. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r236845375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -38,7 +38,7 @@ import org.apache.spark.sql.execution.datasources.jdbc._ import org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils -import org.apache.spark.sql.sources.v2.{BatchReadSupportProvider, DataSourceOptions, DataSourceV2} +import org.apache.spark.sql.sources.v2._ --- End diff -- I do think this one is too nitpicking. If this gets long it should be wildcard. Use an IDE for large reviews like this if needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23106: [SPARK-26141] Enable custom metrics implementation in sh...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23106 Merging in master. Thanks @squito. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r236492408 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; --- End diff -- Everything in catalyst is considered private (although public visibility for debugging) and it's best to stay that way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23106: [SPARK-26141] Enable custom metrics implementatio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23106#discussion_r236432889 --- Diff: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java --- @@ -242,8 +243,13 @@ private void writeSortedFile(boolean isLastFile) { // Note that we intentionally ignore the value of `writeMetricsToUse.shuffleWriteTime()`. // Consistent with ExternalSorter, we do not count this IO towards shuffle write time. // This means that this IO time is not accounted for anywhere; SPARK-3577 will fix this. - writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten()); - taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten()); + + // This is guaranteed to be a ShuffleWriteMetrics based on the if check in the beginning + // of this file. --- End diff -- ah yes. nice catch --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23147: [SPARK-26140] followup: rename ShuffleMetricsReporter
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23147 cc @gatorsmile @xuanyuanking @cloud-fan I misunderstood your comment. Finally saw it today when I was looking at my other PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23147: [SPARK-26140] followup: rename ShuffleMetricsRepo...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23147 [SPARK-26140] followup: rename ShuffleMetricsReporter ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/23105, due to working on two parallel PRs at once, I made the mistake of committing the copy of the PR that used the name ShuffleMetricsReporter for the interface, rather than the appropriate one ShuffleReadMetricsReporter. This patch fixes that. ## How was this patch tested? This should be fine as long as compilation passes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark ShuffleReadMetricsReporter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23147.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23147 commit 1d28d879572aa958b169acc5e1a48e52cced4c26 Author: Reynold Xin Date: 2018-11-26T18:56:18Z ShuffleReadMetricsReporter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23135: [SPARK-26168][SQL] Update the code comments in Ex...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23135#discussion_r236089467 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -575,6 +575,19 @@ case class Range( } } +/** + * This is a Group by operator with the aggregate functions and projections. + * + * @param groupingExpressions expressions for grouping keys + * @param aggregateExpressions expressions for a project list, which could contain + * [[org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction]]s. + * + * Note: Currently, aggregateExpressions correspond to both [[AggregateExpression]] and the output --- End diff -- It is not clear what "resultExpressions" means. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23131#discussion_r236052557 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1852,6 +1852,19 @@ class Dataset[T] private[sql]( CombineUnions(Union(logicalPlan, other.logicalPlan)) } + /** + * Returns a new Dataset containing union of rows in this Dataset and another Dataset. --- End diff -- say that this is an alias of union. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
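A sketch of the suggested doc wording; the scaladoc text here is illustrative, not the final wording:

```
/**
 * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
 * This is an alias for `union`.
 */
def unionAll(other: Dataset[T]): Dataset[T] = union(other)
```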
[GitHub] spark issue #23129: [MINOR] Update all DOI links to preferred resolver
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23129 Jenkins, test this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r236025838 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. + */ +class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, + metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics { + + override def incRemoteBlocksFetched(v: Long): Unit = { +metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v) --- End diff -- (I'm not referring to just this function, but in general, especially for per-row). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23128#discussion_r236025817 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.metric + +import org.apache.spark.executor.TempShuffleReadMetrics + +/** + * A shuffle metrics reporter for SQL exchange operators. + * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext. + * @param metrics All metrics in current SparkPlan. + */ +class SQLShuffleMetricsReporter( + tempMetrics: TempShuffleReadMetrics, + metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics { + + override def incRemoteBlocksFetched(v: Long): Unit = { +metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v) --- End diff -- Doing a hashmap lookup here could introduce serious performance regressions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
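One way to address that is to resolve each map lookup once into a field, so the per-record inc* calls avoid the hash lookup; a sketch only, with the forwarding to tempMetrics assumed from the constructor and just one metric shown:

```
class SQLShuffleMetricsReporter(
    tempMetrics: TempShuffleReadMetrics,
    metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {

  // Look up the SQLMetric once at construction time, not on every inc* call.
  private[this] val _remoteBlocksFetched = metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED)

  override def incRemoteBlocksFetched(v: Long): Unit = {
    _remoteBlocksFetched.add(v)
    tempMetrics.incRemoteBlocksFetched(v)
  }
  // ...the other metrics get the same treatment: a cached field plus a forwarding call.
}
```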
[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23105#discussion_r236020103 --- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.shuffle + +/** + * An interface for reporting shuffle read metrics, for each shuffle. This interface assumes + * all the methods are called on a single-threaded, i.e. concrete implementations would not need + * to synchronize. + * + * All methods have additional Spark visibility modifier to allow public, concrete implementations + * that still have these methods marked as private[spark]. + */ +private[spark] trait ShuffleReadMetricsReporter { --- End diff -- @xuanyuanking just submitted a PR on how to use it :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23105#discussion_r235950427 --- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala --- @@ -48,7 +48,8 @@ private[spark] trait ShuffleManager { handle: ShuffleHandle, startPartition: Int, endPartition: Int, - context: TaskContext): ShuffleReader[K, C] + context: TaskContext, + metrics: ShuffleMetricsReporter): ShuffleReader[K, C] --- End diff -- It is a read metrics here actually. In the write PR this is renamed ShuffleReadMetricsReporter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23110: [SPARK-26129] Followup - edge behavior for QueryPlanning...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23110 cc @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23110: [SPARK-26129] Followup - edge behavior for QueryP...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23110 [SPARK-26129] Followup - edge behavior for QueryPlanningTracker.topRulesByTime ## What changes were proposed in this pull request? This is an addendum patch for SPARK-26129 that defines the edge case behavior for QueryPlanningTracker.topRulesByTime. ## How was this patch tested? Added unit tests for each behavior. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26129-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23110.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23110 commit 683630ac3fbf054534e2589258793c9baaebfbf5 Author: Reynold Xin Date: 2018-11-21T22:25:09Z [SPARK-26129] --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
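The comment does not spell out the edge cases; one reasonable contract, sketched here as a guess (k <= 0 returns nothing, k larger than the number of recorded rules returns all of them), assuming a `rulesMap` of rule name to a summary object exposing `totalTimeNs`:

```
def topRulesByTime(k: Int): Seq[(String, RuleSummary)] = {
  if (k <= 0) {
    Seq.empty
  } else {
    // BoundedPriorityQueue keeps only the k largest entries under the given ordering.
    val byTime: Ordering[(String, RuleSummary)] = Ordering.by(_._2.totalTimeNs)
    val queue = new BoundedPriorityQueue[(String, RuleSummary)](k)(byTime)
    rulesMap.asScala.foreach(queue += _)
    queue.toSeq.sortBy(r => -r._2.totalTimeNs)
  }
}
```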
[GitHub] spark pull request #23106: [SPARK-26141] Enable custom shuffle metrics imple...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23106 [SPARK-26141] Enable custom shuffle metrics implementation in shuffle write ## What changes were proposed in this pull request? This is the write side counterpart to https://github.com/apache/spark/pull/23105 ## How was this patch tested? No behavior change expected, as it is a straightforward refactoring. Updated all existing test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26141 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23106 commit 115bd8bfa49674a2fcfa05517373146e90ec3bf7 Author: Reynold Xin Date: 2018-11-21T15:55:56Z [SPARK-26141] Enable custom shuffle metrics implementation in shuffle write --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23105: [SPARK-26140] Enable custom metrics implementation in sh...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23105 cc @jiangxb1987 @squito --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23096: [SPARK-26129][SQL] Instrumentation for per-query plannin...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23096 Merging this. Feel free to leave more comments. I'm hoping we can wire this into the UI eventually. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23105: [SPARK-26140] Enable passing in a custom shuffle ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23105#discussion_r235420647 --- Diff: core/src/main/scala/org/apache/spark/executor/ShuffleReadMetrics.scala --- @@ -122,34 +123,3 @@ class ShuffleReadMetrics private[spark] () extends Serializable { } } } - -/** - * A temporary shuffle read metrics holder that is used to collect shuffle read metrics for each - * shuffle dependency, and all temporary metrics will be merged into the [[ShuffleReadMetrics]] at - * last. - */ -private[spark] class TempShuffleReadMetrics { --- End diff -- this was moved to TempShuffleReadMetrics --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23105: [SPARK-26140] Pull TempShuffleReadMetrics creatio...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23105 [SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader ## What changes were proposed in this pull request? This patch defines an internal Spark interface for reporting shuffle metrics and uses that in shuffle reader. Before this patch, shuffle metrics is tied to a specific implementation (using a thread local temporary data structure and accumulators). After this patch, callers that define their own shuffle RDDs can create a custom metrics implementation. With this patch, we would be able to create a better metrics for the SQL layer, e.g. reporting shuffle metrics in the SQL UI, for each exchange operator. ## How was this patch tested? No behavior change expected, as it is a straightforward refactoring. Updated all existing test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26140 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23105 commit da253b57c14bc0174f0330ae6fa5d3a61647269b Author: Reynold Xin Date: 2018-11-21T14:56:23Z [SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23096#discussion_r235309483 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -648,7 +648,11 @@ class SparkSession private( * @since 2.0.0 */ def sql(sqlText: String): DataFrame = { -Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText)) +val tracker = new QueryPlanningTracker --- End diff -- I don't think it makes sense to add random flags for everything. If the argument is that this change has a decent chance of introducing regressions (e.g. due to higher memory usage, or cpu overhead), then it would make a lot of sense to put it behind a flag so it can be disabled in production if that happens. That said, the overhead on the hot code path here is substantially smaller than even transforming the simplest Catalyst plan (hash map look up is orders of magnitude cheaper than calling a partial function to transform a Scala collection for TreeNode), so I think the risk is so low that it does not warrant adding a config. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23100 Change of this type can really piss some people off. Was there consensus on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23096#discussion_r235182105 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala --- @@ -88,15 +101,20 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging { val startTime = System.nanoTime() val result = rule(plan) val runTime = System.nanoTime() - startTime +val effective = !result.fastEquals(plan) -if (!result.fastEquals(plan)) { +if (effective) { queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName) queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime) planChangeLogger.log(rule.ruleName, plan, result) } queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime) queryExecutionMetrics.incNumExecution(rule.ruleName) +if (tracker ne null) { --- End diff -- if one calls execute directly tracker would be null. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23096#discussion_r235162047 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala --- @@ -88,15 +92,18 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging { val startTime = System.nanoTime() val result = rule(plan) val runTime = System.nanoTime() - startTime +val effective = !result.fastEquals(plan) -if (!result.fastEquals(plan)) { +if (effective) { queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName) queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime) planChangeLogger.log(rule.ruleName, plan, result) } queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, runTime) queryExecutionMetrics.incNumExecution(rule.ruleName) +tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, effective)) --- End diff -- yes! (not great -- but I'd probably remove the global tracker at some point) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23096#discussion_r235161825 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -696,7 +701,7 @@ class Analyzer( s"avoid errors. Increase the value of ${SQLConf.MAX_NESTED_VIEW_DEPTH.key} to work " + "around this.") } - executeSameContext(child) + executeSameContext(child, None) --- End diff -- No great reason. I just used None for everything, except the top level, because it is very difficult to wire the tracker here without refactoring a lot of code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23096#discussion_r235161336 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst + +import scala.collection.JavaConverters._ + +import org.apache.spark.util.BoundedPriorityQueue + + +/** + * A simple utility for tracking runtime and associated stats in query planning. + * + * There are two separate concepts we track: + * + * 1. Phases: These are broad scope phases in query planning, as listed below, i.e. analysis, + * optimizationm and physical planning (just planning). + * + * 2. Rules: These are the individual Catalyst rules that we track. In addition to time, we also + * track the number of invocations and effective invocations. + */ +object QueryPlanningTracker { + + // Define a list of common phases here. + val PARSING = "parsing" --- End diff -- Mostly because Scala enum is not great, and I was thinking about making this a generic thing that's extensible. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23096: [SPARK-26129][SQL] Instrumentation for per-query plannin...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23096 cc @hvanhovell @gatorsmile This is different from the existing metrics for rules as it is query specific. We might want to replace that one with this in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for query plan...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/23096 [SPARK-26129][SQL] Instrumentation for query planning time ## What changes were proposed in this pull request? We currently don't have good visibility into query planning time (analysis vs optimization vs physical planning). This patch adds a simple utility to track the runtime of various rules and various planning phases. ## How was this patch tested? Added unit tests and end-to-end integration tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-26129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23096.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23096 commit b6a3d02f2c2b0eff71f92c3ede854edc3b5bf9f8 Author: Reynold Xin Date: 2018-11-20T16:22:35Z [SPARK-26129][SQL] Instrumentation for query planning time --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of non-struct ty...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234569150 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1594,6 +1594,15 @@ object SQLConf { "WHERE, which does not follow SQL standard.") .booleanConf .createWithDefault(false) + + val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY = +buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey") --- End diff -- Maybe aliasNonStructGroupingKeyAsValue, and default to true. Then we can remove this in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
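A sketch of how the suggested entry could read inside SQLConf, assuming the same buildConf helper as in the diff; the key, doc text, and default below reflect the suggestion in this comment, not necessarily what was merged.
```
val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY_AS_VALUE =
  buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKeyAsValue")
    .doc("When true, a grouping key of non-struct type keeps the pre-2.4 alias " +
      "('value'). Kept only as an escape hatch so it can be removed in a future release.")
    .booleanConf
    .createWithDefault(true)
```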
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23054 BTW what do the non-primitive types look like? Do they get flattened, or is there a struct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23054 We should add a "legacy" flag in case somebody's workload gets broken by this. We can remove the legacy flag in a future release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18784 Go for it. On Fri, Nov 16, 2018 at 6:08 AM Stavros Kontopoulos < notificati...@github.com> wrote: > @imaxxs <https://github.com/imaxxs> @rxin <https://github.com/rxin> I > think its a good time to remove this, I will update the PR if you are all > ok. > > -- > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/18784#issuecomment-439403392>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPI-PKZYYhazC7vTtoMHqJv9eA-xlks5uvsbegaJpZM4OoOmC> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23021 One thing - I would put "pandas" right after test_ so you get the natural logical grouping with sorting by file name. On Tue, Nov 13, 2018 at 4:58 PM Hyukjin Kwon wrote: > I am going to push after testing and double checking. The line counts > would look like this > > 54 ./test_utils.py > 199 ./test_catalog.py > 503 ./test_grouped_agg_pandas_udf.py > 45 ./test_group.py > 320 ./test_session.py > 153 ./test_readwriter.py > 806 ./test_scalar_pandas_udf.py > 216 ./test_pandas_udf.py > 566 ./test_streaming.py > 55 ./test_conf.py > 16 ./__init__.py > 530 ./test_grouped_map_pandas_udf.py > 157 ./test_column.py > 654 ./test_udf.py > 262 ./test_window_pandas_udf.py > 278 ./test_functions.py > 263 ./test_context.py > 138 ./test_serde.py > 170 ./test_datasources.py > 399 ./test_arrow.py > 96 ./test_appsubmit.py > 944 ./test_types.py > 737 ./test_dataframe.py > > -- > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/23021#issuecomment-438497006>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPOg1IR6S5Fc4qv2mrPWTsDRRxf1Qks5uu2qagaJpZM4YbTYj> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23021 Great initiative! I'd break the pandas udf one into smaller pieces too, as you suggested. We should also investigate why the runtime didn't improve ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22957 i didn't look at your new code, but is your old code safe? e.g. a project that depends on the new alias. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15899 Thanks for the example. I didn't even know that was possible in earlier versions. I just looked it up: looks like Scala 2.11 rewrites for comprehensions into map, filter, and flatMap. That said, I don't think it's a bad deal that this no longer works, given it was never intended to work and there's been a deprecation warning. I still maintain that it is risky to support this, because Scala users learn for comprehensions not just for a simple "for filter yield", but as a way to chain multiple generators together, which is not really well supported by Spark (even if it is, it's an easy way for users to shoot themselves in the foot, because it would be a cartesian product). Rather than faking it as a local collection, users should know an RDD is not one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
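To make the desugaring point concrete, here is a minimal local illustration on a plain Seq, where the rewrite is well defined; applying the same rewrite to RDDs is exactly what the removed support relied on.
```
val xs = Seq(1, 2, 3, 4)

// for (x <- xs if x % 2 == 0) yield x * 10
// desugars to:
val filtered = xs.withFilter(_ % 2 == 0).map(_ * 10)   // Seq(20, 40)

// for (x <- xs; y <- xs) yield (x, y)
// desugars to chained flatMap/map:
val pairs = xs.flatMap(x => xs.map(y => (x, y)))        // all pairs, i.e. a cartesian product

// On RDDs the second form would reference one RDD inside another RDD's closure,
// which Spark cannot execute; the closest legal equivalent is rdd1.cartesian(rdd2),
// which is the foot-gun described in the comment above.
```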
[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15899#discussion_r231390266 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -387,6 +387,14 @@ abstract class RDD[T: ClassTag]( preservesPartitioning = true) } + /** +* Return a new RDD containing only the elements that satisfy a predicate. --- End diff -- Why bother unless we have consensus to introduce this API? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22889 Yea good idea (prefer Array over Seq for short lists) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22921 seems good to me; might want to leave this open for a few days so more people can take a look --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230135473 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -62,17 +62,6 @@ class SQLContext private[sql](val sparkSession: SparkSession) sparkSession.sparkContext.assertNotStopped() - // Note: Since Spark 2.0 this class has become a wrapper of SparkSession, where the --- End diff -- keep these two lines? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22921#discussion_r230132632 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -639,20 +639,6 @@ private[spark] object SparkConf extends Logging { */ private val deprecatedConfigs: Map[String, DeprecatedConfig] = { val configs = Seq( - DeprecatedConfig("spark.cache.class", "0.8", --- End diff -- do we need to remove these? they are warnings for users if they set the wrong config right --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22830 Perhaps @jkbradley and @mengxr can comment on it. If the trait is inheritable, then protected still means it is part of the API contract. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22830 Who introduced this? We should ask the person that introduced it whether it can be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/22870 [SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? This patch removes the rangeBetween functions introduced in SPARK-21608. As explained in SPARK-25841, these functions are confusing and don't quite work. We will redesign them and introduce better ones in SPARK-25843. ## How was this patch tested? Removed relevant test cases as well. These test cases will need to be added back in SPARK-25843. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-25862 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22870.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22870 commit 00b0c6c746f4dbd3aa69071b99cf09bc5b53524a Author: Reynold Xin Date: 2018-10-28T21:11:16Z [SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22853: [SPARK-25845][SQL] Fix MatchError for calendar in...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22853#discussion_r228608016 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala --- @@ -267,6 +267,25 @@ class DataFrameWindowFramesSuite extends QueryTest with SharedSQLContext { ) } + test("range between should accept interval values as left boundary") { --- End diff -- this is using intervals for both, rather than just left. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22815#discussion_r228594291 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -54,6 +54,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager * @groupname Ungrouped Support functions for language integrated queries * @since 1.0.0 */ +@deprecated("Use SparkSession instead", "3.0.0") --- End diff -- Yea I wouldn't deprecate it now ... the data source API v1 still depends on it. Actually, now that I think about it, we should not deprecate SQLContext until dsv2 is stable. Otherwise we have a stable API dsv1 depending on a deprecated API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21588 Does this upgrade Hive for execution or also for metastore? Spark supports virtually all Hive metastore versions out there, and a lot of deployments do run different versions of Spark against the same old Hive metastore, and it'd be bad to break connectivity to old Hive metastores. The execution part is a different story and we can upgrade them easily. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
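For context on the metastore half of the question: Spark already separates the execution Hive from the metastore client version via configuration. A short example (the version value below is only illustrative):
```
import org.apache.spark.sql.SparkSession

// Point the metastore client at an older Hive metastore while keeping the
// built-in Hive for execution. "maven" tells Spark to download matching client jars.
val spark = SparkSession.builder()
  .appName("old-metastore-example")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
```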
[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22841#discussion_r228376622 --- Diff: python/pyspark/sql/window.py --- @@ -239,34 +212,27 @@ def rangeBetween(self, start, end): and "5" means the five off after the current row. We recommend users use ``Window.unboundedPreceding``, ``Window.unboundedFollowing``, -``Window.currentRow``, ``pyspark.sql.functions.unboundedPreceding``, -``pyspark.sql.functions.unboundedFollowing`` and ``pyspark.sql.functions.currentRow`` -to specify special boundary values, rather than using integral values directly. +and ``Window.currentRow`` to specify special boundary values, rather than using integral --- End diff -- what do you mean? the old rangeBetween API is still valid. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
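For reference, the long-valued rangeBetween with the special boundary constants (the form the doc recommends, and not the overload being deprecated) looks like this; df is an assumed DataFrame with dept and salary columns.
```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Running total per department, ordered by salary, from the start of the
// partition up to (and including) the current row's value.
val w = Window
  .partitionBy(col("dept"))
  .orderBy(col("salary"))
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningTotal = df.withColumn("running_total", sum(col("salary")).over(w))
```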
[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22775#discussion_r228372331 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -770,8 +776,17 @@ case class SchemaOfJson( factory } - override def convert(v: UTF8String): UTF8String = { -val dt = Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser => + @transient + private lazy val json = child.eval().asInstanceOf[UTF8String] --- End diff -- It's not weird that users want to use schema_of_json at all. Imagine it's a very large JSON value with a very complicated structure. It's pretty difficult to actually write the DDL string by hand. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
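A small example of the intended usage with a literal input; spark is an assumed SparkSession, and the JSON sample stands in for a much larger document whose DDL string would be tedious to write by hand.
```
import org.apache.spark.sql.functions.{lit, schema_of_json}

val sample = """{"id": 1, "tags": ["a", "b"], "nested": {"x": 0.5}}"""

// Evaluate schema_of_json once on the literal sample to obtain a DDL string...
val ddl = spark.range(1).select(schema_of_json(lit(sample))).head().getString(0)
// ...which can then be reused when parsing the real column with from_json.
```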
[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22775 I agree it should be a literal value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/22841 [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. ## How was this patch tested? Only deprecation and doc changes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-25842 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22841.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22841 commit 0a49c859049a376872053dcfaacba81d47070d77 Author: Reynold Xin Date: 2018-10-25T23:44:36Z [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22821 We seem to be splitting hairs here. Why are we providing tech preview to advanced users? Are you saying they construct expressions directly using internal APIs? I doubt that's tech preview. Users can construct a lot of invalid plans that lead to weird semantics or behaviors if they try, and this doesn't really make it worse. In either case it's not that difficult to remove them and add them back so I could see it going either way. On Thu, Oct 25, 2018 at 7:48 AM Dongjoon Hyun wrote: > I'm just confused here. Shall we finish the discussion on the email > thread? @cloud-fan <https://github.com/cloud-fan> and @gatorsmile > <https://github.com/gatorsmile> . If the decision is officially made like > that (providing tech. preview to advance users) in the email thread, I'm > okay with this. > > -- > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/22821#issuecomment-433080200>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPODR0wtirRLTLDDSb7agOF-U8fqLks5uoc8ogaJpZM4X5fHp> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22815 LGTM. On a related note, we should probably deprecate the entire SQLContext. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22144 @markhamstra how did you arrive at that conclusion? I said "it's not a new regression and also somewhat esoteric" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22144 It's certainly not a blocker since it's not a new regression and also somewhat esoteric. Would be good to fix though. On Tue, Oct 23, 2018 at 8:20 AM Wenchen Fan wrote: > This is not a PR that is ready to merge. We are likely talking about > delaying 2.4.0 for multiple weeks because of this issue. Is it really worth? > > I'm not sure what's the exact policy, let's ping more people. @rxin > <https://github.com/rxin> @srowen <https://github.com/srowen> @vanzin > <https://github.com/vanzin> @felixcheung <https://github.com/felixcheung> > @gatorsmile <https://github.com/gatorsmile> > > -- > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/22144#issuecomment-432287845>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPF8mJeJgi8oznKX1_RA-iXyRH9jJks5unzPKgaJpZM4WDFyL> > . > --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 But that would break both ipython notebooks and repl right? Pretty significant breaking change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22010 If this is not yet in 2.4 it shouldn't be merged now. On Wed, Oct 10, 2018 at 10:57 AM Holden Karau wrote: > Open question: is this suitable for branch-2.4 since it predates the > branch cut or not? (I know we've gone back and forth on how we do that). > > -- > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/22010#issuecomment-428492653>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AATvPO6Nlv4HOCVe9pPZfCd1GHXoVCDxks5ujbZlgaJpZM4Vw2BM> > . > -- -x --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 @superbobry which blog were you referring to? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 so this change would introduce a pretty big regression? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22543: [SPARK-23715][SQL][DOC] improve document for from...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/22543#discussion_r220410457 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1018,9 +1018,20 @@ case class TimeAdd(start: Expression, interval: Expression, timeZoneId: Option[S } /** - * Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders - * that time as a timestamp in the given time zone. For example, 'GMT+1' would yield - * '2017-07-14 03:40:00.0'. + * This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function + * takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and + * renders that timestamp as a timestamp in the given time zone. + * + * However, timestamp in Spark represents number of microseconds from the Unix epoch, which is not + * timezone-agnostic. So in Spark this function just shift the timestamp value from UTC timezone to + * the given timezone. + * + * This function may return confusing result if the input is a string with timezone, e.g. + * '2018-03-13T06:18:23+00:00'. The reason is that, Spark firstly cast the string to timestamp + * according to the timezone in the string, and finally display the result by converting the + * timestamp to string according to the session local timezone. + * + * We may remove this function in Spark 3.0. --- End diff -- should also update the sql doc? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
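For readers of the doc change, a short example of the behavior being documented; spark is an assumed SparkSession, and the rendered value ultimately depends on the session time zone, so treat the commented result as illustrative.
```
import org.apache.spark.sql.functions.{from_utc_timestamp, lit}

// Interpret the timezone-less timestamp as UTC and render it in the given zone;
// per the doc above, 'GMT+1' conceptually yields 2017-07-14 03:40:00.0.
val shifted = spark.range(1)
  .select(from_utc_timestamp(lit("2017-07-14 02:40:00.0"), "GMT+1").as("ts"))
```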
[GitHub] spark issue #22521: [SPARK-24519][CORE] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIG...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22521 seems like our tests are really flaky --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22521 yup; just did --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22541: [SPARK-23907][SQL] Revert regr_* functions entire...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/22541 [SPARK-23907][SQL] Revert regr_* functions entirely ## What changes were proposed in this pull request? This patch reverts entirely all the regr_* functions added in SPARK-23907. These were added by @mgaido91 (and proposed by @gatorsmile) to improve compatibility with other database systems, without any actual use cases. However, they are very rarely used, and in Spark there are much better ways to compute these functions, due to Spark's flexibility in exposing real programming APIs. I'm going through all the APIs added in Spark 2.4 and I think we should revert these. If there are strong enough demands and more use cases, we can add them back in the future pretty easily. ## How was this patch tested? Reverted test cases also. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-23907 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22541.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22541 commit 623e35f118e3d28a49eb84365079d037fa519186 Author: Reynold Xin Date: 2018-09-24T21:30:43Z [SPARK-23907][SQL] Revert regr_* functions entirely commit ef0f5b02dbce29d6fbaa6f79f9c2ad62e7a16bb0 Author: Reynold Xin Date: 2018-09-24T21:34:34Z i --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
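One concrete reading of "much better ways to compute these functions": the regr_* aggregates are simple combinations of aggregates Spark already has. A hedged sketch for slope and intercept follows (note the removed functions also drop rows where either side is null, which this sketch does not replicate exactly); df is an assumed DataFrame with numeric columns y and x.
```
import org.apache.spark.sql.functions.{avg, col, covar_pop, var_pop}

val stats = df.agg(
  covar_pop(col("y"), col("x")).as("cov"),
  var_pop(col("x")).as("varx"),
  avg(col("y")).as("avgy"),
  avg(col("x")).as("avgx"))

// regr_slope(y, x)     = covar_pop(y, x) / var_pop(x)
// regr_intercept(y, x) = avg(y) - regr_slope(y, x) * avg(x)
val result = stats.selectExpr(
  "cov / varx AS regr_slope",
  "avgy - (cov / varx) * avgx AS regr_intercept")
```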
[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22521 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HI...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/22521 [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once - WIP ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-24519 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22521.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22521 commit 77442cf7e4b64b745079a1ee62684503c7b8c123 Author: Reynold Xin Date: 2018-09-19T00:58:24Z [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once commit f23c2202fbec04983d1181d92f7c124280ebcbe3 Author: Reynold Xin Date: 2018-09-21T16:48:59Z Merge branch 'master' of github.com:apache/spark into SPARK-24519 commit ac3dee3227e4ceee4ec100bbe72988f791ae3c87 Author: Reynold Xin Date: 2018-09-21T16:49:52Z x commit f6f9658e19ae5e74697ee8846b6ab11ab8eba24c Author: Reynold Xin Date: 2018-09-21T17:31:51Z fix conflict --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/21527#discussion_r219559889 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus { private[spark] object MapStatus { def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = { -if (uncompressedSizes.length > 2000) { +if (uncompressedSizes.length > Option(SparkEnv.get) --- End diff -- the only tricky thing is how to write the test cases for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
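A hedged sketch of the "compute it once" direction together with a seam a test could use; the object name, placement, and test hook are illustrative, not the exact change that was merged.
```
package org.apache.spark.scheduler

import org.apache.spark.SparkEnv
import org.apache.spark.internal.config

private[spark] object HighlyCompressThresholdSketch {
  // Cached on first use so MapStatus.apply does not re-read the conf for every task.
  @volatile private var cached: Option[Int] = None

  def get: Int = {
    cached.getOrElse {
      val v = Option(SparkEnv.get)
        .map(_.conf.get(config.SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS))
        .getOrElse(2000)
      cached = Some(v)
      v
    }
  }

  // Test-only hook: a suite can inject a tiny threshold and reset it to None afterwards.
  private[spark] def setForTest(value: Option[Int]): Unit = { cached = value }
}
```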
[GitHub] spark pull request #22515: [SPARK-19724][SQL] allowCreatingManagedTableUsing...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/22515 [SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should have legacy prefix One more legacy config to go ... You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark allowCreatingManagedTableUsingNonemptyLocation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22515.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22515 commit f7c372e6f803c86e189e984fa6c1dd81f84454e9 Author: Reynold Xin Date: 2018-09-21T02:10:10Z [SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should have legacy prefix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...
Github user rxin closed the pull request at: https://github.com/apache/spark/pull/22456 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22509: [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema w...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/22509 cc @dongjoon-hyun @MaxGekk we still need this pr don't we? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org