[GitHub] spark issue #23207: [SPARK-26193][SQL] Implement shuffle write metrics in SQ...

2018-12-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23207
  
```scala
var writer: ShuffleWriter[Any, Any] = null
try {
  val manager = SparkEnv.get.shuffleManager
  writer = manager.getWriter[Any, Any](
    dep.shuffleHandle, partitionId, context,
    context.taskMetrics().shuffleWriteMetrics)
  writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
  writer.stop(success = true).get
} catch {
  case e: Exception =>
    try {
      if (writer != null) {
        writer.stop(success = false)
      }
    } catch {
      case e: Exception =>
        log.debug("Could not stop writer", e)
    }
    throw e
}
```

Can we put the above in a closure and pass it into the shuffle dependency? Then in SQL we would just pass in the same closure, wired up to its custom metrics.
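For illustration, a hedged sketch of the idea: parameterize the block above by the metrics reporter and let the caller (e.g. the SQL exchange operator) decide which reporter to plug in. The function shape below is illustrative, not the exact API being proposed:

```scala
// Hedged sketch only: the signature and where this hangs off the shuffle
// dependency are illustrative, not the API actually proposed in this PR.
def runShuffleWrite(
    context: TaskContext,
    partitionId: Int,
    dep: ShuffleDependency[Any, Any, Any],
    rdd: RDD[_],
    partition: Partition,
    // SQL would supply a reporter backed by SQLMetrics; the default caller would
    // pass context.taskMetrics().shuffleWriteMetrics.
    metricsReporter: ShuffleWriteMetricsReporter): Unit = {
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context, metricsReporter)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        // `log` comes from the surrounding class, as in the snippet above.
        if (writer != null) writer.stop(success = false)
      } catch {
        case e2: Exception => log.debug("Could not stop writer", e2)
      }
      throw e
  }
}
```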


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239308829
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -170,13 +172,23 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
 val df = testData2.groupBy().agg(collect_set('a)) // 2 partitions
 testSparkPlanMetrics(df, 1, Map(
   2L -> (("ObjectHashAggregate", Map("number of output rows" -> 2L))),
+  1L -> (("Exchange", Map(
+"shuffle records written" -> 2L,
+"records read" -> 2L,
+"local blocks fetched" -> 2L,
--- End diff --

Yeah, I'd just change the display text here and not change the API.


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239308706
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -95,3 +96,59 @@ private[spark] object SQLShuffleMetricsReporter {
 FETCH_WAIT_TIME -> SQLMetrics.createTimingMetric(sc, "fetch wait time"),
 RECORDS_READ -> SQLMetrics.createMetric(sc, "records read"))
 }
+
+/**
+ * A shuffle write metrics reporter for SQL exchange operators. Different from
+ * [[SQLShuffleReadMetricsReporter]], we need a function of (reporter => reporter) set in the
+ * shuffle dependency, so the local SQLMetric should be transient and created on the executor.
+ * @param metrics Shuffle write metrics in the current SparkPlan.
+ * @param metricsReporter Other reporter that needs to be updated in this SQLShuffleWriteMetricsReporter.
+ */
+private[spark] case class SQLShuffleWriteMetricsReporter(
+    metrics: Map[String, SQLMetric])(metricsReporter: ShuffleWriteMetricsReporter)
+  extends ShuffleWriteMetricsReporter with Serializable {
+  @transient private[this] lazy val _bytesWritten =
+    metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_BYTES_WRITTEN)
+  @transient private[this] lazy val _recordsWritten =
+    metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_RECORDS_WRITTEN)
+  @transient private[this] lazy val _writeTime =
+    metrics(SQLShuffleWriteMetricsReporter.SHUFFLE_WRITE_TIME)
+
+  override private[spark] def incBytesWritten(v: Long): Unit = {
+    metricsReporter.incBytesWritten(v)
+    _bytesWritten.add(v)
+  }
+  override private[spark] def decRecordsWritten(v: Long): Unit = {
+    metricsReporter.decBytesWritten(v)
+    _recordsWritten.set(_recordsWritten.value - v)
+  }
+  override private[spark] def incRecordsWritten(v: Long): Unit = {
+    metricsReporter.incRecordsWritten(v)
+    _recordsWritten.add(v)
+  }
+  override private[spark] def incWriteTime(v: Long): Unit = {
+    metricsReporter.incWriteTime(v)
+    _writeTime.add(v)
+  }
+  override private[spark] def decBytesWritten(v: Long): Unit = {
+    metricsReporter.decBytesWritten(v)
+    _bytesWritten.set(_bytesWritten.value - v)
+  }
+}
+
+private[spark] object SQLShuffleWriteMetricsReporter {
+  val SHUFFLE_BYTES_WRITTEN = "shuffleBytesWritten"
+  val SHUFFLE_RECORDS_WRITTEN = "shuffleRecordsWritten"
+  val SHUFFLE_WRITE_TIME = "shuffleWriteTime"
--- End diff --

Yeah, I think we can just report at ms-level granularity. There's no point reporting ns (although we might want to measure based on ns).


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239308197
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -95,3 +96,59 @@ private[spark] object SQLShuffleMetricsReporter {
 FETCH_WAIT_TIME -> SQLMetrics.createTimingMetric(sc, "fetch wait time"),
 RECORDS_READ -> SQLMetrics.createMetric(sc, "records read"))
 }
+
+/**
+ * A shuffle write metrics reporter for SQL exchange operators. Different from
+ * [[SQLShuffleReadMetricsReporter]], we need a function of (reporter => reporter) set in the
+ * shuffle dependency, so the local SQLMetric should be transient and created on the executor.
+ * @param metrics Shuffle write metrics in the current SparkPlan.
+ * @param metricsReporter Other reporter that needs to be updated in this SQLShuffleWriteMetricsReporter.
+ */
+private[spark] case class SQLShuffleWriteMetricsReporter(
+    metrics: Map[String, SQLMetric])(metricsReporter: ShuffleWriteMetricsReporter)
--- End diff --

Why are there two parameter lists here?
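(Background for the question, not from the PR itself: in Scala, only the first parameter list of a case class participates in the generated equals/hashCode, so a second list is sometimes used to keep a field out of equality. A small self-contained illustration:)

```scala
case class Reporter(metrics: Map[String, Long])(val parent: AnyRef)

object TwoParamListDemo extends App {
  val a = Reporter(Map("bytes" -> 1L))(parent = new Object)
  val b = Reporter(Map("bytes" -> 1L))(parent = new Object)
  // equals/hashCode are generated from the first parameter list only,
  // so the differing `parent` references do not break equality.
  println(a == b)  // prints: true
}
```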


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239308082
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,12 +38,18 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
   private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private val writeMetrics = SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
--- End diff --

Why is metrics a lazy val and this one a val?


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r239308007
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -38,12 +38,18 @@ case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode
   override def outputPartitioning: Partitioning = SinglePartition
   override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
   private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
-  override lazy val metrics = SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
+  private val writeMetrics = SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
+  override lazy val metrics =
--- End diff --

this is somewhat confusing. I'd create a variable for the read metrics so 
you can pass just that into the ShuffledRDD.
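A hedged sketch of what that could look like in CollectLimitExec (illustrative, not the exact code that ended up in the PR):

```scala
private lazy val writeMetrics =
  SQLShuffleWriteMetricsReporter.createShuffleWriteMetrics(sparkContext)
private lazy val readMetrics =
  SQLShuffleMetricsReporter.createShuffleReadMetrics(sparkContext)
override lazy val metrics = readMetrics ++ writeMetrics

// Only the read-side map is handed to the shuffled RDD; the write-side map is
// wired into the shuffle dependency separately.
```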


---




[GitHub] spark issue #23207: [SPARK-26193][SQL] Implement shuffle write metrics in SQ...

2018-12-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23207
  
@xuanyuanking can you split this into separate PRs: one to rename the read-side metrics and one for the write-side change?


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238845399
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -299,12 +312,25 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
   val df2 = (1 to 10).map(i => (i, i.toString)).toSeq.toDF("key", 
"value")
   // Assume the execution plan is
-  // ... -> ShuffledHashJoin(nodeId = 1) -> Project(nodeId = 0)
+  // Project(nodeId = 0)
+  // +- ShuffledHashJoin(nodeId = 1)
+  //    :- Exchange(nodeId = 2)
+  //    :  +- Project(nodeId = 3)
+  //    :     +- LocalTableScan(nodeId = 4)
+  //    +- Exchange(nodeId = 5)
+  //       +- Project(nodeId = 6)
+  //          +- LocalTableScan(nodeId = 7)
   val df = df1.join(df2, "key")
   testSparkPlanMetrics(df, 1, Map(
 1L -> (("ShuffledHashJoin", Map(
   "number of output rows" -> 2L,
-  "avg hash probe (min, med, max)" -> "\n(1, 1, 1)"
+  "avg hash probe (min, med, max)" -> "\n(1, 1, 1)"))),
+2L -> (("Exchange", Map(
+  "shuffle records written" -> 2L,
+  "records read" -> 2L))),
--- End diff --

Is this always going to be the same as "shuffle records written"?


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238845029
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -170,13 +172,23 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
 val df = testData2.groupBy().agg(collect_set('a)) // 2 partitions
 testSparkPlanMetrics(df, 1, Map(
   2L -> (("ObjectHashAggregate", Map("number of output rows" -> 2L))),
+  1L -> (("Exchange", Map(
+"shuffle records written" -> 2L,
+"records read" -> 2L,
+"local blocks fetched" -> 2L,
--- End diff --

I think we should be consistent and name these "read" rather than "fetch".



---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238843017
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala 
---
@@ -163,6 +171,8 @@ object SQLMetrics {
 Utils.bytesToString
   } else if (metricsType == TIMING_METRIC) {
 Utils.msDurationToString
+  } else if (metricsType == NANO_TIMING_METRIC) {
+duration => Utils.msDurationToString(duration / 10)
--- End diff --

Is this the right conversion from nanoseconds to milliseconds?
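(For reference, and assuming the stored value really is in nanoseconds: converting to milliseconds divides by 1,000,000, not 10. A minimal check:)

```scala
val durationNs = 3500000L               // 3.5 ms expressed in nanoseconds
val durationMs = durationNs / 1000000L  // 3, the unit Utils.msDurationToString expects
```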


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238842276
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala 
---
@@ -78,6 +78,7 @@ object SQLMetrics {
   private val SUM_METRIC = "sum"
   private val SIZE_METRIC = "size"
   private val TIMING_METRIC = "timing"
+  private val NANO_TIMING_METRIC = "nanosecond"
--- End diff --

ns


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238837000
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala ---
@@ -50,3 +50,57 @@ private[spark] trait ShuffleWriteMetricsReporter {
   private[spark] def decBytesWritten(v: Long): Unit
   private[spark] def decRecordsWritten(v: Long): Unit
 }
+
+
+/**
+ * A proxy class of ShuffleWriteMetricsReporter which proxy all metrics 
updating to the input
+ * reporters.
+ */
+private[spark] class GroupedShuffleWriteMetricsReporter(
+reporters: Seq[ShuffleWriteMetricsReporter]) extends 
ShuffleWriteMetricsReporter {
+  override private[spark] def incBytesWritten(v: Long): Unit = {
+reporters.foreach(_.incBytesWritten(v))
+  }
+  override private[spark] def decRecordsWritten(v: Long): Unit = {
+reporters.foreach(_.decRecordsWritten(v))
+  }
+  override private[spark] def incRecordsWritten(v: Long): Unit = {
+reporters.foreach(_.incRecordsWritten(v))
+  }
+  override private[spark] def incWriteTime(v: Long): Unit = {
+reporters.foreach(_.incWriteTime(v))
+  }
+  override private[spark] def decBytesWritten(v: Long): Unit = {
+reporters.foreach(_.decBytesWritten(v))
+  }
+}
+
+
+/**
+ * A proxy class of ShuffleReadMetricsReporter which proxy all metrics 
updating to the input
+ * reporters.
+ */
+private[spark] class GroupedShuffleReadMetricsReporter(
--- End diff --

Again, I think your old approach is much better. There's no point in creating a general util when there is only one implementation and no known future needs.


---




[GitHub] spark pull request #23207: [SPARK-26193][SQL] Implement shuffle write metric...

2018-12-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23207#discussion_r238836448
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala ---
@@ -50,3 +50,57 @@ private[spark] trait ShuffleWriteMetricsReporter {
   private[spark] def decBytesWritten(v: Long): Unit
   private[spark] def decRecordsWritten(v: Long): Unit
 }
+
+
+/**
+ * A proxy class of ShuffleWriteMetricsReporter which proxy all metrics 
updating to the input
+ * reporters.
+ */
+private[spark] class GroupedShuffleWriteMetricsReporter(
--- End diff --

I wouldn't create a general API here. Just put one in SQL, similar to the read side, that also calls the default one.

It can be expensive to go through a Seq for every record and byte update.


---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
Basically, logically there are only two expressions: In, which handles arbitrary expressions, and InSet, which handles the case where all the values are literals. Both options could work: (1) we provide two separate expressions for InSet, one using switch and one using a hash set, or (2) we provide a single InSet that internally has two implementations (a quick sketch of the two strategies is below) ...

The downside of creating different expressions for the same logical expression is that downstream optimization rules would potentially need to match more cases.
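(For readers following along, a minimal sketch of the two evaluation strategies being compared; the names are illustrative and not Spark's actual codegen:)

```scala
object InStrategies {
  // Strategy 1: a JVM switch over the literal values; the constant pattern
  // match compiles to a tableswitch/lookupswitch and avoids boxing.
  def inViaSwitch(v: Int): Boolean = v match {
    case 1 | 5 | 9 => true
    case _ => false
  }

  // Strategy 2: a hash-set lookup; O(1) per probe, but each probe boxes the
  // Int on an immutable Set[Int].
  private val values = Set(1, 5, 9)
  def inViaHashSet(v: Int): Boolean = values.contains(v)
}
```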

On Mon, Dec 03, 2018 at 10:52 PM, DB Tsai <notificati...@github.com> wrote:

> @rxin (https://github.com/rxin) switch in Java is still significantly
> faster than hash set even without boxing / unboxing problems when the
> number of elements is small. We were thinking about having two
> implementations in InSet, and picking switch if the number of elements is
> small, or otherwise picking the hash set one. But this is the same complexity
> as having two implementations in In as in this PR.
>
> @cloud-fan (https://github.com/cloud-fan) do you suggest creating an OptimizeIn
> which has switch and hash set implementations based on the length of the
> elements and removing InSet? Basically, what we were thinking above.


---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
I thought InSwitch is logically the same as InSet, in which all the child expressions are literals?

On Mon, Dec 03, 2018 at 8:38 PM, Wenchen Fan <notificati...@github.com> wrote:

> I think InSet is not an optimized version of In, but just a way to
> separate the implementation for different conditions (the length of the
> list). Maybe we should do the same thing here: create an InSwitch and
> convert In to it when meeting some conditions. One problem is, In and InSwitch
> are the same in the interpreted version; maybe we should create a base class
> for them.


---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
That probably means we should just optimize InSet to have the switch 
version though? Rather than do it in In?

On Mon, Dec 03, 2018 at 8:20 PM, Wenchen Fan <notificati...@github.com> wrote:

> @rxin (https://github.com/rxin) I proposed the same thing before, but
> one problem is that we only convert In to InSet when the length of the list
> reaches the threshold. If the switch way is faster than the hash set when the
> list is small, it still seems worth optimizing In using switch.


---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23171
  
I'm not a big fan of making the physical implementation of an expression 
very different depending on the situation. Why can't we just make InSet 
efficient and convert these cases to that?



---




[GitHub] spark issue #23192: [SPARK-26241][SQL] Add queryId to IncrementalExecution

2018-12-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23192
  
Thanks @HyukjinKwon. Fixed it.



---




[GitHub] spark pull request #23193: [SPARK-26226][SQL] Track optimization phase for s...

2018-11-30 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23193

[SPARK-26226][SQL] Track optimization phase for streaming queries

## What changes were proposed in this pull request?
In an earlier PR, we missed measuring the optimization phase time for 
streaming queries. This patch adds it.

## How was this patch tested?
Given this is a debugging feature, and it is very convoluted to add tests 
to verify the phase is set properly, I am not introducing a streaming specific 
test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26226-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23193.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23193


commit 70c319bdaaac4fc4b8b988a96be6f976a63b41bf
Author: Reynold Xin 
Date:   2018-12-01T04:33:21Z

SPARK-26226




---




[GitHub] spark issue #23193: [SPARK-26226][SQL] Track optimization phase for streamin...

2018-11-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23193
  
cc @gatorsmile @jose-torres 


---




[GitHub] spark issue #23192: [SPARK-26221][SQL] Add queryId to IncrementalExecution

2018-11-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23192
  
cc @zsxwing @jose-torres 


---




[GitHub] spark pull request #23192: [SPARK-26221][SQL] Add queryId to IncrementalExec...

2018-11-30 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23192

[SPARK-26221][SQL] Add queryId to IncrementalExecution

## What changes were proposed in this pull request?
This is a small change for better debugging: pass the query uuid into IncrementalExecution so that, when we look at a QueryExecution in isolation, we can trace it back to the originating query.

## How was this patch tested?
N/A - just add some field for better debugging.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26241

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23192


commit c037f4d2fa2c2844ac992d976b492e14ab9bed11
Author: Reynold Xin 
Date:   2018-12-01T04:27:00Z

[SPARK-26221]




---




[GitHub] spark pull request #23183: [SPARK-26226][SQL] Update query tracker to report...

2018-11-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23183#discussion_r238019351
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
 ---
@@ -51,6 +58,18 @@ object QueryPlanningTracker {
 }
   }
 
+  /**
+   * Summary of a phase, with start time and end time so we can construct 
a timeline.
+   */
+  class PhaseSummary(val startTimeMs: Long, val endTimeMs: Long) {
+
+def durationMs: Long = endTimeMs - startTimeMs
+
+override def toString: String = {
+  s"PhaseSummary($startTimeMs, $endTimeMs)"
--- End diff --

So for actual debugging this is not needed, right?


---




[GitHub] spark issue #23183: [SPARK-26226][SQL] Update query tracker to report timeli...

2018-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23183
  
cc @hvanhovell @gatorsmile 


---




[GitHub] spark pull request #23183: [SPARK-26226][SQL] Update query tracker to report...

2018-11-29 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23183

[SPARK-26226][SQL] Update query tracker to report timeline for phases

## What changes were proposed in this pull request?
This patch changes the query plan tracker added earlier to report a timeline (start and end time) for each phase, rather than just a duration. This way, we can easily find time that's unaccounted for.

## How was this patch tested?
Updated test cases to reflect that.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26226

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23183


commit d200be22afd83472c03a612a22e5b1fb4d4d80ab
Author: Reynold Xin 
Date:   2018-11-29T23:00:49Z

[SPARK-26226][SQL] Update query tracker to report timeline for phases




---




[GitHub] spark issue #23175: [SPARK-26142]followup: Move sql shuffle read metrics rel...

2018-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23175
  
LGTM - merged in master.




---




[GitHub] spark issue #23178: [SPARK-26216][SQL] Do not use case class as public API (...

2018-11-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23178
  
Good idea to have it sealed!

> On Nov 29, 2018, at 7:04 AM, Sean Owen  wrote:
> 
> @srowen commented on this pull request.
> 
> In sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
> 
> >  if (inputTypes.isDefined) {
> >    assert(inputTypes.get.length == nullableTypes.get.length)
> >  }
> > 
> > +  val inputsNullSafe = if (nullableTypes.isEmpty) {
> You can use getOrElse here and even inline this into the call below, but I don't really care.
> 
> In sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala:
> 
> > @@ -38,114 +38,108 @@ import org.apache.spark.sql.types.DataType
>   * @since 1.3.0
>   */
>  @Stable
> -case class UserDefinedFunction protected[sql] (
> -    f: AnyRef,
> -    dataType: DataType,
> -    inputTypes: Option[Seq[DataType]]) {
> -
> -  private var _nameOption: Option[String] = None
> -  private var _nullable: Boolean = true
> -  private var _deterministic: Boolean = true
> -
> -  // This is a `var` instead of in the constructor for backward compatibility of this case class.
> -  // TODO: revisit this case class in Spark 3.0, and narrow down the public surface.
> -  private[sql] var nullableTypes: Option[Seq[Boolean]] = None
> +trait UserDefinedFunction {
> Should we make this sealed? I'm not sure. Would any user ever extend this
> meaningfully? I kind of worry someone will start doing so; maybe they already
> subclass it in some cases though. Elsewhere it might help the compiler
> understand in match statements that there is only ever one type of UDF class to
> match on.
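(A tiny illustration, independent of this PR, of what sealing buys: the compiler knows every subtype, so matches can be checked for exhaustiveness and code outside the defining file cannot add new implementations. The member list below is a simplified stand-in, not the real UserDefinedFunction API.)

```scala
sealed trait UserDefinedFunction {
  def nullable: Boolean
}

// The single concrete implementation; because the trait is sealed, no other file can add one.
case class SparkUserDefinedFunction(nullable: Boolean) extends UserDefinedFunction

object SealedUdfDemo {
  def describe(udf: UserDefinedFunction): String = udf match {
    // The compiler can verify this match covers every subtype of the sealed trait.
    case SparkUserDefinedFunction(n) => s"UDF (nullable = $n)"
  }
}
```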
> 



---




[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23128
  
@xuanyuanking @cloud-fan when you think about where to put each code block, make sure you also think about the future evolution of the codebase. In general, put related things close to each other (e.g. in one class, one file, or one method).



---




[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...

2018-11-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r237129249
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala 
---
@@ -82,6 +82,14 @@ object SQLMetrics {
 
   private val baseForAvgMetric: Int = 10
 
+  val REMOTE_BLOCKS_FETCHED = "remoteBlocksFetched"
--- End diff --

Rather than putting this list and the getShuffleReadMetrics function here, we should move them into SQLShuffleMetricsReporter. Otherwise, when someone adds another metric in the future, they are likely to forget to update SQLShuffleMetricsReporter.
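A hedged sketch of the suggested layout (the constant names follow the surrounding diff, and the metric set is abridged): keeping the keys and the factory together in the reporter's companion object makes it hard to add a metric in one place and forget the other.

```scala
import org.apache.spark.SparkContext

private[spark] object SQLShuffleMetricsReporter {
  val REMOTE_BLOCKS_FETCHED = "remoteBlocksFetched"
  val LOCAL_BLOCKS_FETCHED = "localBlocksFetched"
  val RECORDS_READ = "recordsRead"

  // Creating the SQLMetrics right next to the key constants keeps the two in sync.
  def createShuffleReadMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
    REMOTE_BLOCKS_FETCHED -> SQLMetrics.createMetric(sc, "remote blocks fetched"),
    LOCAL_BLOCKS_FETCHED -> SQLMetrics.createMetric(sc, "local blocks fetched"),
    RECORDS_READ -> SQLMetrics.createMetric(sc, "records read"))
}
```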



---




[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...

2018-11-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r237128247
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan. This param should not 
empty and
+ *   contains all shuffle metrics defined in 
[[SQLMetrics.getShuffleReadMetrics]].
+ */
+private[spark] class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
--- End diff --

4 space indent


---




[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...

2018-11-28 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r237128189
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala 
---
@@ -194,4 +202,16 @@ object SQLMetrics {
 SparkListenerDriverAccumUpdates(executionId.toLong, metrics.map(m 
=> m.id -> m.value)))
 }
   }
+
+  /**
+   * Create all shuffle read relative metrics and return the Map.
+   */
+  def getShuffleReadMetrics(sc: SparkContext): Map[String, SQLMetric] = 
Map(
--- End diff --

I'd prefer to name this create rather than get, to imply we are creating a new set rather than just returning some existing set.


---




[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...

2018-11-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23086#discussion_r236845375
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -38,7 +38,7 @@ import org.apache.spark.sql.execution.datasources.jdbc._
 import 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource
 import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
 import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils
-import org.apache.spark.sql.sources.v2.{BatchReadSupportProvider, 
DataSourceOptions, DataSourceV2}
+import org.apache.spark.sql.sources.v2._
--- End diff --

I do think this one is too nitpicky. If this gets long, it should be a wildcard import. Use an IDE for large reviews like this if needed.


---




[GitHub] spark issue #23106: [SPARK-26141] Enable custom metrics implementation in sh...

2018-11-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23106
  
Merging in master. Thanks @squito.



---




[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...

2018-11-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23086#discussion_r236492408
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/Table.java 
---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
--- End diff --

Everything in catalyst is considered private (although it has public visibility for debugging), and it's best to keep it that way.



---




[GitHub] spark pull request #23106: [SPARK-26141] Enable custom metrics implementatio...

2018-11-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23106#discussion_r236432889
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -242,8 +243,13 @@ private void writeSortedFile(boolean isLastFile) {
   // Note that we intentionally ignore the value of 
`writeMetricsToUse.shuffleWriteTime()`.
   // Consistent with ExternalSorter, we do not count this IO towards 
shuffle write time.
   // This means that this IO time is not accounted for anywhere; 
SPARK-3577 will fix this.
-  writeMetrics.incRecordsWritten(writeMetricsToUse.recordsWritten());
-  
taskContext.taskMetrics().incDiskBytesSpilled(writeMetricsToUse.bytesWritten());
+
+  // This is guaranteed to be a ShuffleWriteMetrics based on the if 
check in the beginning
+  // of this file.
--- End diff --

ah yes. nice catch


---




[GitHub] spark issue #23147: [SPARK-26140] followup: rename ShuffleMetricsReporter

2018-11-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23147
  
cc @gatorsmile @xuanyuanking 

@cloud-fan I misunderstood your comment. Finally saw it today when I was 
looking at my other PR.



---




[GitHub] spark pull request #23147: [SPARK-26140] followup: rename ShuffleMetricsRepo...

2018-11-26 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23147

[SPARK-26140] followup: rename ShuffleMetricsReporter

## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/23105, due to working on two parallel PRs at once, I made the mistake of committing the copy of the PR that used the name ShuffleMetricsReporter for the interface, rather than the appropriate name, ShuffleReadMetricsReporter. This patch fixes that.

## How was this patch tested?
This should be fine as long as compilation passes.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark ShuffleReadMetricsReporter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23147


commit 1d28d879572aa958b169acc5e1a48e52cced4c26
Author: Reynold Xin 
Date:   2018-11-26T18:56:18Z

ShuffleReadMetricsReporter




---




[GitHub] spark pull request #23135: [SPARK-26168][SQL] Update the code comments in Ex...

2018-11-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23135#discussion_r236089467
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -575,6 +575,19 @@ case class Range(
   }
 }
 
+/**
+ * This is a Group by operator with the aggregate functions and 
projections.
+ *
+ * @param groupingExpressions expressions for grouping keys
+ * @param aggregateExpressions expressions for a project list, which could 
contain
+ *   
[[org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction]]s.
+ *
+ * Note: Currently, aggregateExpressions correspond to both 
[[AggregateExpression]] and the output
--- End diff --

It is not clear what “resultExpressions” means.


---




[GitHub] spark pull request #23131: [SPARK-25908][SQL][FOLLOW-UP] Add back unionAll

2018-11-24 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23131#discussion_r236052557
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1852,6 +1852,19 @@ class Dataset[T] private[sql](
 CombineUnions(Union(logicalPlan, other.logicalPlan))
   }
 
+  /**
+   * Returns a new Dataset containing union of rows in this Dataset and 
another Dataset.
--- End diff --

Say that this is an alias of union.
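A hedged sketch of the kind of doc wording being asked for (not necessarily the exact text that was merged):

```scala
/**
 * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
 * This is an alias of `union`.
 */
def unionAll(other: Dataset[T]): Dataset[T] = union(other)
```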



---




[GitHub] spark issue #23129: [MINOR] Update all DOI links to preferred resolver

2018-11-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23129
  
Jenkins, test this please.



---




[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...

2018-11-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r236025838
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan.
+ */
+class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
+  metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {
+
+  override def incRemoteBlocksFetched(v: Long): Unit = {
+metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v)
--- End diff --

(I’m not referring to just this function, but to these in general, especially the per-row updates.)


---




[GitHub] spark pull request #23128: [SPARK-26142][SQL] Support passing shuffle metric...

2018-11-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23128#discussion_r236025817
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLShuffleMetricsReporter.scala
 ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.metric
+
+import org.apache.spark.executor.TempShuffleReadMetrics
+
+/**
+ * A shuffle metrics reporter for SQL exchange operators.
+ * @param tempMetrics [[TempShuffleReadMetrics]] created in TaskContext.
+ * @param metrics All metrics in current SparkPlan.
+ */
+class SQLShuffleMetricsReporter(
+  tempMetrics: TempShuffleReadMetrics,
+  metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {
+
+  override def incRemoteBlocksFetched(v: Long): Unit = {
+metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED).add(v)
--- End diff --

Doing a hashmap lookup here could introduce serious performance regressions.
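One common way to avoid the per-call map lookup is to resolve each SQLMetric once into a field and update that directly; a minimal sketch (field names are illustrative, and this is not necessarily what the PR ended up doing):

```scala
class SQLShuffleMetricsReporter(
    tempMetrics: TempShuffleReadMetrics,
    metrics: Map[String, SQLMetric]) extends TempShuffleReadMetrics {

  // Look the metric up once at construction time instead of on every update.
  private[this] val _remoteBlocksFetched = metrics(SQLMetrics.REMOTE_BLOCKS_FETCHED)

  override def incRemoteBlocksFetched(v: Long): Unit = {
    _remoteBlocksFetched.add(v)
    tempMetrics.incRemoteBlocksFetched(v)
  }
}
```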


---




[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...

2018-11-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23105#discussion_r236020103
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/metrics.scala ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle
+
+/**
+ * An interface for reporting shuffle read metrics, for each shuffle. This 
interface assumes
+ * all the methods are called on a single-threaded, i.e. concrete 
implementations would not need
+ * to synchronize.
+ *
+ * All methods have additional Spark visibility modifier to allow public, 
concrete implementations
+ * that still have these methods marked as private[spark].
+ */
+private[spark] trait ShuffleReadMetricsReporter {
--- End diff --

@xuanyuanking just submitted a PR on how to use it :)


---




[GitHub] spark pull request #23105: [SPARK-26140] Enable custom metrics implementatio...

2018-11-23 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23105#discussion_r235950427
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala 
---
@@ -48,7 +48,8 @@ private[spark] trait ShuffleManager {
   handle: ShuffleHandle,
   startPartition: Int,
   endPartition: Int,
-  context: TaskContext): ShuffleReader[K, C]
+  context: TaskContext,
+  metrics: ShuffleMetricsReporter): ShuffleReader[K, C]
--- End diff --

It is a read-metrics reporter here, actually. In the write PR this is renamed to ShuffleReadMetricsReporter.


---




[GitHub] spark issue #23110: [SPARK-26129] Followup - edge behavior for QueryPlanning...

2018-11-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23110
  
cc @gatorsmile 


---




[GitHub] spark pull request #23110: [SPARK-26129] Followup - edge behavior for QueryP...

2018-11-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23110

[SPARK-26129] Followup - edge behavior for 
QueryPlanningTracker.topRulesByTime

## What changes were proposed in this pull request?
This is an addendum patch for SPARK-26129 that defines the edge case 
behavior for QueryPlanningTracker.topRulesByTime.

## How was this patch tested?
Added unit tests for each behavior.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26129-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23110


commit 683630ac3fbf054534e2589258793c9baaebfbf5
Author: Reynold Xin 
Date:   2018-11-21T22:25:09Z

[SPARK-26129]




---




[GitHub] spark pull request #23106: [SPARK-26141] Enable custom shuffle metrics imple...

2018-11-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23106

[SPARK-26141] Enable custom shuffle metrics implementation in shuffle write

## What changes were proposed in this pull request?
This is the write side counterpart to 
https://github.com/apache/spark/pull/23105

## How was this patch tested?
No behavior change expected, as it is a straightforward refactoring. 
Updated all existing test cases.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26141

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23106


commit 115bd8bfa49674a2fcfa05517373146e90ec3bf7
Author: Reynold Xin 
Date:   2018-11-21T15:55:56Z

[SPARK-26141] Enable custom shuffle metrics implementation in shuffle write




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23105: [SPARK-26140] Enable custom metrics implementation in sh...

2018-11-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23105
  
cc @jiangxb1987 @squito 


---




[GitHub] spark issue #23096: [SPARK-26129][SQL] Instrumentation for per-query plannin...

2018-11-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23096
  
Merging this. Feel free to leave more comments. I'm hoping we can wire this 
into the UI eventually.



---




[GitHub] spark pull request #23105: [SPARK-26140] Enable passing in a custom shuffle ...

2018-11-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23105#discussion_r235420647
  
--- Diff: 
core/src/main/scala/org/apache/spark/executor/ShuffleReadMetrics.scala ---
@@ -122,34 +123,3 @@ class ShuffleReadMetrics private[spark] () extends 
Serializable {
 }
   }
 }
-
-/**
- * A temporary shuffle read metrics holder that is used to collect shuffle 
read metrics for each
- * shuffle dependency, and all temporary metrics will be merged into the 
[[ShuffleReadMetrics]] at
- * last.
- */
-private[spark] class TempShuffleReadMetrics {
--- End diff --

this was moved to TempShuffleReadMetrics


---




[GitHub] spark pull request #23105: [SPARK-26140] Pull TempShuffleReadMetrics creatio...

2018-11-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23105

[SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader

## What changes were proposed in this pull request?
This patch defines an internal Spark interface for reporting shuffle metrics and uses it in the shuffle reader. Before this patch, shuffle metrics were tied to a specific implementation (using a thread-local temporary data structure and accumulators). After this patch, callers that define their own shuffle RDDs can provide a custom metrics implementation.

With this patch, we would be able to build better metrics for the SQL layer, e.g. reporting shuffle metrics in the SQL UI for each exchange operator.
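For context, the rough shape of such an interface (abridged; the method names follow the diffs quoted elsewhere in this thread, but treat this as a sketch rather than the exact final API):

```scala
private[spark] trait ShuffleReadMetricsReporter {
  private[spark] def incRemoteBlocksFetched(v: Long): Unit
  private[spark] def incLocalBlocksFetched(v: Long): Unit
  private[spark] def incFetchWaitTime(v: Long): Unit
  private[spark] def incRecordsRead(v: Long): Unit
  // ... plus the remaining byte/time counters.
}
```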

## How was this patch tested?
No behavior change expected, as it is a straightforward refactoring. 
Updated all existing test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26140

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23105


commit da253b57c14bc0174f0330ae6fa5d3a61647269b
Author: Reynold Xin 
Date:   2018-11-21T14:56:23Z

[SPARK-26140] Pull TempShuffleReadMetrics creation out of shuffle reader




---




[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...

2018-11-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23096#discussion_r235309483
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -648,7 +648,11 @@ class SparkSession private(
* @since 2.0.0
*/
   def sql(sqlText: String): DataFrame = {
-Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
+val tracker = new QueryPlanningTracker
--- End diff --

I don't think it makes sense to add random flags for everything. If the argument is that this change has a decent chance of introducing regressions (e.g. due to higher memory usage or CPU overhead), then it would make a lot of sense to put it behind a flag so it can be disabled in production if that happens.

That said, the overhead on the hot code path here is substantially smaller than even transforming the simplest Catalyst plan (a hash map lookup is orders of magnitude cheaper than calling a partial function to transform a Scala collection for TreeNode), so I think the risk is so low that it does not warrant adding a config.


---




[GitHub] spark issue #23100: [WIP][SPARK-26133][ML] Remove deprecated OneHotEncoder a...

2018-11-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23100
  
A change of this type can really piss some people off. Was there consensus on this?


---




[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...

2018-11-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23096#discussion_r235182105
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
 ---
@@ -88,15 +101,20 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] 
extends Logging {
 val startTime = System.nanoTime()
 val result = rule(plan)
 val runTime = System.nanoTime() - startTime
+val effective = !result.fastEquals(plan)
 
-if (!result.fastEquals(plan)) {
+if (effective) {
   queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
   
queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
   planChangeLogger.log(rule.ruleName, plan, result)
 }
 queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, 
runTime)
 queryExecutionMetrics.incNumExecution(rule.ruleName)
 
+if (tracker ne null) {
--- End diff --

If one calls execute directly, the tracker would be null.


---




[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...

2018-11-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23096#discussion_r235162047
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala
 ---
@@ -88,15 +92,18 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] 
extends Logging {
 val startTime = System.nanoTime()
 val result = rule(plan)
 val runTime = System.nanoTime() - startTime
+val effective = !result.fastEquals(plan)
 
-if (!result.fastEquals(plan)) {
+if (effective) {
   queryExecutionMetrics.incNumEffectiveExecution(rule.ruleName)
   
queryExecutionMetrics.incTimeEffectiveExecutionBy(rule.ruleName, runTime)
   planChangeLogger.log(rule.ruleName, plan, result)
 }
 queryExecutionMetrics.incExecutionTimeBy(rule.ruleName, 
runTime)
 queryExecutionMetrics.incNumExecution(rule.ruleName)
 
+tracker.foreach(_.recordRuleInvocation(rule.ruleName, runTime, 
effective))
--- End diff --

yes! (not great -- but I'd probably remove the global tracker at some point)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...

2018-11-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23096#discussion_r235161825
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -696,7 +701,7 @@ class Analyzer(
   s"avoid errors. Increase the value of 
${SQLConf.MAX_NESTED_VIEW_DEPTH.key} to work " +
   "around this.")
   }
-  executeSameContext(child)
+  executeSameContext(child, None)
--- End diff --

No great reason. I just used None for everything, except the top level, 
because it is very difficult to wire the tracker here without refactoring a lot 
of code.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for per-query ...

2018-11-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23096#discussion_r235161336
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala
 ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.util.BoundedPriorityQueue
+
+
+/**
+ * A simple utility for tracking runtime and associated stats in query 
planning.
+ *
+ * There are two separate concepts we track:
+ *
+ * 1. Phases: These are broad scope phases in query planning, as listed 
below, i.e. analysis,
+ * optimization and physical planning (just planning).
+ *
+ * 2. Rules: These are the individual Catalyst rules that we track. In 
addition to time, we also
+ * track the number of invocations and effective invocations.
+ */
+object QueryPlanningTracker {
+
+  // Define a list of common phases here.
+  val PARSING = "parsing"
--- End diff --

Mostly because Scala enum is not great, and I was thinking about making 
this a generic thing that's extensible.
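
To make that trade-off concrete, here is a minimal sketch of the string-constant approach (constant names taken from the diff above; the surrounding object is simplified): any component can introduce a new phase without touching a sealed enum hierarchy, and the phase names remain cheap map keys.

```
object PlanningPhases {
  // Common phases, mirroring the constants in the diff above.
  val PARSING = "parsing"
  val ANALYSIS = "analysis"
  val OPTIMIZATION = "optimization"
  val PLANNING = "planning"
}

// A downstream component can add its own phase without modifying the original
// object -- something a closed enumeration would not allow.
object MyPhases {
  val CUSTOM_REWRITE = "customRewrite"
}
```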



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23096: [SPARK-26129][SQL] Instrumentation for per-query plannin...

2018-11-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23096
  
cc @hvanhovell @gatorsmile 

This is different from the existing metrics for rules as it is query 
specific. We might want to replace that one with this in the future.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23096: [SPARK-26129][SQL] Instrumentation for query plan...

2018-11-20 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/23096

[SPARK-26129][SQL] Instrumentation for query planning time

## What changes were proposed in this pull request?
We currently don't have good visibility into query planning time (analysis 
vs optimization vs physical planning). This patch adds a simple utility to 
track the runtime of various rules and various planning phases.
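
A hedged sketch of how the instrumentation might be consulted after running a query; the accessors shown here (`queryExecution.tracker`, `phases`, `rules`) are assumptions for illustration, not necessarily the final API, and `spark` is an existing SparkSession:

```
val df = spark.sql("SELECT count(*) FROM range(1000) WHERE id > 100")

// Hypothetical accessors: a tracker hanging off the query execution that
// exposes per-phase timings and per-rule invocation stats.
val tracker = df.queryExecution.tracker
tracker.phases.foreach { case (phase, summary) => println(s"$phase -> $summary") }
tracker.rules.foreach { case (rule, summary) => println(s"$rule -> $summary") }
```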

## How was this patch tested?
Added unit tests and end-to-end integration tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-26129

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23096


commit b6a3d02f2c2b0eff71f92c3ede854edc3b5bf9f8
Author: Reynold Xin 
Date:   2018-11-20T16:22:35Z

[SPARK-26129][SQL] Instrumentation for query planning time




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of non-struct ty...

2018-11-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23054#discussion_r234569150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1594,6 +1594,15 @@ object SQLConf {
 "WHERE, which does not follow SQL standard.")
   .booleanConf
   .createWithDefault(false)
+
+  val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY =
+buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey")
--- End diff --

Maybe aliasNonStructGroupingKeyAsValue, and default to true.

Then we can remove this in the future.
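
A sketch of that suggestion, assuming the config stays in SQLConf; the name, doc text, and default here reflect the proposal in this comment rather than whatever is finally merged:

```
val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY_AS_VALUE =
  buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKeyAsValue")
    .doc("When true, alias a non-struct grouping key to 'value' in " +
      "Dataset.groupByKey results, preserving the old behavior.")
    .booleanConf
    .createWithDefault(true)
```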


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...

2018-11-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23054
  
BTW what do the non-primitive types look like? Do they get flattened, or is there a struct?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...

2018-11-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23054
  
We should add a “legacy” flag in case somebody’s workload gets broken 
by this. We can remove the legacy flag in a future release.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2018-11-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18784
  
Go for it.

On Fri, Nov 16, 2018 at 6:08 AM Stavros Kontopoulos <
notificati...@github.com> wrote:

> @imaxxs <https://github.com/imaxxs> @rxin <https://github.com/rxin> I
> think its a good time to remove this, I will update the PR if you are all
> ok.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/18784#issuecomment-439403392>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPI-PKZYYhazC7vTtoMHqJv9eA-xlks5uvsbegaJpZM4OoOmC>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23021
  
One thing - I would put “pandas” right after test_ so you get the 
natural
logical grouping with sorting by file name.

On Tue, Nov 13, 2018 at 4:58 PM Hyukjin Kwon 
wrote:

> I am going to push after testing and double checking. The line counts
> would look like this
>
>   54 ./test_utils.py
>  199 ./test_catalog.py
>  503 ./test_grouped_agg_pandas_udf.py
>   45 ./test_group.py
>  320 ./test_session.py
>  153 ./test_readwriter.py
>  806 ./test_scalar_pandas_udf.py
>  216 ./test_pandas_udf.py
>  566 ./test_streaming.py
>   55 ./test_conf.py
>   16 ./__init__.py
>  530 ./test_grouped_map_pandas_udf.py
>  157 ./test_column.py
>  654 ./test_udf.py
>  262 ./test_window_pandas_udf.py
>  278 ./test_functions.py
>  263 ./test_context.py
>  138 ./test_serde.py
>  170 ./test_datasources.py
>  399 ./test_arrow.py
>   96 ./test_appsubmit.py
>  944 ./test_types.py
>  737 ./test_dataframe.py
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/23021#issuecomment-438497006>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPOg1IR6S5Fc4qv2mrPWTsDRRxf1Qks5uu2qagaJpZM4YbTYj>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23021
  
Great initiative!

I'd break the pandas udf one into smaller pieces too, as you suggested. We 
should also investigate why the runtime didn't improve ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22957
  
i didn't look at your new code, but is your old code safe? e.g. a project 
that depends on the new alias.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15899
  
Thanks for the example. I didn't even know that was possible in earlier 
versions. I just looked it up: looks like Scala 2.11 rewrites for 
comprehensions into map, filter, and flatMap.

That said, I don't think it's a bad deal that this no longer works, given 
it was never intended to work and there's been a deprecation warning.

I still maintain that it is risky to support this, because Scala users 
learn for comprehensions not just for a simple "for filter yield", but as a way 
to chain multiple generators together, which is not really well supported by 
Spark (even if it is, it's a really bad operation for users to shoot themselves 
in the foot because it would be a cartesian product).

Rather than faking it as a local collection, users should know RDD is not.
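
To make the multi-generator concern concrete, here is a minimal sketch of the desugaring with plain local collections; the RDD analogue is exactly what is not supported, since it would nest one distributed computation inside another.

```
// Scala rewrites the comprehension into flatMap + map:
//   for { x <- xs; y <- ys } yield (x, y)
//   ==>  xs.flatMap(x => ys.map(y => (x, y)))
val xs = Seq(1, 2, 3)
val ys = Seq("a", "b")
val pairs = for { x <- xs; y <- ys } yield (x, y)
// pairs == Seq((1,"a"), (1,"b"), (2,"a"), (2,"b"), (3,"a"), (3,"b"))
// With RDDs, the same shape would require calling ys.map inside a closure that
// runs on executors, and the result is a cartesian product in any case.
```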






---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15899#discussion_r231390266
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -387,6 +387,14 @@ abstract class RDD[T: ClassTag](
   preservesPartitioning = true)
   }
 
+  /**
+* Return a new RDD containing only the elements that satisfy a 
predicate.
--- End diff --

Why bother unless we have consensus to introduce this API?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...

2018-11-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22889
  
Yea good idea (prefer Array over Seq for short lists)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22921
  
seems good to me; might want to leave this open for a few days so more 
people can take a look


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230135473
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -62,17 +62,6 @@ class SQLContext private[sql](val sparkSession: 
SparkSession)
 
   sparkSession.sparkContext.assertNotStopped()
 
-  // Note: Since Spark 2.0 this class has become a wrapper of 
SparkSession, where the
--- End diff --

keep these two lines?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230132632
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -639,20 +639,6 @@ private[spark] object SparkConf extends Logging {
*/
   private val deprecatedConfigs: Map[String, DeprecatedConfig] = {
 val configs = Seq(
-  DeprecatedConfig("spark.cache.class", "0.8",
--- End diff --

do we need to remove these? they are warnings for users if they set the wrong config, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22830
  
Perhaps @jkbradley and @mengxr can comment on it. If the trait is 
inheritable, then protected still means it is part of the API contract.
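
A minimal sketch of that point, using a simplified stand-in for the real trait (the actual MLlib signatures differ): because users can mix the trait into their own classes, a protected member is still something they compile against, and removing it is a source-breaking change for them.

```
trait MySaveable {
  def save(path: String): Unit
  protected def formatVersion: String   // visible to, and overridable by, subclasses
}

// A user-defined implementation outside Spark:
class MyModel extends MySaveable {
  override def save(path: String): Unit = ()
  override protected def formatVersion: String = "1.0"  // breaks if the member is removed
}
```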



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22830
  
Who introduced this? We should ask the person that introduced it whether it 
can be removed. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...

2018-10-28 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22870

[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608

## What changes were proposed in this pull request?
This patch removes the rangeBetween functions introduced in SPARK-21608. As 
explained in SPARK-25841, these functions are confusing and don't quite work. 
We will redesign them and introduce better ones in SPARK-25843.

## How was this patch tested?
Removed relevant test cases as well. These test cases will need to be added 
back in SPARK-25843.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-25862

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22870


commit 00b0c6c746f4dbd3aa69071b99cf09bc5b53524a
Author: Reynold Xin 
Date:   2018-10-28T21:11:16Z

[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22853: [SPARK-25845][SQL] Fix MatchError for calendar in...

2018-10-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22853#discussion_r228608016
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala 
---
@@ -267,6 +267,25 @@ class DataFrameWindowFramesSuite extends QueryTest 
with SharedSQLContext {
 )
   }
 
+  test("range between should accept interval values as left boundary") {
--- End diff --

this is using intervals for both boundaries, rather than just the left one.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...

2018-10-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22815#discussion_r228594291
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -54,6 +54,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager
  * @groupname Ungrouped Support functions for language integrated queries
  * @since 1.0.0
  */
+@deprecated("Use SparkSession instead", "3.0.0")
--- End diff --

Yea I wouldn't deprecate it now ... the data source API v1 still depends on 
it. 

Actually, now that I think about it, we should not deprecate SQLContext until dsv2 is stable. Otherwise we would have a stable API (dsv1) depending on a deprecated API.
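
A sketch of that dependency, assuming the familiar data source v1 entry point (signature quoted from memory, so treat it as illustrative): every v1 implementation accepts a SQLContext, so deprecating SQLContext would surface warnings in perfectly stable user code.

```
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

class MyDataSource extends RelationProvider {
  // A stable, non-deprecated API forcing users to accept a (then) deprecated type.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = ???
}
```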




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21588
  
Does this upgrade Hive for execution or also for metastore? Spark supports 
virtually all Hive metastore versions out there, and a lot of deployments do 
run different versions of Spark against the same old Hive metastore, and it'd 
be bad to break connectivity to old Hive metastores.

The execution part is a different story and we can upgrade them easily.
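
The metastore client version is configurable separately from the built-in execution Hive, e.g. (config keys quoted from memory, so double-check them against the docs of the Spark version in question):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("old-metastore-example")
  // Talk to an existing Hive 1.2.1 metastore even if the execution Hive is newer.
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
```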



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...

2018-10-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22841#discussion_r228376622
  
--- Diff: python/pyspark/sql/window.py ---
@@ -239,34 +212,27 @@ def rangeBetween(self, start, end):
 and "5" means the five off after the current row.
 
 We recommend users use ``Window.unboundedPreceding``, 
``Window.unboundedFollowing``,
-``Window.currentRow``, 
``pyspark.sql.functions.unboundedPreceding``,
-``pyspark.sql.functions.unboundedFollowing`` and 
``pyspark.sql.functions.currentRow``
-to specify special boundary values, rather than using integral 
values directly.
+and ``Window.currentRow`` to specify special boundary values, 
rather than using integral
--- End diff --

what do you mean? the old rangeBetween API is still valid.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22775#discussion_r228372331
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -770,8 +776,17 @@ case class SchemaOfJson(
 factory
   }
 
-  override def convert(v: UTF8String): UTF8String = {
-val dt = 
Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser 
=>
+  @transient
+  private lazy val json = child.eval().asInstanceOf[UTF8String]
--- End diff --

It's not weird that users want to use schema_of_json at all.

Imagine it's a very large JSON document with a very complicated structure. It's pretty difficult to actually write the DDL string by hand.
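
A small sketch of that use case, assuming an input DataFrame `events` with a JSON string column `payload`: infer the schema once from a representative document with schema_of_json, then reuse it with from_json instead of hand-writing the DDL.

```
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

val sample = """{"user": {"id": 1, "tags": ["a", "b"]}, "ts": "2018-10-25"}"""
// schema_of_json infers something like:
//   struct<ts:string,user:struct<id:bigint,tags:array<string>>>
val parsed = events.select(from_json(col("payload"), schema_of_json(lit(sample))))
```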



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22775
  
I agree it should be a literal value.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...

2018-10-25 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22841

[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608

## What changes were proposed in this pull request?
See the detailed information at 
https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be 
deprecated and redesigned.

## How was this patch tested?
Only deprecation and doc changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-25842

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22841


commit 0a49c859049a376872053dcfaacba81d47070d77
Author: Reynold Xin 
Date:   2018-10-25T23:44:36Z

[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...

2018-10-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22821
  
We seem to be splitting hairs here. Why are we providing tech preview to
advanced users? Are you saying they construct expressions directly using
internal APIs? I doubt that’s tech preview.

Users can construct a lot of invalid plans that lead to weird semantics or
behaviors if they try, and this doesn’t really make it worse.

In either case it’s not that difficult to remove them and add them back so
I could see it going  either way.

On Thu, Oct 25, 2018 at 7:48 AM Dongjoon Hyun 
wrote:

> I'm just confused here. Shall we finish the discussion on the email
> thread? @cloud-fan <https://github.com/cloud-fan> and @gatorsmile
> <https://github.com/gatorsmile> . If the decision is officially made like
> that (providing tech. preview to advance users) in the email thread, I'm
> okay with this.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/22821#issuecomment-433080200>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPODR0wtirRLTLDDSb7agOF-U8fqLks5uoc8ogaJpZM4X5fHp>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...

2018-10-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22815
  
LGTM.

On a related note, we should probably deprecate the entire SQLContext.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  

@markhamstra how did you arrive at that conclusion? I said "it’s not a 
new regression and also
somewhat esoteric"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  
It’s certainly not a blocker since it’s not a new regression and also
somewhat esoteric. Would be good to fix though.

On Tue, Oct 23, 2018 at 8:20 AM Wenchen Fan 
wrote:

> This is not a PR that is ready to merge. We are likely talking about
> delaying 2.4.0 for multiple weeks because of this issue. Is it really 
worth?
>
> I'm not sure what's the exact policy, let's ping more people. @rxin
> <https://github.com/rxin> @srowen <https://github.com/srowen> @vanzin
> <https://github.com/vanzin> @felixcheung <https://github.com/felixcheung>
> @gatorsmile <https://github.com/gatorsmile>
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/22144#issuecomment-432287845>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPF8mJeJgi8oznKX1_RA-iXyRH9jJks5unzPKgaJpZM4WDFyL>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
But that would break both IPython notebooks and the REPL, right? Pretty significant breaking change.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...

2018-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22010
  
If this is not yet in 2.4 it shouldn’t be merged now.

On Wed, Oct 10, 2018 at 10:57 AM Holden Karau 
wrote:

> Open question: is this suitable for branch-2.4 since it predates the
> branch cut or not? (I know we've gone back and forth on how we do that).
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/22010#issuecomment-428492653>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPO6Nlv4HOCVe9pPZfCd1GHXoVCDxks5ujbZlgaJpZM4Vw2BM>
> .
>
-- 
-x



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
@superbobry which blog were you referring to?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
so this change would introduce a pretty big regression?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22543: [SPARK-23715][SQL][DOC] improve document for from...

2018-09-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22543#discussion_r220410457
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1018,9 +1018,20 @@ case class TimeAdd(start: Expression, interval: 
Expression, timeZoneId: Option[S
 }
 
 /**
- * Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time 
in UTC, and renders
- * that time as a timestamp in the given time zone. For example, 'GMT+1' 
would yield
- * '2017-07-14 03:40:00.0'.
+ * This is a common function for databases supporting TIMESTAMP WITHOUT 
TIMEZONE. This function
+ * takes a timestamp which is timezone-agnostic, and interprets it as a 
timestamp in UTC, and
+ * renders that timestamp as a timestamp in the given time zone.
+ *
+ * However, timestamp in Spark represents number of microseconds from the 
Unix epoch, which is not
+ * timezone-agnostic. So in Spark this function just shift the timestamp 
value from UTC timezone to
+ * the given timezone.
+ *
+ * This function may return confusing result if the input is a string with 
timezone, e.g.
+ * '2018-03-13T06:18:23+00:00'. The reason is that, Spark firstly cast the 
string to timestamp
+ * according to the timezone in the string, and finally display the result 
by converting the
+ * timestamp to string according to the session local timezone.
+ *
+ * We may remove this function in Spark 3.0.
--- End diff --

should also update the sql doc?
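
For reference, the behavior the new doc text describes, with the example values from the original comment (output assumes a UTC session timezone and an existing SparkSession `spark`):

```
// Interprets the timezone-agnostic input as a time in UTC and renders it in
// the given zone, so 'GMT+1' shifts it forward by one hour.
spark.sql("SELECT from_utc_timestamp('2017-07-14 02:40:00', 'GMT+1')").show()
// 2017-07-14 03:40:00
```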



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519][CORE] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIG...

2018-09-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
seems like our tests are really flaky


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
yup; just did


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22541: [SPARK-23907][SQL] Revert regr_* functions entire...

2018-09-24 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22541

[SPARK-23907][SQL] Revert regr_* functions entirely

## What changes were proposed in this pull request?
This patch reverts entirely all the regr_* functions added in SPARK-23907. 
These were added by @mgaido91 (and proposed by @gatorsmile) to improve 
compatibility with other database systems, without any actual use cases. 
However, they are very rarely used, and in Spark there are much better ways to 
compute these functions, due to Spark's flexibility in exposing real 
programming APIs.

I'm going through all the APIs added in Spark 2.4 and I think we should 
revert these. If there are strong enough demands and more use cases, we can add 
them back in the future pretty easily.
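
One example of those "better ways": regr_slope(y, x) is just covar_pop(y, x) / var_pop(x), which is already expressible with existing aggregate functions. A sketch assuming a DataFrame `df` with numeric columns `y` and `x`:

```
import org.apache.spark.sql.functions.{col, covar_pop, var_pop}

val slope = df
  .select((covar_pop(col("y"), col("x")) / var_pop(col("x"))).as("regr_slope"))
```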

## How was this patch tested?
Reverted test cases also. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-23907

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22541.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22541


commit 623e35f118e3d28a49eb84365079d037fa519186
Author: Reynold Xin 
Date:   2018-09-24T21:30:43Z

[SPARK-23907][SQL] Revert regr_* functions entirely

commit ef0f5b02dbce29d6fbaa6f79f9c2ad62e7a16bb0
Author: Reynold Xin 
Date:   2018-09-24T21:34:34Z

i




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
Jenkins, retest this please.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HI...

2018-09-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22521

[SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once - 
WIP

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-24519

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22521


commit 77442cf7e4b64b745079a1ee62684503c7b8c123
Author: Reynold Xin 
Date:   2018-09-19T00:58:24Z

[SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once

commit f23c2202fbec04983d1181d92f7c124280ebcbe3
Author: Reynold Xin 
Date:   2018-09-21T16:48:59Z

Merge branch 'master' of github.com:apache/spark into SPARK-24519

commit ac3dee3227e4ceee4ec100bbe72988f791ae3c87
Author: Reynold Xin 
Date:   2018-09-21T16:49:52Z

x

commit f6f9658e19ae5e74697ee8846b6ab11ab8eba24c
Author: Reynold Xin 
Date:   2018-09-21T17:31:51Z

fix conflict




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...

2018-09-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21527#discussion_r219559889
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus {
 private[spark] object MapStatus {
 
   def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): 
MapStatus = {
-if (uncompressedSizes.length > 2000) {
+if (uncompressedSizes.length >  Option(SparkEnv.get)
--- End diff --

the only tricky thing is how to write the test cases for this.
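
The shape of the "compute it only once" change, as a self-contained sketch (names are placeholders, not Spark's; the stand-in reader takes the place of the SparkEnv/conf lookup in the diff above):

```
object Thresholds {
  private def readFromConf(): Int = 2000   // stand-in for SparkEnv.get.conf.get(...)

  // Resolved lazily on first use, then reused, instead of being looked up on
  // every MapStatus creation.
  lazy val minPartsToHighlyCompress: Int = readFromConf()
}

def chooseStatus(numPartitions: Int): String =
  if (numPartitions > Thresholds.minPartsToHighlyCompress) "HighlyCompressedMapStatus"
  else "CompressedMapStatus"
```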



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22515: [SPARK-19724][SQL] allowCreatingManagedTableUsing...

2018-09-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22515

[SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should 
have legacy prefix

One more legacy config to go ...

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark 
allowCreatingManagedTableUsingNonemptyLocation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22515.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22515


commit f7c372e6f803c86e189e984fa6c1dd81f84454e9
Author: Reynold Xin 
Date:   2018-09-21T02:10:10Z

[SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should 
have legacy prefix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...

2018-09-20 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/22456


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22509: [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema w...

2018-09-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22509
  
cc @dongjoon-hyun @MaxGekk  we still need this PR, don't we?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


