[GitHub] spark issue #17331: [SPARK-19994][SQL] Wrong outputOrdering for right/full o...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17331
  
**[Test build #74809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74809/testReport)** for PR 17331 at commit [`e4c41dc`](https://github.com/apache/spark/commit/e4c41dcbca9afdcce5ebe44836f5f8cef0a01bb4).





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798739
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
--- End diff --

Yeah, this does not make sense here any more. But please add similar logging of the recovered metadata later, where you have logged the start and available offsets.
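
For example, something along these lines once recovery is done (just a sketch; the exact placement should follow the surrounding code, and the names are the ones already in scope in this method):

```
// Sketch: log the recovered metadata together with the start/available offsets.
logDebug(s"Recovered metadata $offsetSeqMetadata; resuming at batch $currentBatchId " +
  s"with committed offsets $committedOffsets and available offsets $availableOffsets")
```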





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798726
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala ---
@@ -38,6 +38,26 @@ sealed trait Trigger
 
 /**
  * :: Experimental ::
+ * A trigger that runs a query once then terminates
+ *
+ * Scala Example:
+ * {{{
+ *   df.write.trigger(OneTime)
+ * }}}
+ *
+ * Java Example:
+ * {{{
+ *   df.write.trigger(OneTime.create())
--- End diff --

Yes, this doesn't work. Please fix them.
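
Presumably the fixed examples should use `writeStream` and instantiate the case class; a rough sketch, assuming the API proposed in this PR:

```
// Scala: OneTime is a case class, so it must be instantiated.
df.writeStream.trigger(OneTime()).start()
// Java would go through the companion's factory method instead:
// df.writeStream().trigger(OneTime.create()).start();
```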





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798676
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala ---
@@ -38,6 +38,51 @@ sealed trait Trigger
 
 /**
  * :: Experimental ::
+ * A trigger that runs a query once then terminates
+ *
+ * Scala Example:
+ * {{{
+ *   df.write.trigger(OneTime)
+ * }}}
+ *
+ * Java Example:
+ * {{{
+ *   df.write.trigger(OneTime.create())
+ * }}}
+ *
+ * @since 2.2.0
+ */
+@Experimental
+@InterfaceStability.Evolving
+case class OneTime() extends Trigger
+
+/**
+ * :: Experimental ::
+ * Used to create [[OneTime]] triggers for [[StreamingQuery]]s.
--- End diff --

Explain what "one time" means here.
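
For instance, the Scaladoc could spell the semantics out along these lines (the wording is only a suggestion):

```
/**
 * A trigger that processes all the data available at the start of the query
 * in a single batch, and then terminates the query.
 */
```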





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798581
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
+}
 
-offsetLog.get(batchId - 1).foreach {
-  case lastOffsets =>
-committedOffsets = lastOffsets.toStreamProgress(sources)
-logDebug(s"Resuming with committed offsets: $committedOffsets")
+/* identify the current batch id: if commit log indicates we 
successfully processed the
+ * latest batch id in the offset log, then we can safely move to 
the next batch
+ * i.e., committedBatchId + 1
+ */
+batchCommitLog.getLatest() match {
+  case Some((completionBatchId, _))
+if latestBatchId == completionBatchId => {
+/* The last batch was successfully committed, so we can safely 
process a
+ * new next batch but first:
+ * Make a call to getBatch using the offsets from previous 
batch.
+ * because certain sources (e.g., KafkaSource) assume on 
restart the last
+ * batch will be executed before getOffset is called again.
+ */
+availableOffsets.foreach {
+  case (source, end)
+if committedOffsets.get(source).map(_ != 
end).getOrElse(true) =>
+val start = committedOffsets.get(source)
+logDebug(s"Initializing offset retrieval from $source " +
+  s"at start $start end $end")
+source.getBatch(start, end)
+  case _ =>
+}
+currentBatchId = completionBatchId + 1
+committedOffsets ++= availableOffsets
+// Construct a new batch be recomputing availableOffsets
+constructNextBatch()
+  }
+  case Some((completionBatchId, _)) if completionBatchId + 1 != 
latestBatchId =>
+logWarning(s"batch completion log latest batch id is 
${completionBatchId}, " +
--- End diff --

We generally start log messages with caps.
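
I.e., a sketch of the capitalized message:

```
logWarning(s"Batch completion log latest batch id is $completionBatchId, " +
  s"which is not trailing batchid $latestBatchId by one")
```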





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798566
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
--- End diff --

Add a comment on what you are trying to do here. Also, if this is part of the 
`// First assume that we are re-executing the latest batch in the offset log` 
step, then you might as well move this up there if possible.





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798671
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala ---
@@ -38,6 +38,51 @@ sealed trait Trigger
 
 /**
  * :: Experimental ::
+ * A trigger that runs a query once then terminates
+ *
+ * Scala Example:
+ * {{{
+ *   df.write.trigger(OneTime)
+ * }}}
+ *
+ * Java Example:
+ * {{{
+ *   df.write.trigger(OneTime.create())
+ * }}}
+ *
+ * @since 2.2.0
+ */
+@Experimental
+@InterfaceStability.Evolving
+case class OneTime() extends Trigger
+
+/**
+ * :: Experimental ::
+ * Used to create [[OneTime]] triggers for [[StreamingQuery]]s.
+ *
+ * @since 2.2.0
+ */
+@Experimental
+@InterfaceStability.Evolving
+object OneTime {
+
+  /**
+   * Create a [[OneTime]] trigger.
+   *
+   * Example:
+   * {{{
+   *   df.write.trigger(OneTime.create())
--- End diff --

Use `df.writeStream`, not `df.write`.
(Also fix this wherever else it appears.)





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798546
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
--- End diff --

Why remove this debug log?





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798642
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
+}
 
-offsetLog.get(batchId - 1).foreach {
-  case lastOffsets =>
-committedOffsets = lastOffsets.toStreamProgress(sources)
-logDebug(s"Resuming with committed offsets: $committedOffsets")
+/* identify the current batch id: if commit log indicates we 
successfully processed the
+ * latest batch id in the offset log, then we can safely move to 
the next batch
+ * i.e., committedBatchId + 1
+ */
+batchCommitLog.getLatest() match {
+  case Some((completionBatchId, _))
+if latestBatchId == completionBatchId => {
+/* The last batch was successfully committed, so we can safely 
process a
+ * new next batch but first:
+ * Make a call to getBatch using the offsets from previous 
batch.
+ * because certain sources (e.g., KafkaSource) assume on 
restart the last
+ * batch will be executed before getOffset is called again.
+ */
+availableOffsets.foreach {
+  case (source, end)
+if committedOffsets.get(source).map(_ != 
end).getOrElse(true) =>
+val start = committedOffsets.get(source)
+logDebug(s"Initializing offset retrieval from $source " +
+  s"at start $start end $end")
+source.getBatch(start, end)
+  case _ =>
+}
+currentBatchId = completionBatchId + 1
+committedOffsets ++= availableOffsets
+// Construct a new batch be recomputing availableOffsets
+constructNextBatch()
+  }
+  case Some((completionBatchId, _)) if completionBatchId + 1 != 
latestBatchId =>
+logWarning(s"batch completion log latest batch id is 
${completionBatchId}, " +
+  s"which is not trailing batchid $latestBatchId by one")
+  case _ => logInfo("no commit log present")
 }
+logDebug(s"Resuming with committed offsets $committedOffsets " +
--- End diff --

Can you also print the batch id?
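
E.g., a sketch using the ids already computed above:

```
logDebug(s"Resuming at batch $currentBatchId with committed offsets " +
  s"$committedOffsets and available offsets $availableOffsets")
```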





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798664
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/Trigger.scala ---
@@ -38,6 +38,51 @@ sealed trait Trigger
 
 /**
  * :: Experimental ::
+ * A trigger that runs a query once then terminates
+ *
+ * Scala Example:
+ * {{{
+ *   df.write.trigger(OneTime)
+ * }}}
+ *
+ * Java Example:
+ * {{{
+ *   df.write.trigger(OneTime.create())
+ * }}}
+ *
+ * @since 2.2.0
+ */
+@Experimental
+@InterfaceStability.Evolving
+case class OneTime() extends Trigger
+
+/**
+ * :: Experimental ::
+ * Used to create [[OneTime]] triggers for [[StreamingQuery]]s.
+ *
+ * @since 2.2.0
+ */
+@Experimental
+@InterfaceStability.Evolving
+object OneTime {
+
+  /**
+   * Create a [[OneTime]] trigger.
+   *
+   * Example:
+   * {{{
+   *   df.write.trigger(OneTime.create())
+   * }}}
+   *
+   * @since 2.0.0
--- End diff --

Fix the version to 2.2.





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798601
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
+}
 
-offsetLog.get(batchId - 1).foreach {
-  case lastOffsets =>
-committedOffsets = lastOffsets.toStreamProgress(sources)
-logDebug(s"Resuming with committed offsets: $committedOffsets")
+/* identify the current batch id: if commit log indicates we 
successfully processed the
+ * latest batch id in the offset log, then we can safely move to 
the next batch
+ * i.e., committedBatchId + 1
+ */
+batchCommitLog.getLatest() match {
+  case Some((completionBatchId, _))
--- End diff --

I would write the conditions as 
```
case Some((completionBatchId, _)) => 
  if (completionBatchId == latestBatchId) { ... }
  else if (completionBatchId < latestBatchId - 1) { logWarning(...) }
case None => 
```
With the current cases, it is less obvious whether the match is exhaustive or not.
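
Fleshed out, that shape might look like this (a sketch only; the branch bodies are abbreviated from the diff above):

```
batchCommitLog.getLatest() match {
  case Some((completionBatchId, _)) =>
    if (completionBatchId == latestBatchId) {
      // Last batch fully committed: replay getBatch on the sources, then
      // advance to the next batch and construct it.
      currentBatchId = completionBatchId + 1
    } else if (completionBatchId < latestBatchId - 1) {
      logWarning(s"Batch completion log latest batch id is $completionBatchId, " +
        s"which is not trailing batchid $latestBatchId by one")
    }
  case None =>
    logInfo("No commit log present")
}
```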





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798501
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -291,10 +299,13 @@ class StreamExecution(
   runBatch(sparkSessionToRunBatches)
 }
   }
-
   // Report trigger as finished and construct progress object.
   finishTrigger(dataAvailable)
   if (dataAvailable) {
+// Update committed offsets.
+committedOffsets ++= availableOffsets
+logDebug(s"Commit log write ${currentBatchId}")
--- End diff --

I think this debug statement isn't too useful. Rather, you can generalize it 
to something like "batch X committed".
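
I.e., a sketch:

```
logDebug(s"Batch $currentBatchId committed")
```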





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798526
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -392,12 +400,26 @@ class StreamExecution(
*  - currentBatchId
*  - committedOffsets
*  - availableOffsets
+   *  The basic structure of this method is as follows:
--- End diff --

Really like this explanation.





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798638
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
+}
 
-offsetLog.get(batchId - 1).foreach {
-  case lastOffsets =>
-committedOffsets = lastOffsets.toStreamProgress(sources)
-logDebug(s"Resuming with committed offsets: $committedOffsets")
+/* identify the current batch id: if commit log indicates we 
successfully processed the
+ * latest batch id in the offset log, then we can safely move to 
the next batch
+ * i.e., committedBatchId + 1
+ */
+batchCommitLog.getLatest() match {
+  case Some((completionBatchId, _))
+if latestBatchId == completionBatchId => {
+/* The last batch was successfully committed, so we can safely 
process a
+ * new next batch but first:
+ * Make a call to getBatch using the offsets from previous 
batch.
+ * because certain sources (e.g., KafkaSource) assume on 
restart the last
+ * batch will be executed before getOffset is called again.
+ */
+availableOffsets.foreach {
+  case (source, end)
--- End diff --

Remove the `case`; use `foreach { (source, end) => ... }`.





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798639
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -419,14 +441,44 @@ class StreamExecution(
 SQLConf.SHUFFLE_PARTITIONS.key, 
shufflePartitionsToUse.toString)
 }
 
-logDebug(s"Found possibly unprocessed offsets $availableOffsets " +
-  s"at batch timestamp ${offsetSeqMetadata.batchTimestampMs}")
+offsetLog.get(latestBatchId - 1).foreach { lastOffsets =>
+  committedOffsets = lastOffsets.toStreamProgress(sources)
+}
 
-offsetLog.get(batchId - 1).foreach {
-  case lastOffsets =>
-committedOffsets = lastOffsets.toStreamProgress(sources)
-logDebug(s"Resuming with committed offsets: $committedOffsets")
+/* identify the current batch id: if commit log indicates we 
successfully processed the
+ * latest batch id in the offset log, then we can safely move to 
the next batch
+ * i.e., committedBatchId + 1
+ */
+batchCommitLog.getLatest() match {
+  case Some((completionBatchId, _))
+if latestBatchId == completionBatchId => {
+/* The last batch was successfully committed, so we can safely 
process a
+ * new next batch but first:
+ * Make a call to getBatch using the offsets from previous 
batch.
+ * because certain sources (e.g., KafkaSource) assume on 
restart the last
+ * batch will be executed before getOffset is called again.
+ */
+availableOffsets.foreach {
+  case (source, end)
+if committedOffsets.get(source).map(_ != 
end).getOrElse(true) =>
+val start = committedOffsets.get(source)
+logDebug(s"Initializing offset retrieval from $source " +
--- End diff --

Incorrect. You are not doing offset retrieval here. Rather, say something like 
"getting the latest batch from the sources but not executing it".
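
Perhaps along these lines (the wording is only illustrative):

```
logDebug(s"Getting latest batch from $source for start $start and end $end " +
  "without executing it, so the source sees its last batch after restart")
```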





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106798713
  
--- Diff: R/pkg/R/functions.R ---
@@ -2438,7 +2438,8 @@ setMethod("date_format", signature(y = "Column", x = 
"character"),
 #' from_json
 #'
 #' Parses a column containing a JSON string into a Column of 
\code{structType} with the specified
-#' \code{schema}. If the string is unparseable, the Column will contains 
the value NA.
+#' \code{schema} or array of \code{structType} if \code{asJsonArray} is 
enabled. If the string
--- End diff --

For clarity, I'd suggest saying 
`if \code{asJsonArray} is set to \code{TRUE}`.





[GitHub] spark pull request #17331: [SPARK-19994][SQL] Wrong outputOrdering for right...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17331#discussion_r106798699
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
@@ -80,7 +80,18 @@ case class SortMergeJoinExec(
   override def requiredChildDistribution: Seq[Distribution] =
 ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: 
Nil
 
-  override def outputOrdering: Seq[SortOrder] = requiredOrders(leftKeys)
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+case RightOuter =>
+  // For right outer join, values of the left key will be filled with 
nulls if it can't
+  // match the value of the right key, so `nullOrdering` of the left 
key can't be guaranteed.
+  // We should output right key order here.
+  requiredOrders(rightKeys)
+case FullOuter =>
+  // Neither left key nor right key guarantees `nullOrdering` after 
full outer join.
+  Nil
+case _ =>
--- End diff --

If possible, please use a whitelist of join types here. Otherwise, we might forget to 
update this when adding new join types. Then we should throw an exception for 
the default case.
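
A rough sketch of the whitelisted shape (the exact set of join types to enumerate is an assumption here and should be checked against the `JoinType` hierarchy):

```
override def outputOrdering: Seq[SortOrder] = joinType match {
  case Inner | LeftOuter | LeftSemi | LeftAnti => requiredOrders(leftKeys)
  case RightOuter => requiredOrders(rightKeys)
  case FullOuter => Nil
  case x => throw new IllegalArgumentException(s"Unsupported join type $x")
}
```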





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106798694
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()
 
   @transient
-  lazy val gen =
-new JacksonGenerator(
-  child.dataType.asInstanceOf[StructType],
-  writer,
-  new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+case st: StructType => st
+case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts rows to the JSON output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+def getAndReset(): UTF8String = {
+  gen.flush()
+  val json = writer.toString
+  writer.reset()
+  UTF8String.fromString(json)
+}
+
+child.dataType match {
+  case _: StructType =>
+(row: Any) =>
+  gen.write(row.asInstanceOf[InternalRow])
+  getAndReset()
+  case ArrayType(_: StructType, _) =>
+(arr: Any) =>
+  gen.write(arr.asInstanceOf[ArrayData])
+  getAndReset()
+}
+  }
 
   override def dataType: DataType = StringType
 
-  override def checkInputDataTypes(): TypeCheckResult = {
-if (StructType.acceptsType(child.dataType)) {
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType 
match {
+case _: StructType | ArrayType(_: StructType, _) =>
--- End diff --

Right, thanks for testing this out.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74808 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74808/testReport)** for PR 16596 at commit [`e33b50a`](https://github.com/apache/spark/commit/e33b50aae78c79a425ab1e935498919eb0350c97).





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16596
  
Jenkins, retest this please





[GitHub] spark issue #17310: [SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignor...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17310
  
Oh, no. I left them as `true` by default to keep the original behaviour. 
It could look a bit odd because the default is `false` for read and `true` for 
write, but I did my best to keep the original behaviour this way.
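
For illustration, the asymmetry being discussed looks roughly like this (a sketch; these are the CSV options this PR touches):

```
// Read side: whitespace is kept unless the options are enabled (defaults: false).
val df = spark.read
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv("in.csv")

// Write side: whitespace is trimmed unless the options are disabled (defaults: true),
// which preserves the writer's pre-existing behaviour.
df.write
  .option("ignoreLeadingWhiteSpace", "false")
  .option("ignoreTrailingWhiteSpace", "false")
  .csv("out")
```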





[GitHub] spark issue #17310: [SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignor...

2017-03-18 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17310
  
Is this a behavior change from 2.0/2.1?





[GitHub] spark pull request #17314: [SPARK-15790][MLlib] Audit @Since annotations in ...

2017-03-18 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17314#discussion_r106798435
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala 
---
@@ -30,6 +32,7 @@ import 
org.apache.spark.ml.regression.{AFTSurvivalRegression, AFTSurvivalRegress
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
 
+@Since("2.0.0")
 private[r] class AFTSurvivalRegressionWrapper private (
--- End diff --

Here and in many of the files below, these classes are private, though?





[GitHub] spark pull request #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17219#discussion_r106798289
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -315,6 +315,18 @@ def _to_java_trigger(self, sqlContext):
 self.interval)
 
 
+class OneTime():
+"""A trigger that runs a query once and then exits.
+
+.. note:: Experimental
+
+.. versionadded:: 2.1
--- End diff --

This has to be version 2.2.





[GitHub] spark issue #17246: [SPARK-19906][SS][DOCS] Documentation describing how to ...

2017-03-18 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/17246
  
A few minor points.





[GitHub] spark pull request #17246: [SPARK-19906][SS][DOCS] Documentation describing ...

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17246#discussion_r106798265
  
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -373,11 +374,213 @@ The following configurations are optional:
 
 
 
+## Writing Data to Kafka
+
+Here, we describe the support for writing Streaming Queries and Batch 
Queries to Apache Kafka. Take note that 
+Apache Kafka only supports at least once write semantics. Consequently, 
when writing---either Streaming Queries
+or Batch Queries---to Kafka, some records may be duplicated; this can 
happen, for example, if Kafka needs
+to retry a message that was not acknowledged by a Broker, even though that 
Broker received and wrote the message record.
+Structured Streaming cannot prevent such duplicates from occurring due to 
these Kafka write semantics. However, 
+if writing the query is successful, then you can assume that the query 
output was written at least once. A possible
+solution to remove duplicates when reading the written data could be to 
introduce a primary (unique) key 
+that can be used to perform de-duplication when reading.
+
+Each row being written to Kafka has the following schema:
--- End diff --

The DataFrame being written to Kafka should have the following columns in 
its schema.
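
For context, a minimal sketch of writing such a DataFrame (the `key`/`value` columns and the `topic` option follow the schema described above; servers, topic, and the checkpoint path are placeholders):

```
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
```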





[GitHub] spark pull request #17246: [SPARK-19906][SS][DOCS] Documentation describing ...

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17246#discussion_r106798256
  
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -373,11 +374,213 @@ The following configurations are optional:
 
 
 
+## Writing Data to Kafka
+
+Here, we describe the support for writing Streaming Queries and Batch 
Queries to Apache Kafka. Take note that 
+Apache Kafka only supports at least once write semantics. Consequently, 
when writing---either Streaming Queries
+or Batch Queries---to Kafka, some records may be duplicated; this can 
happen, for example, if Kafka needs
+to retry a message that was not acknowledged by a Broker, even though that 
Broker received and wrote the message record.
+Structured Streaming cannot prevent such duplicates from occurring due to 
these Kafka write semantics. However, 
+if writing the query is successful, then you can assume that the query 
output was written at least once. A possible
+solution to remove duplicates when reading the written data could be to 
introduce a primary (unique) key 
+that can be used to perform de-duplication when reading.
--- End diff --

+1 for this suggestion!
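
E.g., de-duplicating on read could look roughly like this (a sketch, assuming a unique `key` column was written with each record):

```
val deduped = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .load()
  .dropDuplicates("key")
```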





[GitHub] spark pull request #17246: [SPARK-19906][SS][DOCS] Documentation describing ...

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17246#discussion_r106798236
  
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -15,40 +15,42 @@ For Scala/Java applications using SBT/Maven project 
definitions, link your appli
 For Python applications, you need to add this above library and its 
dependencies when deploying your
 application. See the [Deploying](#deploying) subsection below.
 
-### Creating a Kafka Source Stream
+## Reading Data from Kafka
+
+### Creating a Kafka Source for Streaming Queries
 
 
 
 {% highlight scala %}
 
 // Subscribe to 1 topic
-val ds1 = spark
+val ds = spark
--- End diff --

Never mind; minor point.





[GitHub] spark pull request #17246: [SPARK-19906][SS][DOCS] Documentation describing ...

2017-03-18 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17246#discussion_r106798211
  
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -15,40 +15,42 @@ For Scala/Java applications using SBT/Maven project 
definitions, link your appli
 For Python applications, you need to add this above library and its 
dependencies when deploying your
 application. See the [Deploying](#deploying) subsection below.
 
-### Creating a Kafka Source Stream
+## Reading Data from Kafka
+
+### Creating a Kafka Source for Streaming Queries
 
 
 
 {% highlight scala %}
 
 // Subscribe to 1 topic
-val ds1 = spark
+val ds = spark
--- End diff --

Hey, `load()` will return a DataFrame, not a Dataset, so `ds` may be a little confusing.
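
I.e., a sketch of the renamed example:

```
// load() returns a DataFrame (Dataset[Row]), so `df` is the clearer name.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
```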





[GitHub] spark issue #17344: [SPARK-19990][TEST][test-maven][WIP] Use the database af...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17344
  
**[Test build #74807 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74807/testReport)** for PR 17344 at commit [`90bd976`](https://github.com/apache/spark/commit/90bd9763399f2cbeed3c93b0d0c1adc024d6602e).





[GitHub] spark issue #17344: [SPARK-19990][TEST][test-maven][WIP] Use the database af...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17344
  
cc @windpiger 





[GitHub] spark pull request #17344: [SPARK-19990][TEST][test-maven][WIP] Use the data...

2017-03-18 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/17344

[SPARK-19990][TEST][test-maven][WIP] Use the database after Hive's current 
Database is dropped

### What changes were proposed in this pull request?
This PR is to fix the following test failure in the Maven build and in PR 
https://github.com/apache/spark/pull/15363.

> org.apache.spark.sql.hive.orc.OrcSourceSuite SPARK-19459/SPARK-18220: 
read char/varchar column written by Hive

```
FAILED: SemanticException [Error 10072]: Database does not exist: db2

  org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
SemanticException [Error 10072]: Database does not exist: db2
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
  at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
  at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
  at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
```

### How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark testtest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17344


commit 90bd9763399f2cbeed3c93b0d0c1adc024d6602e
Author: Xiao Li 
Date:   2017-03-19T04:49:31Z

fix.







[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread ioana-delaney
Github user ioana-delaney commented on the issue:

https://github.com/apache/spark/pull/15363
  
@gatorsmile Thank you. It fails on a clean build as well.





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17192
  
Thank you, @felixcheung, for your review and for moving this forward.





[GitHub] spark issue #17182: [SPARK-19840][SQL] Disallow creating permanent functions...

2017-03-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/17182
  
Thank you, @gatorsmile!





[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17192
  
**[Test build #74806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74806/testReport)** for PR 17192 at commit [`5d390e7`](https://github.com/apache/spark/commit/5d390e7ed34b5de2e264c5f116867a77de39f2ec).





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106797920
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 ---
@@ -425,8 +425,8 @@ object FunctionRegistry {
 expression[BitwiseXor]("^"),
 
 // json
-expression[StructToJson]("to_json"),
-expression[JsonToStruct]("from_json"),
+expression[StructsToJson]("to_json"),
+expression[JsonToStructs]("from_json"),
--- End diff --

(It was @maropu's initial suggestion, and @brkyvz, who could decide what to 
add, agreed on it. It should be fine.)





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15363
  
Let me try to fix this flaky test. See the failure history of this test 
case. 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106797757
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()
 
   @transient
-  lazy val gen =
-new JacksonGenerator(
-  child.dataType.asInstanceOf[StructType],
-  writer,
-  new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+case st: StructType => st
+case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts rows to the JSON output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+def getAndReset(): UTF8String = {
+  gen.flush()
+  val json = writer.toString
+  writer.reset()
+  UTF8String.fromString(json)
+}
+
+child.dataType match {
+  case _: StructType =>
+(row: Any) =>
+  gen.write(row.asInstanceOf[InternalRow])
+  getAndReset()
+  case ArrayType(_: StructType, _) =>
+(arr: Any) =>
+  gen.write(arr.asInstanceOf[ArrayData])
+  getAndReset()
+}
+  }
 
   override def dataType: DataType = StringType
 
-  override def checkInputDataTypes(): TypeCheckResult = {
-if (StructType.acceptsType(child.dataType)) {
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType 
match {
+case _: StructType | ArrayType(_: StructType, _) =>
--- End diff --

It seems `StructType.acceptsType` and `ArrayType.acceptsType` call 
`isInstanceOf[StructType]` and `isInstanceOf[ArrayType]`. To my knowledge, 
`isInstanceOf` and pattern matching are interchangeable in most cases.

(I just found a reference: 
https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s15.html)

Namely, the case below should be fine. (To my knowledge, Scala forbids 
case-to-case inheritance, BTW.)

```
scala> case class A()
defined class A

scala> class B extends A
defined class B

scala> new B() match {case _: A => println(1)}
1
```







[GitHub] spark pull request #17331: [SPARK-19994][SQL] Wrong outputOrdering for right...

2017-03-18 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17331#discussion_r106797733
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
@@ -80,7 +80,18 @@ case class SortMergeJoinExec(
   override def requiredChildDistribution: Seq[Distribution] =
 ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: 
Nil
 
-  override def outputOrdering: Seq[SortOrder] = requiredOrders(leftKeys)
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+case RightOuter =>
+  // For right outer join, values of the left key will be filled with 
nulls if it can't
+  // match the value of the right key, so `nullOrdering` of the left 
key can't be guaranteed.
+  // We should output right key order here.
--- End diff --

OK, I'll use those comments in the original PR.





[GitHub] spark issue #17182: [SPARK-19840][SQL] Disallow creating permanent functions...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17182
  
Will do the work next week. Thanks!





[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16209
  
Yes, we need to extend the DDL parser to support general user-defined types.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17343
  
**[Test build #74805 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74805/testReport)**
 for PR 17343 at commit 
[`368dd29`](https://github.com/apache/spark/commit/368dd29cb7c0cbb06a8762b265172e525e64487d).





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106797634
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1774,10 +1774,11 @@ def json_tuple(col, *fields):
 def from_json(col, schema, options={}):
 """
 Parses a column containing a JSON string into a [[StructType]] or 
[[ArrayType]]
-with the specified schema. Returns `null`, in the case of an 
unparseable string.
+of [[StructType]]s with the specified schema. Returns `null`, in the 
case of an unparseable
+string.
--- End diff --

Sure, let me try.
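
For context, a minimal sketch of the usage under discussion (hypothetical, 
assuming the ArrayType support being added in this PR and a running 
SparkSession):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("from_json-array").getOrCreate()
import spark.implicits._

// An array-of-structs schema: each input string holds a JSON array of objects.
val schema = ArrayType(new StructType().add("a", IntegerType))

// The second row is unparseable and should come back as null, per the docstring.
val df = Seq("""[{"a": 1}, {"a": 2}]""", "not json").toDF("json")
df.select(from_json($"json", schema)).show(truncate = false)
```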





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-18 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r106797619
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()
 
   @transient
-  lazy val gen =
-new JacksonGenerator(
-  child.dataType.asInstanceOf[StructType],
-  writer,
-  new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+case st: StructType => st
+case ArrayType(st: StructType, _) => st
--- End diff --

Ah, it should be fine. This case will already be caught earlier, in
https://github.com/apache/spark/pull/17192/files/185ea6003d60feed20c56de61c17bc304663d99a#diff-6626026091295ad8c0dfb66ecbcd04b1R663.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74802/
Test PASSed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17343
  
**[Test build #74802 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74802/testReport)**
 for PR 17343 at commit 
[`00da825`](https://github.com/apache/spark/commit/00da8254d060291fe6f2fdec3e30b2f30d5a69c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17331: [SPARK-19994][SQL] Wrong outputOrdering for right...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17331#discussion_r106797512
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
@@ -80,7 +80,18 @@ case class SortMergeJoinExec(
   override def requiredChildDistribution: Seq[Distribution] =
 ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: 
Nil
 
-  override def outputOrdering: Seq[SortOrder] = requiredOrders(leftKeys)
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+case RightOuter =>
+  // For right outer join, values of the left key will be filled with 
nulls if it can't
+  // match the value of the right key, so `nullOrdering` of the left 
key can't be guaranteed.
+  // We should output right key order here.
--- End diff --

> // For left and right outer joins, the output is ordered by the streamed 
input's join keys.





[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-18 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r106797505
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
AttributeSet, Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, Join, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  val result = plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) =>
+  reorder(p, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) =>
+  reorder(j, j.outputSet)
+  }
+  // After reordering is finished, convert OrderedJoin back to Join
+  result transform {
+case oj: OrderedJoin => oj.join
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
--- End diff --

ok





[GitHub] spark issue #17331: [SPARK-19994][SQL] Wrong outputOrdering for right/full o...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17331
  
The bug was introduced when we merged `SortMergeJoin` and
`SortMergeOuterJoin`:


https://github.com/apache/spark/pull/11743/files#diff-b669f8cf35f1d2d786582f4d8c49ed14






[GitHub] spark pull request #17331: [SPARK-19994][SQL] Wrong outputOrdering for right...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17331#discussion_r106797457
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
@@ -80,7 +80,18 @@ case class SortMergeJoinExec(
   override def requiredChildDistribution: Seq[Distribution] =
 ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: 
Nil
 
-  override def outputOrdering: Seq[SortOrder] = requiredOrders(leftKeys)
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+case RightOuter =>
+  // For right outer join, values of the left key will be filled with 
nulls if it can't
+  // match the value of the right key, so `nullOrdering` of the left 
key can't be guaranteed.
+  // We should output right key order here.
--- End diff --

This comment is misleading: the output ordering is determined mainly by how
we implement `SortMergeJoinExec`.
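
For context, a minimal sketch (assuming a local SparkSession) of why ordering 
on the left keys cannot be relied on here:

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("smj-ordering").getOrCreate()
import spark.implicits._

// Force a sort-merge join so the small inputs are not broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

val left = Seq(1, 3).toDF("lk")
val right = Seq(2, 3, 4).toDF("rk")

// Right rows without a left match carry lk = null, and those nulls land
// wherever the matching right key falls, so no nullOrdering on lk holds.
// The output is, however, ordered by the streamed side's keys (rk).
left.join(right, $"lk" === $"rk", "right_outer").show()
```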





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15363
  
**[Test build #74804 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74804/testReport)**
 for PR 15363 at commit 
[`1f6a3d6`](https://github.com/apache/spark/commit/1f6a3d63b2206c933d191408a56a6679789a4db5).





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15363
  
retest this please





[GitHub] spark pull request #17331: [SPARK-19994][SQL] Wrong outputOrdering for right...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17331#discussion_r106797210
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
 ---
@@ -80,7 +80,18 @@ case class SortMergeJoinExec(
   override def requiredChildDistribution: Seq[Distribution] =
 ClusteredDistribution(leftKeys) :: ClusteredDistribution(rightKeys) :: 
Nil
 
-  override def outputOrdering: Seq[SortOrder] = requiredOrders(leftKeys)
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+case RightOuter =>
+  // For right outer join, values of the left key will be filled with 
nulls if it can't
+  // match the value of the right key, so `nullOrdering` of the left 
key can't be guaranteed.
+  // We should output right key order here.
+  requiredOrders(rightKeys)
--- End diff --

This is the output ordering, right? So the join result will be returned
ordered by the right keys?





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15363
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74803/
Test FAILed.





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15363
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15363
  
**[Test build #74803 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74803/testReport)**
 for PR 15363 at commit 
[`1f6a3d6`](https://github.com/apache/spark/commit/1f6a3d63b2206c933d191408a56a6679789a4db5).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class StarSchemaDetection(conf: SQLConf) extends PredicateHelper `
  * `case class ReorderJoin(conf: SQLConf) extends Rule[LogicalPlan] with 
PredicateHelper `





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Build finished. Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74799/
Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74799 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74799/testReport)**
 for PR 16596 at commit 
[`1821e21`](https://github.com/apache/spark/commit/1821e21483904cf2890e9c7ba420d72a20623a74).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-18 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17191
  
Ah, my bad. I'll re-check.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74797/
Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74797 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74797/testReport)**
 for PR 16596 at commit 
[`cc44ae5`](https://github.com/apache/spark/commit/cc44ae577a972c26623e26d349aa2990d33b5b28).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Build finished. Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74795/
Test PASSed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74795 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74795/testReport)**
 for PR 16596 at commit 
[`f37c891`](https://github.com/apache/spark/commit/f37c891a8f38c244c8be7c452581778d1e2e180f).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16330
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74794/
Test PASSed.





[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16330
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16330
  
**[Test build #74794 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74794/testReport)**
 for PR 16330 at commit 
[`ac9fbfc`](https://github.com/apache/spark/commit/ac9fbfc5d511877f7775c620ff8e1c672880ee50).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74801/
Test FAILed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74801 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74801/testReport)**
 for PR 16596 at commit 
[`e33b50a`](https://github.com/apache/spark/commit/e33b50aae78c79a425ab1e935498919eb0350c97).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74800/
Test FAILed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17343
  
**[Test build #74800 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74800/testReport)**
 for PR 17343 at commit 
[`1834db6`](https://github.com/apache/spark/commit/1834db60b7f504862f6ef03bc828264c65bdabd3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74798/
Test FAILed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17343
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17343
  
**[Test build #74798 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74798/testReport)**
 for PR 17343 at commit 
[`e9ac76e`](https://github.com/apache/spark/commit/e9ac76edb055d08699d9de7a5ff77b7ca8a7f5c6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16596
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74796/
Test FAILed.





[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16596
  
**[Test build #74796 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74796/testReport)**
 for PR 16596 at commit 
[`61d6ba6`](https://github.com/apache/spark/commit/61d6ba64774b4c65a4c05f69e1a97f4f978464db).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15363: [SPARK-17791][SQL] Join reordering using star schema det...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15363
  
**[Test build #74803 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74803/testReport)**
 for PR 15363 at commit 
[`1f6a3d6`](https://github.com/apache/spark/commit/1f6a3d63b2206c933d191408a56a6679789a4db5).





[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs

2017-03-18 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/17342
  
Hi @rxin, could you please review this PR? Thanks.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795788
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with 
SQLTestUtils with SharedSQLContext
 }
   }
 
+  test("SPARK-19993 subquery caching") {
+withTempView("t1", "t2") {
+  Seq(1).toDF("c1").createOrReplaceTempView("t1")
+  Seq(1).toDF("c1").createOrReplaceTempView("t2")
+
+  val ds1 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(ds1) == 0)
+
+  ds1.cache()
+
+  val cachedDs =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(cachedDs) == 1)
+
+  // Additional predicate in the subquery plan should cause a cache 
miss
+  val cachedMissDs =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1 where c1 = 0)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(cachedMissDs) == 0)
+
+  // Simple correlated predicate in subquery
+  val ds2 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(ds2) == 0)
+
+  ds2.cache()
+
+  val cachedDs2 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+  """.stripMargin)
+
+  assert(getNumInMemoryRelations(cachedDs2) == 1)
+
+  spark.catalog.cacheTable("t1")
+  ds1.unpersist()
--- End diff --

@gatorsmile Sure. will do.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795772
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
   case _ => false
 }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in 
the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables 
to compare two
--- End diff --

@gatorsmile OK.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795763
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }
 
+/**
+ * This expression is used to represent any form of subquery expression 
namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient 
expression
+ * that only lives in the scope of sameResult function call. In other 
words, analyzer,
+ * optimizer or planner never sees this expression type during 
transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = 
s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include 
the
+  // sub query plan. Doing so causes issue when we canonicalize 
expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+val h = Objects.hashCode(expr.children)
+h * 31 + Objects.hashCode(this.getClass.getName)
+  }
+
+  override def equals(o: Any): Boolean = o match {
+case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
+case other => false
--- End diff --

@gatorsmile Will change. Thanks.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795750
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
   case _ => false
 }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in 
the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables 
to compare two
+   * plans which has subquery expressions.
+   */
+  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): 
CanonicalizedSubqueryExpr = {
+// Normalize the outer references in the subquery plan.
+val subPlan = e.plan.transformAllExpressions {
+  case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, 
allowFailures = true)
--- End diff --

@gatorsmile Will change. Thanks.





[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17343
  
**[Test build #74802 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74802/testReport)**
 for PR 17343 at commit 
[`00da825`](https://github.com/apache/spark/commit/00da8254d060291fe6f2fdec3e30b2f30d5a69c8).





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16971
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74793/
Test FAILed.





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16971
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16971
  
**[Test build #74793 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74793/testReport)**
 for PR 16971 at commit 
[`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17191
  
We have the same limitation. To do it in MySQL and Postgres, you need to
use quotes/backticks.





[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17330
  
Generally, it looks good to me. cc @hvanhovell @rxin @cloud-fan 





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795294
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with 
SQLTestUtils with SharedSQLContext
 }
   }
 
+  test("SPARK-19993 subquery caching") {
+withTempView("t1", "t2") {
+  Seq(1).toDF("c1").createOrReplaceTempView("t1")
+  Seq(1).toDF("c1").createOrReplaceTempView("t2")
+
+  val ds1 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(ds1) == 0)
+
+  ds1.cache()
+
+  val cachedDs =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(cachedDs) == 1)
+
+  // Additional predicate in the subquery plan should cause a cache 
miss
+  val cachedMissDs =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|NOT EXISTS (SELECT * FROM t1 where c1 = 0)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(cachedMissDs) == 0)
+
+  // Simple correlated predicate in subquery
+  val ds2 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+  """.stripMargin)
+  assert(getNumInMemoryRelations(ds2) == 0)
+
+  ds2.cache()
+
+  val cachedDs2 =
+sql(
+  """
+|SELECT * FROM t1
+|WHERE
+|t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+  """.stripMargin)
+
+  assert(getNumInMemoryRelations(cachedDs2) == 1)
+
+  spark.catalog.cacheTable("t1")
+  ds1.unpersist()
--- End diff --

How about splitting the test cases into multiple individual ones? The cache
will then be cleared after each test case, which avoids extra checking like
`assert(getNumInMemoryRelations(cachedMissDs) == 0)` or `ds1.unpersist()`:
```Scala
  override def afterEach(): Unit = {
try {
  spark.catalog.clearCache()
} finally {
  super.afterEach()
}
  }
```





[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17342
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74792/
Test PASSed.





[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs

2017-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17342
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs

2017-03-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17342
  
**[Test build #74792 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74792/testReport)**
 for PR 17342 at commit 
[`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...

2017-03-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17330#discussion_r106795228
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
   case _ => false
 }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in 
the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables 
to compare two
--- End diff --

Also replace `SubqueryExpression` with `CanonicalizedSubqueryExpr`.




