[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18047
  
**[Test build #77139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77139/testReport)** for PR 18047 at commit [`0218578`](https://github.com/apache/spark/commit/0218578eb23fd6a4eb40674009a2791698411607).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18047
  
**[Test build #77138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77138/testReport)** for PR 18047 at commit [`043d837`](https://github.com/apache/spark/commit/043d8376350ad163d00fb154e551387c22d6dac3).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class StringReplace(srcExpr: Expression, searchExpr: Expression, replaceExpr: Expression)`





[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18047
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77138/
Test FAILed.





[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18047
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18047
  
**[Test build #77138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77138/testReport)** for PR 18047 at commit [`043d837`](https://github.com/apache/spark/commit/043d8376350ad163d00fb154e551387c22d6dac3).





[GitHub] spark pull request #18047: [SPARK-20750][SQL] Built-in SQL Function Support ...

2017-05-20 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/18047

[SPARK-20750][SQL] Built-in SQL Function Support - REPLACE

## What changes were proposed in this pull request?

This PR adds the built-in SQL function `REPLACE(<str>, <search>[, <replace>])`.

`REPLACE()` returns the source string with all occurrences of the search string replaced by the given replacement string (or removed, when no replacement is given).
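The intended semantics can be modeled in a few lines (Python used here only as an illustration; the actual implementation is the Scala `StringReplace` expression, and the helper name below is hypothetical):

```python
# Model of REPLACE(<str>, <search>[, <replace>]) semantics.
# When the replacement is omitted, occurrences of the search string
# are removed (replaced with the empty string).
def sql_replace(src: str, search: str, replace: str = "") -> str:
    return src.replace(search, replace)

print(sql_replace("ABCabc", "abc", "DEF"))  # ABCDEF
print(sql_replace("ABCabc", "abc"))         # ABC
```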

## How was this patch tested?

Added new test suites.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-20750

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18047


commit 043d8376350ad163d00fb154e551387c22d6dac3
Author: Kazuaki Ishizaki 
Date:   2017-05-21T05:37:22Z

initial commit







[GitHub] spark issue #18019: [SPARK-20748][SQL] Add built-in SQL function CH[A]R.

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18019
  
**[Test build #77137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77137/testReport)** for PR 18019 at commit [`e003924`](https://github.com/apache/spark/commit/e0039247dd24559d993b7bbc4cd321f9c9198459).





[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-05-20 Thread barrybecker4
Github user barrybecker4 commented on the issue:

https://github.com/apache/spark/pull/17558
  
It continues to fail with one of the above errors. Here is the command I 
use to build.
 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.5 package





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77136/
Test PASSed.





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #77136 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77136/testReport)** for PR 18029 at commit [`ff9a586`](https://github.com/apache/spark/commit/ff9a58669853ae0508d3ef599947d15a92e1f712).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class KinesisInitialPositionInStream (`





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #77136 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77136/testReport)** for PR 18029 at commit [`ff9a586`](https://github.com/apache/spark/commit/ff9a58669853ae0508d3ef599947d15a92e1f712).





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread yssharma
Github user yssharma commented on the issue:

https://github.com/apache/spark/pull/18029
  
Commit 
https://github.com/apache/spark/commit/424550c8450937f78ce608ff7b18e46f41478a8a 
should fix the timeouts mentioned in the 
https://github.com/apache/spark/commit/b71a8d621ff048958dd5f10ef16cf5989026ed5f 
commit.





[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

2017-05-20 Thread johnc1231
Github user johnc1231 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17459#discussion_r117621336
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
 ---
@@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
*/
   @Since("1.3.0")
   def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
-// TODO: This implementation may be optimized
-toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
+require(rowsPerBlock > 0,
+  s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
+require(colsPerBlock > 0,
+  s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
+
+val m = numRows()
+val n = numCols()
+val lastRowBlockIndex = m / rowsPerBlock
--- End diff --

Good point. Replaced word "last" with "remainder" and added a small 
clarifying comment. 
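The remainder-block arithmetic from the quoted diff can be sketched as follows (a Python model with hypothetical names, not the PR's Scala code): a matrix with `m` rows split into blocks of `rowsPerBlock` rows yields `ceil(m / rowsPerBlock)` row blocks, where the final "remainder" block may hold fewer rows.

```python
import math

# Sketch of the row-block layout computed in toBlockMatrix (names hypothetical).
def row_block_layout(m: int, rows_per_block: int):
    num_row_blocks = math.ceil(m / rows_per_block)
    remainder_rows = m % rows_per_block  # 0 means the last block is full-sized
    return num_row_blocks, remainder_rows

print(row_block_layout(10, 3))  # (4, 1): three full blocks of 3 rows, one block of 1 row
```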





[GitHub] spark issue #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and zinc to...

2017-05-20 Thread som-snytt
Github user som-snytt commented on the issue:

https://github.com/apache/spark/pull/17982
  
Thanks for the effort. I'll take a hack soon. If it's hopeless, I'll at 
least try to track developments with the new REPL API.





[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18046
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77135/
Test PASSed.





[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18046
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18046
  
**[Test build #77135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77135/testReport)** for PR 18046 at commit [`82ef305`](https://github.com/apache/spark/commit/82ef30599844a098c3059cca480bbd1b709652c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BitLength(child: Expression) extends UnaryExpression with ImplicitCastInputTypes`
  * `case class OctetLength(child: Expression) extends UnaryExpression with ImplicitCastInputTypes`





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117621035
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
 }
 
 /**
- * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
- * plus a bitmap for tracking which blocks are empty.
+ * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
+ * than both spark.shuffle.accurateBlockThreshold and
+ * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
+ * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
  *
  * @param loc location where the task is being executed
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizes sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+@transient private var hugeBlockSizes: Map[Int, Byte])
--- End diff --

The control of `spark.reducer.maxSizeInFlight` is not a big problem. It seems to me that any block considered huge will exceed `maxSizeInFlight` and can't be fetched in parallel anyway. We don't actually need to know the accurate size of a huge block; we just need to know that it is huge, i.e., larger than `maxSizeInFlight`.
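The reasoning above can be reduced to a single comparison (a hypothetical simplification of the fetcher's batching decision, not Spark's actual code):

```python
# Once a block's size exceeds maxSizeInFlight, the reducer must fetch it on
# its own rather than batching it with other blocks, so recording "huge,
# at least maxSizeInFlight bytes" is as useful as recording the exact size.
def can_batch_with_others(block_size: int, max_size_in_flight: int) -> bool:
    return block_size <= max_size_in_flight

print(can_batch_with_others(50 * 1024**2, 48 * 1024**2))  # False: fetched alone
```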





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117620949
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
+val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+if (numNonEmptyBlocks > 0) {
+  i = 0
+  while (i < totalNumBlocks) {
+if (uncompressedSizes(i) > threshold) {
+  hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
+
+}
+i += 1
+  }
+}
 emptyBlocks.trim()
 emptyBlocks.runOptimize()
-new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
--- End diff --

I'd tend to have just one flag and simplify the configuration.





[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...

2017-05-20 Thread johnc1231
Github user johnc1231 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17459#discussion_r117620801
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala
 ---
@@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") (
*/
   @Since("1.3.0")
   def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
-// TODO: This implementation may be optimized
-toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
+require(rowsPerBlock > 0,
+  s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
+require(colsPerBlock > 0,
+  s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
+
+val m = numRows()
+val n = numCols()
+val lastRowBlockIndex = m / rowsPerBlock
+val lastColBlockIndex = n / colsPerBlock
+val lastRowBlockSize = (m % rowsPerBlock).toInt
+val lastColBlockSize = (n % colsPerBlock).toInt
+val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
+val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
+
+val blocks = rows.flatMap { ir: IndexedRow =>
+  val blockRow = ir.index / rowsPerBlock
+  val rowInBlock = ir.index % rowsPerBlock
+
+  ir.vector match {
+case SparseVector(size, indices, values) =>
+  indices.zip(values).map { case (index, value) =>
+val blockColumn = index / colsPerBlock
--- End diff --

So it is true that IndexedRowMatrix could have a Long number of rows, but 
BlockMatrix is backed by an RDD of ((Int, Int), Matrix), so we're limited by 
that. I can just add a check that computes whether it's possible to make a 
BlockMatrix from the given IndexedRowMatrix. 
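The feasibility check described above amounts to verifying that the block counts fit in 32-bit signed integers, since `BlockMatrix` keys blocks by `(Int, Int)`. A Python sketch of such a check (names hypothetical, not the PR's actual code):

```python
# BlockMatrix block indices are Scala Ints, so the number of row (and column)
# blocks must fit in a 32-bit signed integer even though an IndexedRowMatrix
# can have a Long number of rows.
INT_MAX = 2**31 - 1

def fits_in_block_matrix(num_rows: int, rows_per_block: int) -> bool:
    num_row_blocks = -(-num_rows // rows_per_block)  # ceiling division
    return num_row_blocks <= INT_MAX

print(fits_in_block_matrix(10**12, 1024))  # True: ~9.8e8 blocks fits in an Int
```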





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17993
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77134/
Test FAILed.





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17993
  
**[Test build #77134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77134/testReport)** for PR 17993 at commit [`b8c4147`](https://github.com/apache/spark/commit/b8c4147d3b7dd2c1d0e6b3015042271e754a18cf).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17993
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117620188
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
+val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+if (numNonEmptyBlocks > 0) {
+  i = 0
+  while (i < totalNumBlocks) {
+if (uncompressedSizes(i) > threshold) {
+  hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
+
+}
+i += 1
+  }
+}
 emptyBlocks.trim()
 emptyBlocks.runOptimize()
-new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
--- End diff --

With the default values (spark.shuffle.accurateBlockThreshold=100M and spark.shuffle.accurateBlockThresholdByTimesAverage=2), yes. But the user can make the rule stricter by setting spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1.
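The threshold rule from the quoted diff can be modeled concisely (a Python sketch of `max(threshold1, threshold2)`, not Spark's actual code; helper names are hypothetical):

```python
# A block's size is recorded accurately only if it exceeds
# max(accurateBlockThreshold, accurateBlockThresholdByTimesAverage * avgSize).
def record_accurately(block_size, avg_size,
                      accurate_block_threshold=100 * 1024**2,  # 100M default
                      times_average=2):
    threshold = max(accurate_block_threshold, times_average * avg_size)
    return block_size > threshold

# Defaults: a 150M block among ~10M-average blocks crosses the 100M threshold.
print(record_accurately(150 * 1024**2, 10 * 1024**2))       # True
# Strict settings (threshold=0, times=1): any above-average block qualifies.
print(record_accurately(12 * 1024**2, 10 * 1024**2, 0, 1))  # True
```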





[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...

2017-05-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18046
  
It seems you have the wrong JIRA number. Also, you need to add tests in 
`SQLQueryTestSuite`. Thanks.





[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18046
  
**[Test build #77135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77135/testReport)** for PR 18046 at commit [`82ef305`](https://github.com/apache/spark/commit/82ef30599844a098c3059cca480bbd1b709652c8).





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17993
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18046: [SPARK-20746][SQL] Built-in SQL Function Support ...

2017-05-20 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/18046

[SPARK-20746][SQL] Built-in SQL Function Support - all variants of LEN[GTH]

## What changes were proposed in this pull request?

This PR adds the built-in SQL functions `BIT_LENGTH()`, `CHAR_LENGTH()`, and `OCTET_LENGTH()`.

`BIT_LENGTH()` returns the bit length of the given string or binary expression.
`CHAR_LENGTH()` returns the character length of the given string or binary expression (i.e., equivalent to `LENGTH()`).
`OCTET_LENGTH()` returns the byte length of the given string or binary expression.
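The three functions can be modeled in a few lines (Python used only to illustrate the semantics for string input; the PR implements them as Catalyst expressions, and behavior on binary input follows the byte count directly):

```python
# Models of the three length functions for string/bytes input.
def bit_length(s) -> int:
    data = s.encode("utf-8") if isinstance(s, str) else s
    return len(data) * 8

def octet_length(s) -> int:
    data = s.encode("utf-8") if isinstance(s, str) else s
    return len(data)

def char_length(s: str) -> int:
    return len(s)  # number of characters, like LENGTH()

print(bit_length("Spark"), octet_length("Spark"), char_length("Spark"))  # 40 5 5
```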

## How was this patch tested?

Added new test suites for these three functions

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-20749

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18046.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18046


commit 82ef30599844a098c3059cca480bbd1b709652c8
Author: Kazuaki Ishizaki 
Date:   2017-05-20T23:08:36Z

initial commit







[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17993
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77133/
Test FAILed.





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17993
  
**[Test build #77133 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77133/testReport)**
 for PR 17993 at commit 
[`cc026da`](https://github.com/apache/spark/commit/cc026da840714bc2f88076dbb2aafa70aa1fa0b7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117619380
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,62 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ *
+ * Approach used:
+ * - Start from AND operator as the root
+ * - Get all the children conjunctive predicates which are EqualTo / EqualNullSafe
+ *   such that they don't have a `NOT` or `OR` operator in them
+ * - Populate a mapping of attribute => constant value by looking at all the equals
+ *   predicates
+ * - Using this mapping, replace occurrence of the attributes with the corresponding
+ *   constant values in the AND node.
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression.find {
+    case _: Not | _: Or => true
+    case _ => false
+  }.isDefined
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case f: Filter => f transformExpressionsUp {
--- End diff --

I was initially doing this for the entire logical plan but have now switched to 
doing it only for the Filter operator. 
Reason: doing this for the entire logical plan would mess up JOIN 
predicates, e.g.

```
SELECT * FROM a JOIN b ON a.i = 1 AND b.i = a.i
=>
SELECT * FROM a JOIN b ON a.i = 1 AND b.i = 1
```

.. the result is a cartesian product and Spark fails (asking to set a 
config). In case of OUTER JOINs, changing the join predicates might cause 
a regression.
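The Filter-only rule described above can be sketched on a toy expression tree (a hypothetical mini-AST for illustration, not Catalyst's real `Expression` classes): collect `attr = literal` bindings from the top-level conjuncts, then rewrite the remaining conjuncts, keeping the defining equalities intact.

```scala
// Toy constant propagation over conjunctive equality predicates.
// Assumption: hypothetical mini-AST, not Catalyst's Expression hierarchy.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Eq(l: Expr, r: Expr) extends Expr
case class Conj(l: Expr, r: Expr) extends Expr // AND of two predicates

object ConstProp {
  // Collect attr -> literal bindings from conjuncts of the form `a = lit`.
  def bindings(e: Expr): Map[String, Int] = e match {
    case Conj(l, r)          => bindings(l) ++ bindings(r)
    case Eq(Attr(a), Lit(v)) => Map(a -> v)
    case Eq(Lit(v), Attr(a)) => Map(a -> v)
    case _                   => Map.empty
  }

  // Substitute bound attributes and fold constant additions.
  def subst(e: Expr, env: Map[String, Int]): Expr = e match {
    case Attr(a) if env.contains(a) => Lit(env(a))
    case Add(l, r) => (subst(l, env), subst(r, env)) match {
      case (Lit(a), Lit(b)) => Lit(a + b) // constant folding
      case (l2, r2)         => Add(l2, r2)
    }
    case Eq(l, r)   => Eq(subst(l, env), subst(r, env))
    case Conj(l, r) => Conj(subst(l, env), subst(r, env))
    case other      => other
  }

  // Rewrite every conjunct except the defining `a = lit` equalities themselves.
  def propagate(e: Expr): Expr = {
    val env = bindings(e)
    def rewrite(c: Expr): Expr = c match {
      case Conj(l, r)              => Conj(rewrite(l), rewrite(r))
      case d @ Eq(Attr(_), Lit(_)) => d // keep the defining predicate
      case d @ Eq(Lit(_), Attr(_)) => d
      case other                   => subst(other, env)
    }
    rewrite(e)
  }
}

object ConstPropDemo {
  def main(args: Array[String]): Unit = {
    // WHERE i = 5 AND j = i + 3  ==>  WHERE i = 5 AND j = 8
    val pred = Conj(Eq(Attr("i"), Lit(5)), Eq(Attr("j"), Add(Attr("i"), Lit(3))))
    println(ConstProp.propagate(pred)) // j is rewritten to Lit(8); i = 5 is kept
  }
}
```

Applied to a join condition such as `a.i = 1 AND b.i = a.i`, the same rewrite would yield `b.i = 1` and remove the equi-join key, which is exactly why the rule is restricted to Filter.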





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17993
  
**[Test build #77134 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77134/testReport)**
 for PR 17993 at commit 
[`b8c4147`](https://github.com/apache/spark/commit/b8c4147d3b7dd2c1d0e6b3015042271e754a18cf).





[GitHub] spark pull request #17940: [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze ma...

2017-05-20 Thread ghoto
Github user ghoto commented on a diff in the pull request:

https://github.com/apache/spark/pull/17940#discussion_r117619161
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -992,7 +992,16 @@ object Matrices {
 new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
   case sm: BSM[Double] =>
 // There is no isTranspose flag for sparse matrices in Breeze
-new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
+val nsm = if (sm.rowIndices.length > sm.activeSize) {
+  // This sparse matrix has trainling zeros.
--- End diff --

ups.





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618800
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
+    case Not(_) => true
+    case Or(_, _) => true
+    case _ =>
+      var result = false
+      expression.children.foreach {
+        case Not(_) => result = true
+        case Or(_, _) => result = true
+        case other => result = result || containsNonConjunctionPredicates(other)
+      }
+      result
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case and @ (left And right)
+        if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>
+
+        val leftEntries = left.collect {
+          case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
+          case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e)
+        }
+        val rightEntries = right.collect {
+          case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
+          case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e)
+        }
+        val constantsMap = AttributeMap(leftEntries.map(_._1) ++ rightEntries.map(_._1))
+        val predicates = (leftEntries.map(_._2) ++ rightEntries.map(_._2)).toSet
+
+        def replaceConstants(expression: Expression) = expression transform {
+          case a: AttributeReference if constantsMap.contains(a) =>
--- End diff --

Nice catch !!! I changed the logic to handle that.





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618804
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
--- End diff --

did this change





[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17993
  
**[Test build #77133 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77133/testReport)**
 for PR 17993 at commit 
[`cc026da`](https://github.com/apache/spark/commit/cc026da840714bc2f88076dbb2aafa70aa1fa0b7).





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618801
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ConstantPropagationSuite.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class ConstantPropagationSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("AnalysisNodes", Once,
+EliminateSubqueryAliases) ::
+Batch("ConstantPropagation", Once,
+  ColumnPruning,
+  ConstantPropagation,
+  ConstantFolding,
+  BooleanSimplification) :: Nil
+  }
+
+  val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+
+  private val columnA = 'a.int
+  private val columnB = 'b.int
+
+  /**
+   * Unit tests for constant propagation in expressions.
--- End diff --

did this change





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618796
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
+    case Not(_) => true
+    case Or(_, _) => true
+    case _ =>
+      var result = false
+      expression.children.foreach {
+        case Not(_) => result = true
+        case Or(_, _) => result = true
+        case other => result = result || containsNonConjunctionPredicates(other)
+      }
+      result
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case and @ (left And right)
+        if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>
+
+        val leftEntries = left.collect {
+          case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
+          case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e)
+        }
+        val rightEntries = right.collect {
+          case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
+          case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e)
+        }
+        val constantsMap = AttributeMap(leftEntries.map(_._1) ++ rightEntries.map(_._1))
+        val predicates = (leftEntries.map(_._2) ++ rightEntries.map(_._2)).toSet
+
+        def replaceConstants(expression: Expression) = expression transform {
+          case a: AttributeReference if constantsMap.contains(a) =>
+            constantsMap.get(a).getOrElse(a)
+        }
+
+        and transform {
+          case e @ EqualTo(_, _) if !predicates.contains(e) &&
+            e.references.exists(ref => constantsMap.contains(ref)) =>
--- End diff --

skipped it





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618790
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
+    case Not(_) => true
+    case Or(_, _) => true
+    case _ =>
+      var result = false
+      expression.children.foreach {
+        case Not(_) => result = true
+        case Or(_, _) => result = true
+        case other => result = result || containsNonConjunctionPredicates(other)
+      }
+      result
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case and @ (left And right)
+        if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>
+
+        val leftEntries = left.collect {
--- End diff --

sure





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618788
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
+    case Not(_) => true
+    case Or(_, _) => true
+    case _ =>
+      var result = false
+      expression.children.foreach {
+        case Not(_) => result = true
+        case Or(_, _) => result = true
+        case other => result = result || containsNonConjunctionPredicates(other)
+      }
+      result
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case and @ (left And right)
--- End diff --

did this change





[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...

2017-05-20 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/17993#discussion_r117618791
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Substitutes [[Attribute Attributes]] which can be statically evaluated with their
+ * corresponding value in conjunctive [[Expression Expressions]]
+ * eg.
+ * {{{
+ *   SELECT * FROM table WHERE i = 5 AND j = i + 3
+ *   ==>  SELECT * FROM table WHERE i = 5 AND j = 8
+ * }}}
+ */
+object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper {
+
+  def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match {
+    case Not(_) => true
+    case Or(_, _) => true
+    case _ =>
+      var result = false
+      expression.children.foreach {
+        case Not(_) => result = true
+        case Or(_, _) => result = true
+        case other => result = result || containsNonConjunctionPredicates(other)
+      }
+      result
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case q: LogicalPlan => q transformExpressionsUp {
+      case and @ (left And right)
+        if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) =>
+
+        val leftEntries = left.collect {
+          case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e)
--- End diff --

did this change





[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18040
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77128/
Test PASSed.





[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18040
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18040
  
**[Test build #77128 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77128/testReport)**
 for PR 18040 at commit 
[`5951b33`](https://github.com/apache/spark/commit/5951b3358cd676f05b46eab74fe4296e0a3991dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18037
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77129/
Test PASSed.





[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18037
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18037
  
**[Test build #77129 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77129/testReport)**
 for PR 18037 at commit 
[`a861819`](https://github.com/apache/spark/commit/a8618194b24fa254584529cc894dbabfd5aafb7e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17978
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17978
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77132/
Test PASSed.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17978
  
**[Test build #77132 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)**
 for PR 17978 at commit 
[`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17978
  
**[Test build #77132 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)**
 for PR 17978 at commit 
[`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb).





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17978
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17978
  
**[Test build #77131 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)**
 for PR 17978 at commit 
[`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17978
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77131/
Test FAILed.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17978
  
**[Test build #77131 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)**
 for PR 17978 at commit 
[`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693).





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17978
  
@holdenk  Thanks for the comment. Added the default value to the docstring. 
@felixcheung Please let me know if there is anything else needed for this 
PR. 
Thanks everyone for the review and comments! 





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17967
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17967
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77130/
Test PASSed.





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17967
  
**[Test build #77130 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)**
 for PR 17967 at commit 
[`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17967
  
**[Test build #77130 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)**
 for PR 17967 at commit 
[`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).





[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18037
  
**[Test build #77129 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77129/testReport)**
 for PR 18037 at commit 
[`a861819`](https://github.com/apache/spark/commit/a8618194b24fa254584529cc894dbabfd5aafb7e).





[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18040
  
**[Test build #77128 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77128/testReport)**
 for PR 18040 at commit 
[`5951b33`](https://github.com/apache/spark/commit/5951b3358cd676f05b46eab74fe4296e0a3991dc).





[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...

2017-05-20 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18040
  
Jenkins, ok to test





[GitHub] spark issue #18038: [MINOR][SPARKRSQL]Remove unnecessary comment in SqlBase....

2017-05-20 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18038
  
please change your title `SPARKRSQL` -> `SQL`





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77127/
Test FAILed.





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17298
  
**[Test build #77127 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77127/testReport)**
 for PR 17298 at commit 
[`e0c3a6b`](https://github.com/apache/spark/commit/e0c3a6b778f70d7dec94484a187f9de46ab3b11c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17940: [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze ma...

2017-05-20 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17940#discussion_r117592116
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala 
---
@@ -992,7 +992,16 @@ object Matrices {
 new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
   case sm: BSM[Double] =>
 // There is no isTranspose flag for sparse matrices in Breeze
-new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, 
sm.data)
+val nsm = if (sm.rowIndices.length > sm.activeSize) {
+  // This sparse matrix has trainling zeros.
--- End diff --

trailing
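The fix under discussion trims Breeze's CSC buffers down to `activeSize` before 
building a `SparseMatrix`. The same trimming logic can be sketched in plain 
Python (an illustration only, not Spark code; `trim_csc` is a hypothetical 
helper, and it assumes the last colPtr equals the number of stored entries, as 
valid CSC data requires):

```python
def trim_csc(col_ptrs, row_indices, data):
    """Drop trailing slack from CSC index/value buffers.

    Breeze's CSCMatrix may keep buffers longer than the number of
    stored entries (activeSize); for valid CSC data the last colPtr
    gives the true entry count, so anything past it is slack.
    """
    active_size = col_ptrs[-1]
    if len(row_indices) > active_size:
        row_indices = row_indices[:active_size]
        data = data[:active_size]
    return col_ptrs, row_indices, data

# A 2x2 matrix with entries at (0,0) and (1,1), plus two slots of buffer slack:
cols, rows, vals = trim_csc([0, 1, 2], [0, 1, 0, 0], [1.0, 2.0, 0.0, 0.0])
print(rows, vals)  # [0, 1] [1.0, 2.0]
```

If the buffers are already exactly `activeSize` long, the helper returns them 
unchanged, matching the conditional in the diff.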





[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-05-20 Thread szalai1
Github user szalai1 commented on the issue:

https://github.com/apache/spark/pull/17435
  
@holdenk I am happy to contribute to this project. I changed the error 
message and added a test case.





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-20 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17967
  
@HyukjinKwon @felixcheung I confirm it works for Javadoc. 

![image](https://cloud.githubusercontent.com/assets/11082368/26277962/21dbe70e-3d46-11e7-978f-e422b9122e87.png)






[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-05-20 Thread barrybecker4
Github user barrybecker4 commented on the issue:

https://github.com/apache/spark/pull/17558
  
The 4th time it failed here again:

```
- caching on disk, replicated
- caching in memory and disk, replicated *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 
3 milliseconds elapsed
  at 
org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
  at 
org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)

```





[GitHub] spark issue #18038: [MINOR][SPARKRSQL]Remove unnecessary comment in SqlBase....

2017-05-20 Thread lys0716
Github user lys0716 commented on the issue:

https://github.com/apache/spark/pull/18038
  
Sorry, it is a duplicate of https://github.com/antlr/antlr4/issues/773. But 
on second thought, the rule is still a workaround for that issue.





[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-05-20 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/17435
  
Thanks for working on this. I feel like the error message could maybe be 
improved to suggest what the user should be doing? It would be nicer to 
eventually not have this depend on DataType since we don't have this in the 
Scala version as @HyukjinKwon pointed out, but I think this could be a good 
improvement for now.





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-20 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/17978
  
One minor optional comment, but not a blocker so LGTM (although if you 
decide to update the docstring LGTM pending tests).





[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...

2017-05-20 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17978#discussion_r117612782
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, HasHandleInvalid,
 >>> loadedInverter = IndexToString.load(indexToStringPath)
 >>> loadedInverter.getLabels() == inverter.getLabels()
 True
+>>> stringIndexer.getStringOrderType()
+'frequencyDesc'
+>>> stringIndexer = StringIndexer(inputCol="label", 
outputCol="indexed", handleInvalid='error',
+... stringOrderType="alphabetDesc")
+>>> model = stringIndexer.fit(stringIndDf)
+>>> td = model.transform(stringIndDf)
+>>> sorted(set([(i[0], i[1]) for i in td.select(td.id, 
td.indexed).collect()]),
+... key=lambda x: x[0])
+[(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]
 
 .. versionadded:: 1.4.0
 """
 
+stringOrderType = Param(Params._dummy(), "stringOrderType",
+"How to order labels of string column. The 
first label after " +
+"ordering is assigned an index of 0. Supported 
options: " +
+"frequencyDesc, frequencyAsc, alphabetDesc, 
alphabetAsc.",
--- End diff --

I know we're mixed on doing this, but I like including the default value in 
the docstring, makes the documentation closer to the Scala doc and makes it 
easier to read without having to refer to the ScalaDoc.
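For readers without a PySpark build handy, the ordering semantics the 
`stringOrderType` param docstring describes can be sketched in plain Python 
(a minimal illustration only; `order_labels` is a hypothetical helper, not a 
Spark API, and alphabetical tie-breaking for equal frequencies is an 
assumption made for determinism):

```python
from collections import Counter

def order_labels(values, string_order_type="frequencyDesc"):
    """Assign each distinct label an index; the first label after
    ordering gets 0.0, as the param docstring describes."""
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; alphabetical tie-break (an assumption here).
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
    elif string_order_type == "frequencyAsc":
        ordered = sorted(counts, key=lambda s: (counts[s], s))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError("unsupported stringOrderType: %s" % string_order_type)
    return {label: float(i) for i, label in enumerate(ordered)}

# Labels consistent with the doctest output above (ids 0..5 -> a, b, c, a, a, c):
print(order_labels(["a", "b", "c", "a", "a", "c"], "alphabetDesc"))
# {'c': 0.0, 'b': 1.0, 'a': 2.0} -- so id 0 ('a') maps to 2.0, as in the doctest
```

With `alphabetDesc`, the labels sort c, b, a, so 'c' gets index 0.0 and 'a' gets 
2.0, which matches the `[(0, 2.0), (1, 1.0), (2, 0.0), ...]` result shown in 
the diff.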





[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-05-20 Thread barrybecker4
Github user barrybecker4 commented on the issue:

https://github.com/apache/spark/pull/17558
  
The 3rd time I ran it, it ran for 42 minutes and failed further on, in the 
catalyst tests. Like you say, it does seem that the tests are flaky, but why? 
The failures seem so random.

```
- GenerateOrdering with FloatType
- GenerateOrdering with ShortType
- SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB *** 
FAILED ***
  com.google.common.util.concurrent.ExecutionError: 
java.lang.StackOverflowError
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:889)
  at 
org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
  at 
org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
  ...
  Cause: java.lang.StackOverflowError:
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  ...

```





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17298
  
**[Test build #77127 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77127/testReport)**
 for PR 17298 at commit 
[`e0c3a6b`](https://github.com/apache/spark/commit/e0c3a6b778f70d7dec94484a187f9de46ab3b11c).





[GitHub] spark issue #18041: [SPARK-20816][CORE] MetricsConfig doen't trim the proper...

2017-05-20 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/18041
  
@srowen It's not a normal class-not-found case, and I do know what 
happened here. What I'm pointing out is that a whitespace at the end of the 
class name causes a ClassNotFoundException, which is very confusing to 
users. If the name could be trimmed before reflection, that would be much 
better, I think.
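The trimming being proposed can be illustrated outside Spark: Python's import 
machinery fails in the same confusing way on a trailing space, and stripping 
the configured name before the lookup avoids it (a sketch under that 
assumption; `load_class` is a hypothetical helper, not Spark's MetricsConfig 
code):

```python
import importlib

def load_class(name):
    """Resolve 'pkg.module.ClassName' to a class object, trimming
    whitespace first. Without the strip(), a stray trailing space from
    a properties file surfaces as a class-not-found style error even
    though the class clearly exists.
    """
    module_name, _, class_name = name.strip().rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

# Trailing space in the configured name is tolerated after trimming:
print(load_class("collections.Counter "))  # <class 'collections.Counter'>
```

The analogous fix in MetricsConfig would trim the property value before 
handing it to reflection.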





[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17899
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77126/
Test PASSed.





[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17899
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17899
  
**[Test build #77126 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77126/testReport)**
 for PR 17899 at commit 
[`f472bfe`](https://github.com/apache/spark/commit/f472bfecfcc008b3837aa1ecb903e02bbf665c9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...

2017-05-20 Thread barrybecker4
Github user barrybecker4 commented on the issue:

https://github.com/apache/spark/pull/17558
  
I ran it again, and got a different failure this time. Still in the core 
module, but I'm not sure if it's before or after the tests that failed the 
first time.
```
- caching in memory, replicated
- caching in memory, serialized, replicated
- caching on disk, replicated *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 
3 milliseconds elapsed
  at 
org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
  at 
org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at 
org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
```
I'll try again. It takes a long time to run each time: over 20 minutes just 
to get to the failed test, and that's not even 1/3 of the way through all the 
tests.






[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77121/
Test FAILed.





[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...

2017-05-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17758
  
@gatorsmile ping





[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #77121 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77121/testReport)**
 for PR 18029 at commit 
[`b71a8d6`](https://github.com/apache/spark/commit/b71a8d621ff048958dd5f10ef16cf5989026ed5f).
 * This patch **fails from timeout after a configured wait of \`250m\`**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaChiSquareTestExample `
  * `public class JavaCorrelationExample `
  * `case class Cot(child: Expression)`





[GitHub] spark issue #17400: [SPARK-19981][SQL] Update output partitioning info. when...

2017-05-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17400
  
@gatorsmile ping





[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17150
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17150
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77124/
Test PASSed.





[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10

2017-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17150
  
**[Test build #77124 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77124/testReport)**
 for PR 17150 at commit 
[`13e1d7b`](https://github.com/apache/spark/commit/13e1d7b2876da622904fd4e3e933039b3636ce7e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and ...

2017-05-20 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/17982





[GitHub] spark issue #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and zinc to...

2017-05-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17982
  
Darn. I don't know if this is going to work. I'm closing this for now.





[GitHub] spark pull request #18032: [SPARK-20806][DEPLOY] Launcher: redundant check f...

2017-05-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18032





[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...

2017-05-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/18032
  
Merged to master





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117610528
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
+val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+if (numNonEmptyBlocks > 0) {
+  i = 0
+  while (i < totalNumBlocks) {
+if (uncompressedSizes(i) > threshold) {
+  hugeBlockSizesArray += Tuple2(i, 
MapStatus.compressSize(uncompressedSizes(i)))
+
+}
+i += 1
+  }
+}
 emptyBlocks.trim()
 emptyBlocks.runOptimize()
-new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, 
avgSize)
+new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, 
avgSize,
--- End diff --

In the current change, if almost all blocks are huge (that is, it is not a 
skew case), we won't mark the blocks as huge ones. Will we then still fetch 
them into memory?
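The concern above can be seen directly in the threshold selection from the quoted diff. Below is a hedged sketch (helper and parameter names are invented for illustration; this is not the PR's exact code): a block counts as huge only if it exceeds the larger of the fixed byte threshold and `timesAvg * avgSize`, so when most blocks are uniformly large, the average-based threshold grows with them and few blocks get marked.

```scala
object HugeSketch {
  // Invented helper mirroring the max(threshold1, threshold2) selection in
  // the quoted diff. Returns the indices of blocks considered "huge".
  def markHuge(sizes: Array[Long], fixedThreshold: Long, timesAvg: Double): Seq[Int] = {
    val nonEmpty = sizes.filter(_ > 0)
    // Average over non-empty blocks, as in HighlyCompressedMapStatus.
    val avgSize = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
    // A block is huge only if it exceeds BOTH thresholds (via max).
    val threshold = math.max(fixedThreshold, (timesAvg * avgSize).toLong)
    sizes.indices.filter(i => sizes(i) > threshold)
  }
}
```

With a skewed input like `Array(10L, 10L, 10L, 1000L)` only the outlier is marked, but with uniformly large sizes like `Array(100L, 100L, 100L, 100L)` the average-based threshold rises to 200 and nothing is marked, which is exactly the scenario being asked about.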





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117610423
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
 }
 
 /**
- * A [[MapStatus]] implementation that only stores the average size of 
non-empty blocks,
- * plus a bitmap for tracking which blocks are empty.
+ * A [[MapStatus]] implementation that stores the accurate size of huge 
blocks, which are larger
+ * than both spark.shuffle.accurateBlockThreshold and
+ * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It 
stores the
+ * average size of other non-empty blocks, plus a bitmap for tracking 
which blocks are empty.
  *
  * @param loc location where the task is being executed
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizes sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+@transient private var hugeBlockSizes: Map[Int, Byte])
--- End diff --

Yes, I think it makes sense to add a bitmap for the huge blocks, but I'm a 
little bit hesitant. I still prefer to keep `hugeBlockSizes` more independent 
from the upper-level logic. In addition, accurate block sizes can also have a 
positive effect on pending requests (e.g. `spark.reducer.maxSizeInFlight` can 
control the size of pending requests better).
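The `Map[Int, Byte]` type of `hugeBlockSizes` in the diff works because block sizes are stored log-compressed into a single byte. The following is a hedged sketch of that idea in the spirit of `MapStatus.compressSize`, not the exact Spark implementation (the object and method names here are invented):

```scala
object SizeCodec {
  // Log-scale encoding: a Long size fits in one byte as roughly
  // ceil(log_{1.1}(size)), trading precision for space. Decoding
  // recovers an upper bound on the original size, never an underestimate.
  private val base = 1.1
  def compress(size: Long): Byte = {
    if (size <= 0L) 0.toByte
    else if (size <= 1L) 1.toByte
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(base)).toInt).toByte
  }
  def decompress(b: Byte): Long =
    if (b == 0) 0L else math.pow(base, b & 0xFF).toLong
}
```

A round trip such as `SizeCodec.decompress(SizeCodec.compress(1000L))` yields a value at or above the true size, which is the safe direction for deciding whether a fetched block fits in memory.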





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-20 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117610285
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
+val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+if (numNonEmptyBlocks > 0) {
+  i = 0
+  while (i < totalNumBlocks) {
+if (uncompressedSizes(i) > threshold) {
+  hugeBlockSizesArray += Tuple2(i, 
MapStatus.compressSize(uncompressedSizes(i)))
+
+}
+i += 1
+  }
+}
 emptyBlocks.trim()
 emptyBlocks.runOptimize()
-new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, 
avgSize)
+new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, 
avgSize,
--- End diff --

@viirya Thanks a lot for taking the time to look into this PR :)

> remove the huge blocks from the numerator in that calculation so that you 
more accurately size the smaller blocks

Yes, I think having accurate sizes for the smaller blocks is a really good 
idea. But because I'm proposing two configs (`spark.shuffle.accurateBlockThreshold` 
and `spark.shuffle.accurateBlockThresholdByTimesAverage`) in the current change, I 
would have to compute the average twice: 1) the average calculated including the 
huge blocks, so I can filter the huge blocks out; 2) the average calculated 
without the huge blocks, so I can get accurate sizes for the smaller blocks. A 
little bit complicated, right? How about removing 
`spark.shuffle.accurateBlockThresholdByTimesAverage`? That would simplify the 
logic. @cloud-fan Any ideas about this?
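The two-pass average described above can be sketched as follows. This is an illustration with invented names, not the PR's code: pass 1 uses the overall average to identify huge blocks; pass 2 recomputes the average over the remaining smaller blocks only, so it is not inflated by the outliers.

```scala
object TwoPassSketch {
  // Returns (average of the non-huge, non-empty blocks, indices of huge blocks).
  def twoPassAverage(sizes: Array[Long], fixedThreshold: Long, timesAvg: Double): (Long, Set[Int]) = {
    val nonEmpty = sizes.filter(_ > 0)
    // Pass 1: average over all non-empty blocks, used only to find outliers.
    val avg1 = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
    val threshold = math.max(fixedThreshold, (timesAvg * avg1).toLong)
    val hugeIdx = sizes.indices.filter(i => sizes(i) > threshold).toSet
    // Pass 2: average over the remaining non-empty blocks only.
    val small = sizes.indices.collect { case i if !hugeIdx(i) && sizes(i) > 0 => sizes(i) }
    val avg2 = if (small.nonEmpty) small.sum / small.length else 0L
    (avg2, hugeIdx)
  }
}
```

For `Array(10L, 10L, 10L, 1000L)` the single-pass average is 257, while the second pass reports 10 for the small blocks and singles out the outlier, which shows why the extra pass buys accuracy at the cost of the complexity discussed above.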





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77122/
Test FAILed.





[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...

2017-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17298
  
Merged build finished. Test FAILed.




