[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18047 **[Test build #77139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77139/testReport)** for PR 18047 at commit [`0218578`](https://github.com/apache/spark/commit/0218578eb23fd6a4eb40674009a2791698411607). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18047 **[Test build #77138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77138/testReport)** for PR 18047 at commit [`043d837`](https://github.com/apache/spark/commit/043d8376350ad163d00fb154e551387c22d6dac3). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class StringReplace(srcExpr: Expression, searchExpr: Expression, replaceExpr: Expression)`
[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18047 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77138/ Test FAILed.
[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18047 Merged build finished. Test FAILed.
[GitHub] spark issue #18047: [SPARK-20750][SQL] Built-in SQL Function Support - REPLA...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18047 **[Test build #77138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77138/testReport)** for PR 18047 at commit [`043d837`](https://github.com/apache/spark/commit/043d8376350ad163d00fb154e551387c22d6dac3).
[GitHub] spark pull request #18047: [SPARK-20750][SQL] Built-in SQL Function Support ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/18047 [SPARK-20750][SQL] Built-in SQL Function Support - REPLACE ## What changes were proposed in this pull request? This PR adds the built-in SQL function `REPLACE(str, search[, replace])`. `REPLACE()` returns the string with all occurrences of the search string replaced by the given replacement string. ## How was this patch tested? Added new test suites. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-20750 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18047.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18047 commit 043d8376350ad163d00fb154e551387c22d6dac3 Author: Kazuaki Ishizaki Date: 2017-05-21T05:37:22Z initial commit
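The semantics described above (replace all occurrences; the third argument is optional, in which case matches are removed) can be sketched in a few lines. This is an illustrative Python sketch, not the Scala implementation from the PR, and the function name `sql_replace` is invented here:

```python
def sql_replace(src: str, search: str, replace: str = "") -> str:
    # Replace every occurrence of `search` in `src` with `replace`.
    # When the third argument is omitted, matches are simply deleted.
    return src.replace(search, replace)

print(sql_replace("ABCabc", "abc", "DEF"))  # ABCDEF
print(sql_replace("ABCabc", "abc"))         # ABC
```

Note that edge cases such as an empty search string may differ between Python's `str.replace` and SQL dialects; this sketch only illustrates the common case.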
[GitHub] spark issue #18019: [SPARK-20748][SQL] Add built-in SQL function CH[A]R.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18019 **[Test build #77137 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77137/testReport)** for PR 18019 at commit [`e003924`](https://github.com/apache/spark/commit/e0039247dd24559d993b7bbc4cd321f9c9198459).
[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user barrybecker4 commented on the issue: https://github.com/apache/spark/pull/17558 It continues to fail with one of the above errors. Here is the command I use to build: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.5 package
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77136/ Test PASSed.
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Merged build finished. Test PASSed.
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #77136 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77136/testReport)** for PR 18029 at commit [`ff9a586`](https://github.com/apache/spark/commit/ff9a58669853ae0508d3ef599947d15a92e1f712). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class KinesisInitialPositionInStream (`
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #77136 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77136/testReport)** for PR 18029 at commit [`ff9a586`](https://github.com/apache/spark/commit/ff9a58669853ae0508d3ef599947d15a92e1f712).
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user yssharma commented on the issue: https://github.com/apache/spark/pull/18029 Commit https://github.com/apache/spark/commit/424550c8450937f78ce608ff7b18e46f41478a8a should fix the timeouts mentioned in the https://github.com/apache/spark/commit/b71a8d621ff048958dd5f10ef16cf5989026ed5f commit.
[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...
Github user johnc1231 commented on a diff in the pull request: https://github.com/apache/spark/pull/17459#discussion_r117621336 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala --- @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") ( */ @Since("1.3.0") def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = { -// TODO: This implementation may be optimized -toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock) +require(rowsPerBlock > 0, + s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock") +require(colsPerBlock > 0, + s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock") + +val m = numRows() +val n = numCols() +val lastRowBlockIndex = m / rowsPerBlock --- End diff -- Good point. Replaced the word "last" with "remainder" and added a small clarifying comment.
[GitHub] spark issue #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and zinc to...
Github user som-snytt commented on the issue: https://github.com/apache/spark/pull/17982 Thanks for the effort. I'll take a hack soon. If it's hopeless, I'll at least try to track developments with the new REPL API.
[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18046 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77135/ Test PASSed.
[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18046 Merged build finished. Test PASSed.
[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18046 **[Test build #77135 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77135/testReport)** for PR 18046 at commit [`82ef305`](https://github.com/apache/spark/commit/82ef30599844a098c3059cca480bbd1b709652c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class BitLength(child: Expression) extends UnaryExpression with ImplicitCastInputTypes ` * `case class OctetLength(child: Expression) extends UnaryExpression with ImplicitCastInputTypes `
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117621035 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus( } /** - * A [[MapStatus]] implementation that only stores the average size of non-empty blocks, - * plus a bitmap for tracking which blocks are empty. + * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger + * than both spark.shuffle.accurateBlockThreshold and + * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the + * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty. * * @param loc location where the task is being executed * @param numNonEmptyBlocks the number of non-empty blocks * @param emptyBlocks a bitmap tracking which blocks are empty * @param avgSize average size of the non-empty blocks + * @param hugeBlockSizes sizes of huge blocks by their reduceId. */ private[spark] class HighlyCompressedMapStatus private ( private[this] var loc: BlockManagerId, private[this] var numNonEmptyBlocks: Int, private[this] var emptyBlocks: RoaringBitmap, -private[this] var avgSize: Long) +private[this] var avgSize: Long, +@transient private var hugeBlockSizes: Map[Int, Byte]) --- End diff -- The control of `spark.reducer.maxSizeInFlight` is not a big problem. It seems to me that any block considered huge will exceed `maxSizeInFlight` and can't be fetched in parallel. We don't actually need to know the accurate size of huge blocks; we just need to know that a block is huge and larger than `maxSizeInFlight`.
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117620949 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus { } else { 0 } +val threshold1 = Option(SparkEnv.get) + .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD)) + .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get) +val threshold2 = avgSize * Option(SparkEnv.get) + .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE)) + .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get) +val threshold = math.max(threshold1, threshold2) +val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]() +if (numNonEmptyBlocks > 0) { + i = 0 + while (i < totalNumBlocks) { +if (uncompressedSizes(i) > threshold) { + hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i))) + +} +i += 1 + } +} emptyBlocks.trim() emptyBlocks.runOptimize() -new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize) +new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize, --- End diff -- I'd tend to have just one flag and simplify the configuration.
[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Rewrote toBlockMatrix method...
Github user johnc1231 commented on a diff in the pull request: https://github.com/apache/spark/pull/17459#discussion_r117620801 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala --- @@ -108,8 +108,64 @@ class IndexedRowMatrix @Since("1.0.0") ( */ @Since("1.3.0") def toBlockMatrix(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = { -// TODO: This implementation may be optimized -toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock) +require(rowsPerBlock > 0, + s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock") +require(colsPerBlock > 0, + s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock") + +val m = numRows() +val n = numCols() +val lastRowBlockIndex = m / rowsPerBlock +val lastColBlockIndex = n / colsPerBlock +val lastRowBlockSize = (m % rowsPerBlock).toInt +val lastColBlockSize = (n % colsPerBlock).toInt +val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt +val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt + +val blocks = rows.flatMap { ir: IndexedRow => + val blockRow = ir.index / rowsPerBlock + val rowInBlock = ir.index % rowsPerBlock + + ir.vector match { +case SparseVector(size, indices, values) => + indices.zip(values).map { case (index, value) => +val blockColumn = index / colsPerBlock --- End diff -- So it is true that IndexedRowMatrix could have a Long number of rows, but BlockMatrix is backed by an RDD of ((Int, Int), Matrix), so we're limited by that. I can just add a check that computes whether it's possible to make a BlockMatrix from the given IndexedRowMatrix.
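The block-grid arithmetic discussed in the two `toBlockMatrix` review comments above (ceiling division for the block counts, modulus for the remainder-block sizes, integer division for mapping an entry to its block) can be sketched standalone. This is an illustrative Python sketch with invented helper names, not the Scala code from the PR:

```python
def block_grid(m, n, rows_per_block, cols_per_block):
    # Ceiling division gives the number of blocks in each dimension;
    # the modulus gives the size of the (possibly empty) remainder block.
    num_row_blocks = -(-m // rows_per_block)   # ceil(m / rows_per_block)
    num_col_blocks = -(-n // cols_per_block)
    rem_row_size = m % rows_per_block          # 0 when m divides evenly
    rem_col_size = n % cols_per_block
    return num_row_blocks, num_col_blocks, rem_row_size, rem_col_size

def entry_to_block(row_index, col_index, rows_per_block, cols_per_block):
    # Which block a (row, col) entry lands in, and its offset inside that block.
    return (row_index // rows_per_block, col_index // cols_per_block,
            row_index % rows_per_block, col_index % cols_per_block)

print(block_grid(10, 7, 4, 3))        # (3, 3, 2, 1)
print(entry_to_block(9, 6, 4, 3))     # (2, 2, 1, 0)
```

The point raised in the review also shows up here: `row_index` may be a 64-bit value in an `IndexedRowMatrix`, but the block coordinates must fit in an `Int` since `BlockMatrix` keys are `(Int, Int)`.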
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17993 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77134/ Test FAILed.
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17993 **[Test build #77134 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77134/testReport)** for PR 17993 at commit [`b8c4147`](https://github.com/apache/spark/commit/b8c4147d3b7dd2c1d0e6b3015042271e754a18cf). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17993 Merged build finished. Test FAILed.
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117620188 --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala --- @@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus { } else { 0 } +val threshold1 = Option(SparkEnv.get) + .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD)) + .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get) +val threshold2 = avgSize * Option(SparkEnv.get) + .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE)) + .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get) +val threshold = math.max(threshold1, threshold2) +val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]() +if (numNonEmptyBlocks > 0) { + i = 0 + while (i < totalNumBlocks) { +if (uncompressedSizes(i) > threshold) { + hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i))) + +} +i += 1 + } +} emptyBlocks.trim() emptyBlocks.runOptimize() -new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize) +new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize, --- End diff -- With the default values (spark.shuffle.accurateBlockThreshold=100M and spark.shuffle.accurateBlockThresholdByTimesAverage=2), yes. But the user can make it more strict by setting (spark.shuffle.accurateBlockThreshold=0 and spark.shuffle.accurateBlockThresholdByTimesAverage=1).
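The two-knob threshold quoted in the diff above reduces to taking the max of the absolute threshold and the average-size multiple, then recording accurate sizes only for blocks above that max. A minimal Python sketch of that logic, with invented function names and the default values mentioned in the comment as assumptions:

```python
def accurate_size_threshold(avg_size,
                            accurate_block_threshold=100 * 1024 * 1024,
                            times_average=2):
    # A block is recorded with its accurate size only when it exceeds the
    # max of the absolute knob and timesAverage * avgSize.
    return max(accurate_block_threshold, avg_size * times_average)

def huge_block_sizes(block_sizes, threshold):
    # Map reduceId -> size for blocks strictly above the threshold,
    # mirroring the hugeBlockSizesArray loop in the diff.
    return {i: size for i, size in enumerate(block_sizes) if size > threshold}

# With the defaults and a 10 MiB average, the absolute knob dominates:
print(accurate_size_threshold(10 * 1024 * 1024))   # 104857600 (100 MiB)
# The stricter setting from the comment (0, 1) collapses to the average itself:
print(accurate_size_threshold(10 * 1024 * 1024, 0, 1))
```

This matches the reviewer's point: with defaults, only blocks over 100 MiB get accurate sizes, while (0, 1) records every block larger than the average.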
[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18046 It seems you have the wrong JIRA number. Also, you need to add tests in `SQLQueryTestSuite`. Thanks.
[GitHub] spark issue #18046: [SPARK-20746][SQL] Built-in SQL Function Support - all v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18046 **[Test build #77135 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77135/testReport)** for PR 18046 at commit [`82ef305`](https://github.com/apache/spark/commit/82ef30599844a098c3059cca480bbd1b709652c8).
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17993 Merged build finished. Test FAILed.
[GitHub] spark pull request #18046: [SPARK-20746][SQL] Built-in SQL Function Support ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/18046 [SPARK-20746][SQL] Built-in SQL Function Support - all variants of LEN[GTH] ## What changes were proposed in this pull request? This PR adds the built-in SQL functions `BIT_LENGTH()`, `CHAR_LENGTH()`, and `OCTET_LENGTH()`. `BIT_LENGTH()` returns the bit length of the given string or binary expression. `CHAR_LENGTH()` returns the length of the given string or binary expression (i.e. equal to `LENGTH()`). `OCTET_LENGTH()` returns the byte length of the given string or binary expression. ## How was this patch tested? Added new test suites for these three functions. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-20749 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18046.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18046 commit 82ef30599844a098c3059cca480bbd1b709652c8 Author: Kazuaki Ishizaki Date: 2017-05-20T23:08:36Z initial commit
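The relationship between the three length variants described above (bits vs. bytes vs. characters, where they differ only for multi-byte characters) can be illustrated with a short Python sketch. The function names and the UTF-8 encoding choice are assumptions for illustration, not the PR's Scala implementation:

```python
def char_length(s):
    # Characters for strings, bytes for binary (same result as LENGTH()).
    return len(s)

def octet_length(s):
    # Byte length: strings are measured in their UTF-8 encoding.
    return len(s.encode("utf-8")) if isinstance(s, str) else len(s)

def bit_length(s):
    # Bit length is always eight times the byte length.
    return octet_length(s) * 8

# "héllo" has 5 characters but 6 UTF-8 bytes ("é" encodes as 2 bytes):
print(char_length("héllo"))   # 5
print(octet_length("héllo"))  # 6
print(bit_length("héllo"))    # 48
```

For pure-ASCII input the three functions differ only by the factor of 8; multi-byte characters are where `CHAR_LENGTH` and `OCTET_LENGTH` diverge.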
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17993 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77133/ Test FAILed.
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17993 **[Test build #77133 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77133/testReport)** for PR 17993 at commit [`cc026da`](https://github.com/apache/spark/commit/cc026da840714bc2f88076dbb2aafa70aa1fa0b7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117619380 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,62 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + * + * Approach used: + * - Start from AND operator as the root + * - Get all the children conjunctive predicates which are EqualTo / EqualNullSafe such that they + * don't have a `NOT` or `OR` operator in them + * - Populate a mapping of attribute => constant value by looking at all the equals predicates + * - Using this mapping, replace occurrence of the attributes with the corresponding constant values + * in the AND node. + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression.find { +case _: Not | _: Or => true +case _ => false + }.isDefined + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case f: Filter => f transformExpressionsUp { --- End diff -- I was initially doing this for the entire logical plan, but I have now switched to doing it only for the Filter operator. Reason: doing this for the entire logical plan would mess up JOIN predicates, e.g. ``` SELECT * FROM a JOIN b ON a.i = 1 AND b.i = a.i => SELECT * FROM a JOIN b ON a.i = 1 AND b.i = 1 ``` ... the result is a Cartesian product and Spark fails (asking to set a config). In the case of OUTER JOINs, changing the join predicates might cause a regression.
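The rule discussed above harvests attribute-to-literal equalities from a conjunction and substitutes them into the remaining predicates. A toy Python sketch of that idea for the `i = 5 AND j = i + 3` example (hypothetical representation with tuples as expression trees; all names here are made up, this is not the Catalyst code):

```python
# A conjunction is a list of predicates; each predicate/expression is a
# nested tuple like ("=", "i", 5) or ("+", "i", 3). Attributes are strings,
# literals are ints.

def collect_constants(conjuncts):
    """Build an attribute -> literal map from top-level equality predicates."""
    consts = {}
    for p in conjuncts:
        if p[0] == "=" and isinstance(p[1], str) and isinstance(p[2], int):
            consts[p[1]] = p[2]
    return consts

def substitute(expr, consts):
    """Replace attribute references with their known constants, bottom-up."""
    if isinstance(expr, str):
        return consts.get(expr, expr)
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(substitute(a, consts) for a in args))
    return expr  # literal

def propagate(conjuncts):
    consts = collect_constants(conjuncts)
    # Keep the defining equalities untouched (otherwise i = 5 would become
    # 5 = 5), mirroring how the rule skips the predicates it harvested.
    return [p if p[0] == "=" and p[1] in consts and p[2] == consts.get(p[1])
            else substitute(p, consts)
            for p in conjuncts]

# WHERE i = 5 AND j = i + 3   ==>   WHERE i = 5 AND j = 5 + 3
before = [("=", "i", 5), ("=", "j", ("+", "i", 3))]
after = propagate(before)
print(after)  # [('=', 'i', 5), ('=', 'j', ('+', 5, 3))]
```

Folding `('+', 5, 3)` down to `8` is a separate step, which is why the PR runs `ConstantPropagation` before `ConstantFolding` in its optimizer batch.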
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17993 **[Test build #77134 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77134/testReport)** for PR 17993 at commit [`b8c4147`](https://github.com/apache/spark/commit/b8c4147d3b7dd2c1d0e6b3015042271e754a18cf).
[GitHub] spark pull request #17940: [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze ma...
Github user ghoto commented on a diff in the pull request: https://github.com/apache/spark/pull/17940#discussion_r117619161 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -992,7 +992,16 @@ object Matrices { new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose) case sm: BSM[Double] => // There is no isTranspose flag for sparse matrices in Breeze -new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data) +val nsm = if (sm.rowIndices.length > sm.activeSize) { + // This sparse matrix has trainling zeros. --- End diff -- ups.
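The diff above handles a Breeze sparse matrix whose index and data buffers are over-allocated beyond `activeSize` (trailing unused slots). A minimal Python sketch of the truncation idea with toy arrays (an assumed illustration of the bug fix, not the Breeze/MLlib API):

```python
# Toy CSC-style buffers: the builder over-allocated, so only the first
# `active_size` entries of row_indices/data are real stored values.
row_indices = [0, 2, 1, 0, 0]          # trailing slots are unused
data        = [1.0, 3.0, 2.0, 0.0, 0.0]
active_size = 3

# The fix: truncate the buffers to the active size before constructing
# the target sparse matrix, so no phantom trailing entries leak through.
if len(row_indices) > active_size:
    row_indices = row_indices[:active_size]
    data = data[:active_size]

print(row_indices, data)  # [0, 2, 1] [1.0, 3.0, 2.0]
```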
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618800 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { +case Not(_) => true +case Or(_, _) => true +case _ => + var result = false + expression.children.foreach { +case Not(_) => result = true +case Or(_, _) => result = true +case other => result = result || containsNonConjunctionPredicates(other) + } + result + } + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + case and @ (left And right) +if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) => + +val leftEntries = left.collect { + case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e) + case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e) +} +val rightEntries = right.collect { + case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e) + case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e) +} +val constantsMap = AttributeMap(leftEntries.map(_._1) ++ rightEntries.map(_._1)) +val predicates = (leftEntries.map(_._2) ++ rightEntries.map(_._2)).toSet + +def replaceConstants(expression: Expression) = expression transform { + case a: AttributeReference if constantsMap.contains(a) => --- End diff -- Nice catch !!! 
I changed the logic to handle that.
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618804 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { --- End diff -- did this change
[GitHub] spark issue #17993: [SPARK-20758][SQL] Add Constant propagation optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17993 **[Test build #77133 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77133/testReport)** for PR 17993 at commit [`cc026da`](https://github.com/apache/spark/commit/cc026da840714bc2f88076dbb2aafa70aa1fa0b7).
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618801 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ConstantPropagationSuite.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases +import org.apache.spark.sql.catalyst.dsl.expressions._ +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.RuleExecutor + +class ConstantPropagationSuite extends PlanTest { + + object Optimize extends RuleExecutor[LogicalPlan] { +val batches = + Batch("AnalysisNodes", Once, +EliminateSubqueryAliases) :: +Batch("ConstantPropagation", Once, + ColumnPruning, + ConstantPropagation, + ConstantFolding, + BooleanSimplification) :: Nil + } + + val testRelation = LocalRelation('a.int, 'b.int, 'c.int) + + private val columnA = 'a.int + private val columnB = 'b.int + + /** + * Unit tests for constant propagation in expressions. --- End diff -- did this change
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618796 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { +case Not(_) => true +case Or(_, _) => true +case _ => + var result = false + expression.children.foreach { +case Not(_) => result = true +case Or(_, _) => result = true +case other => result = result || containsNonConjunctionPredicates(other) + } + result + } + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + case and @ (left And right) +if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) => + +val leftEntries = left.collect { + case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e) + case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e) +} +val rightEntries = right.collect { + case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e) + case e @ EqualTo(left: Literal, right: AttributeReference) => ((right, left), e) +} +val constantsMap = AttributeMap(leftEntries.map(_._1) ++ rightEntries.map(_._1)) +val predicates = (leftEntries.map(_._2) ++ rightEntries.map(_._2)).toSet + +def replaceConstants(expression: Expression) = expression transform { + case a: AttributeReference if constantsMap.contains(a) => +constantsMap.get(a).getOrElse(a) +} + 
+and transform { + case e @ EqualTo(_, _) if !predicates.contains(e) && +e.references.exists(ref => constantsMap.contains(ref)) => --- End diff -- skipped it
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618790 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { +case Not(_) => true +case Or(_, _) => true +case _ => + var result = false + expression.children.foreach { +case Not(_) => result = true +case Or(_, _) => result = true +case other => result = result || containsNonConjunctionPredicates(other) + } + result + } + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + case and @ (left And right) +if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) => + +val leftEntries = left.collect { --- End diff -- sure
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618788 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { +case Not(_) => true +case Or(_, _) => true +case _ => + var result = false + expression.children.foreach { +case Not(_) => result = true +case Or(_, _) => result = true +case other => result = result || containsNonConjunctionPredicates(other) + } + result + } + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + case and @ (left And right) --- End diff -- did this change
[GitHub] spark pull request #17993: [SPARK-20758][SQL] Add Constant propagation optim...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/17993#discussion_r117618791 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -54,6 +54,59 @@ object ConstantFolding extends Rule[LogicalPlan] { } } +/** + * Substitutes [[Attribute Attributes]] which can be statically evaluated with their corresponding + * value in conjunctive [[Expression Expressions]] + * eg. + * {{{ + * SELECT * FROM table WHERE i = 5 AND j = i + 3 + * ==> SELECT * FROM table WHERE i = 5 AND j = 8 + * }}} + */ +object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { + + def containsNonConjunctionPredicates(expression: Expression): Boolean = expression match { +case Not(_) => true +case Or(_, _) => true +case _ => + var result = false + expression.children.foreach { +case Not(_) => result = true +case Or(_, _) => result = true +case other => result = result || containsNonConjunctionPredicates(other) + } + result + } + + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + case and @ (left And right) +if !containsNonConjunctionPredicates(left) && !containsNonConjunctionPredicates(right) => + +val leftEntries = left.collect { + case e @ EqualTo(left: AttributeReference, right: Literal) => ((left, right), e) --- End diff -- did this change
[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18040 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77128/ Test PASSed.
[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18040 Merged build finished. Test PASSed.
[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18040 **[Test build #77128 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77128/testReport)** for PR 18040 at commit [`5951b33`](https://github.com/apache/spark/commit/5951b3358cd676f05b46eab74fe4296e0a3991dc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18037 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77129/ Test PASSed.
[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18037 Merged build finished. Test PASSed.
[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18037 **[Test build #77129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77129/testReport)** for PR 18037 at commit [`a861819`](https://github.com/apache/spark/commit/a8618194b24fa254584529cc894dbabfd5aafb7e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17978 Merged build finished. Test PASSed.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17978 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77132/ Test PASSed.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17978 **[Test build #77132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)** for PR 17978 at commit [`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17978 **[Test build #77132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77132/testReport)** for PR 17978 at commit [`5bfa4dc`](https://github.com/apache/spark/commit/5bfa4dc3ba60655d9a9ce4aded935303b90d33cb).
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17978 Merged build finished. Test FAILed.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17978 **[Test build #77131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)** for PR 17978 at commit [`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17978 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77131/ Test FAILed.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17978 **[Test build #77131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77131/testReport)** for PR 17978 at commit [`2fe9432`](https://github.com/apache/spark/commit/2fe9432945f16b77916244b0cc36ff07cdb53693).
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17978 @holdenk Thanks for the comment. Added default value in docstring. @felixcheung Please let me know if there is anything else needed for this PR. Thanks everyone for the review and comments!
[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17967 Merged build finished. Test PASSed.
[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17967 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77130/ Test PASSed.
[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17967 **[Test build #77130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)** for PR 17967 at commit [`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17967 **[Test build #77130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77130/testReport)** for PR 17967 at commit [`24818a7`](https://github.com/apache/spark/commit/24818a7b77676665f9e58a88f8cc59073e368062).
[GitHub] spark issue #18037: [SPARK-20814][mesos] Restore support for spark.executor....
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18037 **[Test build #77129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77129/testReport)** for PR 18037 at commit [`a861819`](https://github.com/apache/spark/commit/a8618194b24fa254584529cc894dbabfd5aafb7e).
[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18040 **[Test build #77128 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77128/testReport)** for PR 18040 at commit [`5951b33`](https://github.com/apache/spark/commit/5951b3358cd676f05b46eab74fe4296e0a3991dc).
[GitHub] spark issue #18040: [SPARK-20815] [SPARKR] NullPointerException in RPackageU...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18040 Jenkins, ok to test
[GitHub] spark issue #18038: [MINOR][SPARKRSQL]Remove unnecessary comment in SqlBase....
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18038 please change your title `SPARKRSQL` -> `SQL`
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Merged build finished. Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77127/ Test FAILed.
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #77127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77127/testReport)** for PR 17298 at commit [`e0c3a6b`](https://github.com/apache/spark/commit/e0c3a6b778f70d7dec94484a187f9de46ab3b11c).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17940: [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze ma...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/17940#discussion_r117592116
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala ---
```
@@ -992,7 +992,16 @@ object Matrices {
       new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
     case sm: BSM[Double] =>
       // There is no isTranspose flag for sparse matrices in Breeze
-      new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
+      val nsm = if (sm.rowIndices.length > sm.activeSize) {
+        // This sparse matrix has trainling zeros.
```
--- End diff --
trailing
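For context, the fix under review truncates the Breeze sparse matrix's index and value arrays to `activeSize` when the buffers are over-allocated (i.e. carry trailing zeros). A minimal Python sketch of the same idea, using plain lists in place of Breeze's `BSM` fields (the helper name and signature here are illustrative, not Spark's API):

```python
def truncate_csc(col_ptrs, row_indices, data, active_size):
    """Drop trailing unused slots from CSC-style arrays.

    Breeze-style sparse matrices can over-allocate rowIndices/data,
    leaving trailing zero slots; only the first active_size entries
    are meaningful.
    """
    if len(row_indices) > active_size:
        # Keep only the active entries; the rest are padding.
        row_indices = row_indices[:active_size]
        data = data[:active_size]
    return col_ptrs, row_indices, data

# Over-allocated buffers: 3 active entries, 2 unused trailing slots.
ptrs, idx, vals = truncate_csc(
    [0, 2, 3], [0, 1, 0, 0, 0], [1.0, 2.0, 3.0, 0.0, 0.0], 3)
```

After truncation, `idx` and `vals` contain exactly the three active entries, which is what the constructor of a strict CSC matrix expects.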
[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user szalai1 commented on the issue: https://github.com/apache/spark/pull/17435 @holdenk I am happy to contribute to this project. I changed the error message and added a test case.
[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/17967 @HyukjinKwon @felixcheung I confirm it works for Javadoc. ![image](https://cloud.githubusercontent.com/assets/11082368/26277962/21dbe70e-3d46-11e7-978f-e422b9122e87.png)
[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user barrybecker4 commented on the issue: https://github.com/apache/spark/pull/17558 The 4th time it failed here again:
```
- caching on disk, replicated
- caching in memory and disk, replicated *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 3 milliseconds elapsed
  at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
  at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
```
[GitHub] spark issue #18038: [MINOR][SPARKRSQL]Remove unnecessary comment in SqlBase....
Github user lys0716 commented on the issue: https://github.com/apache/spark/pull/18038 Sorry, it is a duplicate of https://github.com/antlr/antlr4/issues/773. But on second thought, the rule is still a workaround for that issue.
[GitHub] spark issue #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17435 Thanks for working on this. I feel like the error message could maybe be improved to suggest what the user should be doing? It would be nicer to eventually not have this depend on DataType, since we don't have this in the Scala version, as @HyukjinKwon pointed out, but I think this could be a good improvement for now.
[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17978 One minor optional comment, but not a blocker so LGTM (although if you decide to update the docstring, LGTM pending tests).
[GitHub] spark pull request #17978: [SPARK-20736][Python] PySpark StringIndexer suppo...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/17978#discussion_r117612782
--- Diff: python/pyspark/ml/feature.py ---
```
@@ -2111,26 +2112,45 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid,
     >>> loadedInverter = IndexToString.load(indexToStringPath)
     >>> loadedInverter.getLabels() == inverter.getLabels()
     True
+    >>> stringIndexer.getStringOrderType()
+    'frequencyDesc'
+    >>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='error',
+    ...                               stringOrderType="alphabetDesc")
+    >>> model = stringIndexer.fit(stringIndDf)
+    >>> td = model.transform(stringIndDf)
+    >>> sorted(set([(i[0], i[1]) for i in td.select(td.id, td.indexed).collect()]),
+    ...        key=lambda x: x[0])
+    [(0, 2.0), (1, 1.0), (2, 0.0), (3, 2.0), (4, 2.0), (5, 0.0)]

     .. versionadded:: 1.4.0
     """
+    stringOrderType = Param(Params._dummy(), "stringOrderType",
+                            "How to order labels of string column. The first label after " +
+                            "ordering is assigned an index of 0. Supported options: " +
+                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
```
--- End diff --
I know we're mixed on doing this, but I like including the default value in the docstring; it makes the documentation closer to the Scala doc and easier to read without having to refer to the ScalaDoc.
[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user barrybecker4 commented on the issue: https://github.com/apache/spark/pull/17558 The 3rd time I ran, it ran for 42 minutes, and failed further on in catalyst tests. Like you say, it does seem that the tests are flaky, but why? The failures seem so random.
```
- GenerateOrdering with FloatType
- GenerateOrdering with ShortType
- SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB *** FAILED ***
  com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:905)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:188)
  at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:43)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:889)
  at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply$mcV$sp(OrderingSuite.scala:138)
  at org.apache.spark.sql.catalyst.expressions.OrderingSuite$$anonfun$1.apply(OrderingSuite.scala:131)
  ...
Cause: java.lang.StackOverflowError:
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
  ...
```
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17298 **[Test build #77127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77127/testReport)** for PR 17298 at commit [`e0c3a6b`](https://github.com/apache/spark/commit/e0c3a6b778f70d7dec94484a187f9de46ab3b11c).
[GitHub] spark issue #18041: [SPARK-20816][CORE] MetricsConfig doen't trim the proper...
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/18041 @srowen It's not a normal class-not-found case, and I do know what happened here. What I'm pointing out is that a whitespace at the end of the class name causes a ClassNotFound exception, which is very confusing to the user. If it can be trimmed before reflection, that would be much better, I think.
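To illustrate the failure mode being described (a hedged sketch, not Spark's MetricsConfig code): a stray trailing space in a configured class name makes the reflective lookup fail with a confusing not-found error, and trimming before the lookup avoids it. The `load_class` helper below is hypothetical.

```python
import importlib

def load_class(qualified_name):
    """Resolve a fully qualified class name reflectively.

    The name is trimmed first: a trailing space copied into a config
    file (e.g. "collections.Counter ") would otherwise fail the
    module import with a misleading not-found error.
    """
    qualified_name = qualified_name.strip()
    module_name, _, class_name = qualified_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

# Succeeds despite the surrounding whitespace in the configured value.
counter_cls = load_class(" collections.Counter ")
```

Without the `strip()`, the same call raises `ModuleNotFoundError` even though the class clearly exists, which is the user confusion the comment describes.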
[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17899 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77126/ Test PASSed.
[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17899 Merged build finished. Test PASSed.
[GitHub] spark issue #17899: [SPARK-20636] Add new optimization rule to flip adjacent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17899 **[Test build #77126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77126/testReport)** for PR 17899 at commit [`f472bfe`](https://github.com/apache/spark/commit/f472bfecfcc008b3837aa1ecb903e02bbf665c9e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17558: [SPARK-20247][CORE] Add jar but this jar is missing late...
Github user barrybecker4 commented on the issue: https://github.com/apache/spark/pull/17558 I ran it again, and got a different failure this time. Still in the core module, but I'm not sure if it's before or after the tests that failed the first time.
```
- caching in memory, replicated
- caching in memory, serialized, replicated
- caching on disk, replicated *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 3 milliseconds elapsed
  at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
  at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
```
I'll try again. It takes a long time to run each time: over 20 minutes just to get to the failed test, and that's not even 1/3 of the way through all the tests.
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Merged build finished. Test FAILed.
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18029 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77121/ Test FAILed.
[GitHub] spark issue #17758: [SPARK-20460][SQL] Make it more consistent to handle col...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17758 @gatorsmile ping
[GitHub] spark issue #18029: [SPARK-20168][WIP][DStream] Add changes to use kinesis f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18029 **[Test build #77121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77121/testReport)** for PR 18029 at commit [`b71a8d6`](https://github.com/apache/spark/commit/b71a8d621ff048958dd5f10ef16cf5989026ed5f).
* This patch **fails from timeout after a configured wait of \`250m\`**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class JavaChiSquareTestExample `
  * `public class JavaCorrelationExample `
  * `case class Cot(child: Expression)`
[GitHub] spark issue #17400: [SPARK-19981][SQL] Update output partitioning info. when...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17400 @gatorsmile ping
[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17150 Merged build finished. Test PASSed.
[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17150 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77124/ Test PASSed.
[GitHub] spark issue #17150: [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17150 **[Test build #77124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77124/testReport)** for PR 17150 at commit [`13e1d7b`](https://github.com/apache/spark/commit/13e1d7b2876da622904fd4e3e933039b3636ce7e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and ...
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/17982
[GitHub] spark issue #17982: [SPARK-20395][BUILD] Update Scala to 2.11.11 and zinc to...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17982 Darn. I don't know if this is going to work. I'm closing this for now.
[GitHub] spark pull request #18032: [SPARK-20806][DEPLOY] Launcher: redundant check f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18032
[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18032 Merged to master
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117610528

--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
     } else {
       0
     }
+    val threshold1 = Option(SparkEnv.get)
+      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+    val threshold2 = avgSize * Option(SparkEnv.get)
+      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+    val threshold = math.max(threshold1, threshold2)
+    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+    if (numNonEmptyBlocks > 0) {
+      i = 0
+      while (i < totalNumBlocks) {
+        if (uncompressedSizes(i) > threshold) {
+          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
+        }
+        i += 1
+      }
+    }
     emptyBlocks.trim()
     emptyBlocks.runOptimize()
-    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
--- End diff --

With the current change, if almost all blocks are huge, the average is high as well, so it doesn't look like a skew case and we won't mark any blocks as huge. Then we will still fetch them into memory?
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117610423

--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
 }

 /**
- * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
- * plus a bitmap for tracking which blocks are empty.
+ * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
+ * than both spark.shuffle.accurateBlockThreshold and
+ * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
+ * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
  *
  * @param loc location where the task is being executed
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizes sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
     private[this] var loc: BlockManagerId,
     private[this] var numNonEmptyBlocks: Int,
     private[this] var emptyBlocks: RoaringBitmap,
-    private[this] var avgSize: Long)
+    private[this] var avgSize: Long,
+    @transient private var hugeBlockSizes: Map[Int, Byte])
--- End diff --

Yes, I think it makes sense to add a bitmap for huge blocks, but I'm a little hesitant. I still prefer to keep `hugeBlockSizes` independent of the upper-level logic. In addition, the accurate sizes of blocks can also have a positive effect on pending requests (e.g. `spark.reducer.maxSizeInFlight` can control the size of pending requests better).
[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/18031#discussion_r117610285

--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
     } else {
       0
     }
+    val threshold1 = Option(SparkEnv.get)
+      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+    val threshold2 = avgSize * Option(SparkEnv.get)
+      .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+      .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+    val threshold = math.max(threshold1, threshold2)
+    val hugeBlockSizesArray = ArrayBuffer[Tuple2[Int, Byte]]()
+    if (numNonEmptyBlocks > 0) {
+      i = 0
+      while (i < totalNumBlocks) {
+        if (uncompressedSizes(i) > threshold) {
+          hugeBlockSizesArray += Tuple2(i, MapStatus.compressSize(uncompressedSizes(i)))
+        }
+        i += 1
+      }
+    }
     emptyBlocks.trim()
     emptyBlocks.runOptimize()
-    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize)
+    new HighlyCompressedMapStatus(loc, numNonEmptyBlocks, emptyBlocks, avgSize,
--- End diff --

@viirya Thanks a lot for taking the time to look into this PR :)

> remove the huge blocks from the numerator in that calculation so that you more accurately size the smaller blocks

Yes, I think that's a really good idea for sizing the smaller blocks accurately. But since I'm proposing two configs (`spark.shuffle.accurateBlockThreshold` and `spark.shuffle.accurateBlockThresholdByTimesAverage`) in the current change, I would have to compute the average twice: 1) the average including huge blocks, so I can filter the huge blocks out; 2) the average excluding huge blocks, so I can size the smaller blocks accurately. A little complicated, right? How about removing `spark.shuffle.accurateBlockThresholdByTimesAverage`? That would simplify the logic. @cloud-fan Any ideas about this?
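The threshold selection being debated in this thread can be sketched independently of Spark's config machinery. The snippet below is a minimal illustration, not the actual `MapStatus` code: `fixedThreshold` and `timesAverage` are hypothetical stand-ins for `spark.shuffle.accurateBlockThreshold` and `spark.shuffle.accurateBlockThresholdByTimesAverage`, and a block is recorded accurately only when it exceeds the larger of the two derived thresholds.

```scala
// Sketch of the "huge block" selection under discussion (illustrative only,
// not the Spark implementation). A block is huge only if it exceeds BOTH
// the fixed threshold and timesAverage * avgSize, i.e. the max of the two.
object HugeBlockSketch {
  def hugeBlocks(sizes: Array[Long],
                 fixedThreshold: Long,
                 timesAverage: Double): Map[Int, Long] = {
    val nonEmpty = sizes.filter(_ > 0)
    val avgSize = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
    val threshold = math.max(fixedThreshold, (avgSize * timesAverage).toLong)
    sizes.zipWithIndex.collect {
      case (size, reduceId) if size > threshold => reduceId -> size
    }.toMap
  }
}
```

Note how this exposes viirya's concern above: with sizes `[10, 10, 10, 1000]` the skewed block is flagged, but with `[1000, 1000, 1000, 1000]` the average-based threshold rises above every block, so nothing is flagged even though all blocks are large.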
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77122/
[GitHub] spark issue #17298: [SPARK-19094][WIP][PySpark] Plumb through logging for IJ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17298 Merged build finished. Test FAILed.