[GitHub] spark issue #19824: [SPARK][STREAMING] Invoke onBatchCompletion() only when ...
Github user victor-wong commented on the issue: https://github.com/apache/spark/pull/19824 @viirya Sorry for the misleading title; I have changed it now.
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19856

**[Test build #4002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4002/testReport)** for PR 19856 at commit [`652068b`](https://github.com/apache/spark/commit/652068b58f4140206d461bc95cdac4531f750e65).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18906 @ptkool, for https://github.com/apache/spark/pull/18906#issuecomment-348350023, can you point out the Scala API rather than just saying it's consistent with the Scala side and making reviewers check through the code here and there?
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831 Since Hive can't prevent users from setting wrong stats properties, I think this solution can alleviate the problem. Besides, it's consistent with what we do for `totalSize` and `rawDataSize` (only use the stats when > 0).
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/19841 Yes, I am also confused about why the diff is so big, but that is what git originally reported.
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r154250160

```diff
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
     // Note that this statistics could be overridden by Spark's statistics if that's available.
     val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
     val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
-    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
+    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
```

--- End diff --

Thanks for the investigation. It seems Hive can't protect its stats properties.
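To see the effect of this one-character change, here is a minimal sketch in plain Scala (illustrative only, using a bare `Map` in place of the PR's Hive table properties): with the old `>= 0` filter, a bogus property `numRows=0` survives as `Some(0)` and makes the optimizer treat the table as empty, while `> 0` discards it.

```scala
// Minimal sketch: how each filter treats a wrong "numRows = 0" property.
val properties = Map("numRows" -> "0") // bad stats written outside Spark

val oldRowCount = properties.get("numRows").map(BigInt(_)).filter(_ >= 0) // Some(0) -> table looks empty
val newRowCount = properties.get("numRows").map(BigInt(_)).filter(_ > 0)  // None    -> bad value ignored
```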
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154250197

```diff
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }

-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates
+          var lowerId, higherId = -1
+          for (i <- hgmBins.indices) {
+            // if datum > upperBound, just move to next bin
+            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
+            if (higherId < 0) {
+              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
+                  ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
+                higherId = i
+              }
+            }
+          }
+          assert(lowerId <= higherId)
+          val lowerBinNdv = hgmBins(lowerId).ndv
+          val higherBinNdv = hgmBins(higherId).ndv
+          // assume uniform distribution in each bin
+          val percent = if (lowerId == higherId) {
+            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
+          } else {
+            1.0 / hgmBins.length * (higherId - lowerId - 1) +
+              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
+              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
+          }
+          Some(percent)
```

--- End diff --

Good point.
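For intuition about the `percent` formula in the quoted diff, a toy worked example (numbers invented for illustration, not from the PR): each bin of an equi-height histogram holds 1/numBins of the rows, and under the uniform-distribution assumption an equality literal that falls into a single bin is further divided by that bin's ndv.

```scala
// Toy example: 4 equi-height bins, the literal falls into one bin with ndv = 5.
val numBins = 4
val binNdv = 5L
val selectivity = (1.0 / numBins) / math.max(binNdv, 1) // = 0.05, i.e. 5% of rows
```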
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154249995

```diff
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }

+  /**
+   * Returns the number of the first bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {
```

--- End diff --

No. I meant the upper bound of the array of bins in a histogram. The default length of the histogram bin array is 254. When i is equal to 253 (the last bin), i + 1 is 254, leading to an out-of-bounds error.
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831

> Besides, if the size stats totalSize or rawDataSize is wrong, the problem exists whether CBO is enabled or not.
> If CBO is enabled, the outputRowCount == 0, the getOutputSize is 1, sizeInBytes is 1, and this side can broadcast; if CBO is disabled, the sizeInBytes = (p.child.stats.sizeInBytes * outputRowSize) / childRowSize and this side can't broadcast.

@wangyum `totalSize` or `rawDataSize` can also be wrong, right?
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84356/ Test PASSed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Merged build finished. Test PASSed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823

**[Test build #84356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84356/testReport)** for PR 19823 at commit [`f6eb4ad`](https://github.com/apache/spark/commit/f6eb4ada04e21760f805b3e50062f91a438cef56).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248775

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- (the same `findFirstBinForValue`/`findLastBinForValue` hunk quoted above, ending at `if (value == histogram.bins(i + 1).lo) {`) --- End diff --

By "out of bound", do you mean it exceeds the 100-character line-length limit? You can just switch to a new line after `&&`.
[GitHub] spark pull request #19839: SPARK-22373 Bump Janino dependency version to fix...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19839
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19856 **[Test build #4002 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4002/testReport)** for PR 19856 at commit [`652068b`](https://github.com/apache/spark/commit/652068b58f4140206d461bc95cdac4531f750e65).
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19839 Merged to master/2.2/2.1
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248457

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- (the same hunk quoted above, at `var binId = 0` in `findFirstBinForValue`) --- End diff --

If a user changes the data, statistics will be removed or re-collected (currently only size); Spark already implements this.
[GitHub] spark pull request #19826: [SPARK-22428][DOC] Add spark application garbage ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19826
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19826 Merged to master
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r154245570

```diff
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -418,7 +418,7 @@ private[hive] class HiveClientImpl(
     // Note that this statistics could be overridden by Spark's statistics if that's available.
     val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_))
     val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_))
-    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0)
+    val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0)
```

--- End diff --

Maybe this could be clearer:

```scala
val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_))
val stats = if (totalSize.isDefined && totalSize.get > 0L) {
  Some(CatalogStatistics(sizeInBytes = totalSize.get, rowCount = rowCount.filter(_ > 0)))
} else if (rawDataSize.isDefined && rawDataSize.get > 0) {
  Some(CatalogStatistics(sizeInBytes = rawDataSize.get, rowCount = rowCount.filter(_ > 0)))
} else {
  None
}
```
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19823 LGTM pending Jenkins
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84360/ Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906

**[Test build #84360 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84360/testReport)** for PR 18906 at commit [`0727597`](https://github.com/apache/spark/commit/0727597bfdb35d03df7050c9777b29efb9add4b1).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user Victsm commented on the issue: https://github.com/apache/spark/pull/19839 @srowen @kiszk @mgaido91 Is this patch ready to merge?
[GitHub] spark issue #19856: [SPARK-22664] The logs about "Connected to Zookeeper" in...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19856 @srowen Thanks for your reply. Could you help me review it?
[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19805 Will review it this weekend.
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154240414

```diff
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -542,7 +496,54 @@ private[spark] class SecurityManager(
    * Gets the secret key.
    * @return the secret key as a String if authentication is enabled, otherwise returns null
    */
-  def getSecretKey(): String = secretKey
+  def getSecretKey(): String = {
+    if (isAuthenticationEnabled) {
+      Option(sparkConf.getenv(ENV_AUTH_SECRET))
```

--- End diff --

I changed things and now they work pretty much as before. It would be good to separate secret generation from distribution, but I'd rather do that separately.
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19631 **[Test build #84363 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84363/testReport)** for PR 19631 at commit [`c752453`](https://github.com/apache/spark/commit/c752453b2a379f301d52692bfc639bb631520069).
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user rezasafi commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154238528

```diff
--- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
@@ -70,7 +70,8 @@ object SparkHadoopMapRedUtil extends Logging {
     if (shouldCoordinateWithDriver) {
       val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
       val taskAttemptNumber = TaskContext.get().attemptNumber()
-      val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber)
+      val stageId = TaskContext.get().stageId()
+      val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber)
```

--- End diff --

It makes sense to use stageId there, since before, jobId was used instead of stageId. I will test that.
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user rezasafi commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154237986

```diff
--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
@@ -102,14 +103,15 @@ object SparkHadoopWriter extends Logging {
       context: TaskContext,
       config: HadoopWriteConfigUtil[K, V],
       jobTrackerId: String,
+      commitJobId: Int,
       sparkStageId: Int,
       sparkPartitionId: Int,
       sparkAttemptNumber: Int,
       committer: FileCommitProtocol,
       iterator: Iterator[(K, V)]): TaskCommitMessage = {
     // Set up a task.
     val taskContext = config.createTaskAttemptContext(
-      jobTrackerId, sparkStageId, sparkPartitionId, sparkAttemptNumber)
+      jobTrackerId, commitJobId, sparkPartitionId, sparkAttemptNumber)
```

--- End diff --

I removed it a few minutes ago.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848 **[Test build #84362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84362/testReport)** for PR 19848 at commit [`92f9180`](https://github.com/apache/spark/commit/92f9180b0fea71ff3ae4aa3049e04d6f1e3167be).
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154235818

--- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala --- (the same `canCommit` hunk quoted above) --- End diff --

Shouldn't `CommitDeniedException` (below) be updated to use the stage ID also? Otherwise the exception might have incomplete information. With that change it's possible that `jobId` might become unused in this method.
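A sketch of what that suggestion might look like (hypothetical code, not the actual patch; it assumes `CommitDeniedException` keeps its existing `(message, jobID, splitID, attemptNumber)` constructor shape):

```scala
// Hypothetical: thread the stage ID through the denial path too, so the
// exception reports the same ID the coordinator checked against.
val stageId = TaskContext.get().stageId()
val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber)
if (!canCommit) {
  val message = s"attempt $taskAttemptNumber of partition $splitId in stage $stageId was denied"
  throw new CommitDeniedException(message, stageId, splitId, taskAttemptNumber)
}
```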
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19848#discussion_r154236366

--- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala --- (the same `createTaskAttemptContext` hunk quoted above) --- End diff --

`sparkStageId` is now unused in this method.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84355/ Test PASSed.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19848 Merged build finished. Test PASSed.
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848

**[Test build #84355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84355/testReport)** for PR 19848 at commit [`f4ef351`](https://github.com/apache/spark/commit/f4ef351a05394ada5449e51a64606f0e5e7647c3).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19828: [SPARK-22614] Dataset API: repartitionByRange(......
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19828
[GitHub] spark issue #19828: [SPARK-22614] Dataset API: repartitionByRange(...)
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19828 LGTM Thanks! Merged to master.
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19714
[GitHub] spark issue #19714: [SPARK-22489][SQL] Shouldn't change broadcast join build...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19714 LGTM Thanks! Merged to master. Could you submit a follow-up PR to document the behavior changes in the migration section of the Spark SQL guide? https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19715

**[Test build #84357 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84357/testReport)** for PR 19715 at commit [`97ad483`](https://github.com/apache/spark/commit/97ad483d05740b43221d693c2f42f79609422a23).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19715 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84357/ Test PASSed.
[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19715 Merged build finished. Test PASSed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906

**[Test build #84361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84361/testReport)** for PR 18906 at commit [`f8e4904`](https://github.com/apache/spark/commit/f8e490464d59a8802f68f1bf2c0d2a59e89dc008).

* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84361/ Test FAILed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84354/ Test PASSed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Merged build finished. Test PASSed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717

**[Test build #84354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84354/testReport)** for PR 19717 at commit [`0f48f03`](https://github.com/apache/spark/commit/0f48f036d6d68fb06d24298a11cc2c139d00735e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #84361 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84361/testReport)** for PR 18906 at commit [`f8e4904`](https://github.com/apache/spark/commit/f8e490464d59a8802f68f1bf2c0d2a59e89dc008).
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user ptkool commented on the issue: https://github.com/apache/spark/pull/18906 @holdenk I believe the changes in this PR match what's provided in the Scala API. Am I missing something?
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Merged build finished. Test PASSed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84352/ Test PASSed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717

**[Test build #84352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84352/testReport)** for PR 19717 at commit [`357bd85`](https://github.com/apache/spark/commit/357bd8529e44266beacaec50bf6507da41e4cd24).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19826 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84358/ Test PASSed.
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19826

**[Test build #84358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84358/testReport)** for PR 19826 at commit [`241dbd0`](https://github.com/apache/spark/commit/241dbd069929c9f9eb6f1f9e45485693c9d4cd60).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19826 Merged build finished. Test PASSed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #84360 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84360/testReport)** for PR 18906 at commit [`0727597`](https://github.com/apache/spark/commit/0727597bfdb35d03df7050c9777b29efb9add4b1).
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154225069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- (the same `findFirstBinForValue`/`findLastBinForValue` hunk quoted above, ending at `if (value == histogram.bins(i + 1).lo) {`) --- End diff --

I used two statements instead of one because, when i points to the last bin, the condition `value == histogram.bins(i + 1).lo` may be out of bounds. By separating the conditions into two statements, we can be sure that the out-of-bounds error will not happen.
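For reference, the combined form wzhfy suggested would also be safe, because Scala's `&&` short-circuits left to right and never evaluates `bins(i + 1)` unless the bounds check has already passed (a sketch, not code from the PR):

```scala
// && stops at the first false operand, so bins(i + 1) is only read
// when i < histogram.bins.length - 1 is known to hold.
if (value == histogram.bins(i).hi &&
    i < histogram.bins.length - 1 &&
    value == histogram.bins(i + 1).lo) {
  // value sits exactly on the boundary shared by bins i and i + 1
}
```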
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223769

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- (the same hunk quoted above, at `var binId = 0` in `findLastBinForValue`) --- End diff --

Same comment as in my last reply.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223705

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- (the same hunk quoted above, at `var binId = 0` in `findFirstBinForValue`) --- End diff --

I hesitate to add an assert statement here. This is because an assert like this may cause the Spark system to crash if a user does not refresh his data statistics promptly. In the real world, a user may load data, collect statistics, and then add more incremental data without collecting statistics immediately. He may issue a SQL query against his newly added data, such as "WHERE column=xxx", where xxx is a new value from the incremental load. After all, statistics are auxiliary; a query should still run even if the statistics are not up to date.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84359/ Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18906 Merged build finished. Test FAILed.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906

**[Test build #84359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84359/testReport)** for PR 18906 at commit [`90e1684`](https://github.com/apache/spark/commit/90e16845e2e24a835387c13537c841d55e1b85c7).

* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18906 **[Test build #84359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84359/testReport)** for PR 18906 at commit [`90e1684`](https://github.com/apache/spark/commit/90e16845e2e24a835387c13537c841d55e1b85c7).
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19826 **[Test build #84358 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84358/testReport)** for PR 19826 at commit [`241dbd0`](https://github.com/apache/spark/commit/241dbd069929c9f9eb6f1f9e45485693c9d4cd60).
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823 **[Test build #84356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84356/testReport)** for PR 19823 at commit [`f6eb4ad`](https://github.com/apache/spark/commit/f6eb4ada04e21760f805b3e50062f91a438cef56).
[GitHub] spark issue #19715: [SPARK-22397][ML]add multiple columns support to Quantil...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19715 **[Test build #84357 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84357/testReport)** for PR 19715 at commit [`97ad483`](https://github.com/apache/spark/commit/97ad483d05740b43221d693c2f42f79609422a23).
[GitHub] spark issue #19826: [SPARK-22428][DOC] Add spark application garbage collect...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19826 retest this please
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19823 ok to test
[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19631

> For instance if someone is starting a server that is getting hit by other users that server could be started with the same env and then inadvertently expose the secret to other users.

If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users.

That being said, it seems Spark's own thrift server does that. If the following works in a spark-shell, it probably would do the same through the STS:

```
scala> spark.sql("set spark.sql.columnNameOfCorruptRecord=${env:SCALA_HOME}").show()
+--------------------+------------------+
|                 key|             value|
+--------------------+------------------+
|spark.sql.columnN...|/apps/scala-2.11.7|
+--------------------+------------------+
```

So this change would expose that secret to users also in YARN mode (it's already exposed in Standalone and Mesos currently, because it's in the config). Let me think a little about this. I prefer the environment to the credentials approach because the latter is written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18692
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18692 LGTM Thanks for your patience! It looks much better now. I really appreciate your contributions! Welcome to make more contributions! Thanks! Merged to master.
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154218810

--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala --- (the same `getSecretKey()` hunk quoted above) --- End diff --

I can add a different, internal config for this if re-using `SPARK_AUTH_SECRET_CONF` is confusing. But I'm not too concerned about exposing this to the user code running the application; they can just as easily get that info from the UGI currently. Spark already redacts this kind of information when writing it to things like the event log, which would be one place where it might leak out.
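A sketch of the lookup order under discussion (assuming the standard `spark.authenticate.secret` config key; the environment variable name here is illustrative, not necessarily the one the PR uses): prefer the environment, fall back to the config.

```scala
// Illustrative only: consult the environment first, then SparkConf.
val secret: Option[String] =
  Option(sparkConf.getenv("_SPARK_AUTH_SECRET")) // hypothetical env var name
    .orElse(sparkConf.getOption("spark.authenticate.secret"))
```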
[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19631#discussion_r154218236

```diff
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -58,18 +58,14 @@ import org.apache.spark.util.{CallerContext, Utils}
 private[spark] class Client(
     val args: ClientArguments,
-    val hadoopConf: Configuration,
     val sparkConf: SparkConf)
   extends Logging {

   import Client._
   import YarnSparkHadoopUtil._

-  def this(clientArgs: ClientArguments, spConf: SparkConf) =
-    this(clientArgs, SparkHadoopUtil.get.newConfiguration(spConf), spConf)
-
   private val yarnClient = YarnClient.createYarnClient
-  private val yarnConf = new YarnConfiguration(hadoopConf)
+  private val hadoopConf = new YarnConfiguration(SparkHadoopUtil.newConfiguration(sparkConf))
```

--- End diff --

Both are effectively the same; I chose to keep the name used more often to minimize changes.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84351/ Test PASSed.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Merged build finished. Test PASSed.
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18692

**[Test build #84351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84351/testReport)** for PR 18692 at commit [`9ab91a1`](https://github.com/apache/spark/commit/9ab91a19cefd63b7d28674992b68da8164d487ae).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84353/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811

**[Test build #84353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84353/testReport)** for PR 19811 at commit [`7ca94e1`](https://github.com/apache/spark/commit/7ca94e1ac2297e7c5e5de3a32a09050a010780fa).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 @mridulm @jiangxb1987 @jerryshao @ueshin all comments so far have been addressed. The PR has also been updated to include a missing step for `local://` dependency support. PTAL. Thanks!
[GitHub] spark issue #19828: [SPARK-22614] Dataset API: repartitionByRange(...)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19828 Merged build finished. Test PASSed.
[GitHub] spark issue #19828: [SPARK-22614] Dataset API: repartitionByRange(...)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19828 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84347/ Test PASSed.
[GitHub] spark issue #19828: [SPARK-22614] Dataset API: repartitionByRange(...)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19828

**[Test build #84347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84347/testReport)** for PR 19828 at commit [`012baa0`](https://github.com/apache/spark/commit/012baa0e088dbabd6d10f943e8a794caadfbb207).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154201470

```diff
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val subExprs = getSubExprInChildren(ctx, expr)
+
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+    val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs)
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * Given the list of eliminated subexpressions used in the children expressions, returns the
+   * strings of funtion parameters. The first is the variable names used to call the function,
+   * the second is the parameters used to declare the function in generated code.
+   */
+  def getParamsForSubExprs(
+      ctx: CodegenContext,
+      subExprs: Seq[Expression]): Seq[(String, String)] = {
+    subExprs.flatMap { subExpr =>
+      val argType = ctx.javaType(subExpr.dataType)
+
+      val subExprState = ctx.subExprEliminationExprs(subExpr)
+      (subExprState.value, subExprState.isNull)
+
+      if (!subExpr.nullable || subExprState.isNull == "true" || subExprState.isNull == "false") {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"))
+      } else {
+        Seq((subExprState.value, s"$argType ${subExprState.value}"),
+          (subExprState.isNull, s"boolean ${subExprState.isNull}"))
+      }
+    }.distinct
+  }
+
+  /**
+   * Retrieves previous input rows referred by children and deferred expressions.
+   */
+  def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = {
```
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154200835
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154200309
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154199646
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns all the necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st element in the returned tuple is the parameter strings used to call the function.
+   * The 2nd element in the returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val subExprs = getSubExprInChildren(ctx, expr)
+
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+    val paramsFromSubExprs = getParamsForSubExprs(ctx, subExprs)
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * Given the list of eliminated subexpressions used in the children expressions, returns the
+   * strings of function parameters. The first is the variable names used to call the function,
+   * the second is the parameters used to declare the function in generated code.
+   */
+  def getParamsForSubExprs(
+      ctx: CodegenContext,
+      subExprs: Seq[Expression]): Seq[(String, String)] = {
+    subExprs.flatMap { subExpr =>
+      val argType = ctx.javaType(subExpr.dataType)
+
+      val subExprState = ctx.subExprEliminationExprs(subExpr)
+      (subExprState.value, subExprState.isNull)
--- End diff --
shouldn't this line be removed? I think it is useless..
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
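For context on the review point above: the flagged statement builds a tuple and immediately discards it, so it is dead code. A minimal, self-contained Scala sketch of the same pairing logic with that statement removed — `SubExprState` is a hypothetical stand-in for Spark's `SubExprEliminationState`, and `argType`/`nullable` are passed in rather than derived from a real `CodegenContext`:

    // Hypothetical stand-in: the generated variable holding the value, and
    // the variable name (or boolean literal) holding its nullness.
    case class SubExprState(value: String, isNull: String)

    // Same (call-argument, declaration) pairing as the patch, minus the
    // discarded tuple.
    def paramsForSubExprs(
        states: Seq[(SubExprState, String, Boolean)]): Seq[(String, String)] =
      states.flatMap { case (state, argType, nullable) =>
        // When the expression can't be null, or nullness folded to a literal,
        // only the value needs to cross the split-function boundary.
        if (!nullable || state.isNull == "true" || state.isNull == "false") {
          Seq((state.value, s"$argType ${state.value}"))
        } else {
          Seq((state.value, s"$argType ${state.value}"),
            (state.isNull, s"boolean ${state.isNull}"))
        }
      }.distinct

The `paramsLength > 255` check in the quoted method is a hard JVM limit: a method descriptor allows at most 255 parameter slots, where `long` and `double` take two slots each and an instance method's implicit `this` takes one (JVMS §4.3.3). A rough sketch of that slot accounting — `paramSlots` is an illustrative helper, not Spark API:

    def paramSlots(javaTypes: Seq[String], isStatic: Boolean): Int = {
      val thisSlot = if (isStatic) 0 else 1
      thisSlot + javaTypes.map {
        case "long" | "double" => 2 // two slots per JVM spec
        case _                 => 1
      }.sum
    }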
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154199208
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154199027
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154197311
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19591 Ping. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154195688
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
--- End diff --
nit: can we move this closer to when it is used, ie. before `val paramsFromColumns ...`?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154195724
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val subExprs = getSubExprInChildren(ctx, expr)
--- End diff --
ditto
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19848 **[Test build #84355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84355/testReport)** for PR 19848 at commit [`f4ef351`](https://github.com/apache/spark/commit/f4ef351a05394ada5449e51a64606f0e5e7647c3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154194781
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
--- End diff --
I see that as of now `ctx.INPUT_ROW` can't be `null` here, but can we add an assertion and maybe a comment for the reason of this? I think this would help people who might want to reuse this function in the future.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
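A sketch of what the suggested guard could look like, under the assumption that this helper is only reachable when `INPUT_ROW` is set; the function name and assertion message below are hypothetical, and the real check would sit at the top of `getExpressionInputParams` using `ctx.INPUT_ROW` directly:

    // Fail fast with a reason rather than letting a null row name leak into
    // the generated Java source.
    def inputRowsForExpr(inputRow: String, rowsFromChildren: Seq[String]): Seq[String] = {
      assert(inputRow != null,
        "INPUT_ROW must be set before collecting split-function parameters")
      inputRow +: rowsFromChildren
    }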
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154193346
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -55,8 +55,20 @@ import org.apache.spark.util.{ParentClassLoader, Utils}
  * to null.
  * @param value A term for a (possibly primitive) value of the result of the evaluation. Not
  *              valid if `isNull` is set to `true`.
+ * @param inputRow A term that holds the input row name when generating this code.
+ * @param inputVars A list of [[ExprInputVar]] that holds input variables when generating this
+ *                  code.
  */
-case class ExprCode(var code: String, var isNull: String, var value: String)
+case class ExprCode(
+    var code: String,
+    var isNull: String,
+    var value: String,
+    var inputRow: String = null,
+    val inputVars: mutable.ArrayBuffer[ExprInputVar] = mutable.ArrayBuffer.empty)
--- End diff --
if feasible, I'd prefer an immutable collection here (see my comment above)
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
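A sketch of the immutable alternative being asked for, with a hypothetical simplified `ExprInputVar` (the real class carries more); callers would rebuild the case class via `copy` instead of appending to a shared buffer:

    case class ExprInputVar(ordinal: Int, name: String) // simplified stand-in

    case class ExprCode(
        var code: String,
        var isNull: String,
        var value: String,
        var inputRow: String = null,
        inputVars: Seq[ExprInputVar] = Seq.empty)

    // Instead of mutating in place (eval.inputVars += v), produce a copy:
    // val updated = eval.copy(inputVars = eval.inputVars :+ v)

The trade-off is an extra allocation per update, in exchange for removing the risk of one expression's codegen mutating state observed by another.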
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154192687
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -115,9 +116,35 @@ abstract class Expression extends TreeNode[Expression] {
     }
   }
 
+  /**
+   * Records current input row and variables for this expression into created `ExprCode`.
+   */
+  private def populateInputs(ctx: CodegenContext, eval: ExprCode): Unit = {
+    if (ctx.INPUT_ROW != null) {
+      eval.inputRow = ctx.INPUT_ROW
+    }
+    if (ctx.currentVars != null) {
+      val boundRefs = this.collect {
+        case b @ BoundReference(ordinal, _, _) if ctx.currentVars(ordinal) != null =>
+          (ordinal, b)
+      }.toMap
+
+      ctx.currentVars.zipWithIndex.filter(_._1 != null).foreach { case (currentVar, idx) =>
--- End diff --
why is this filter needed?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
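One likely answer, based on how `ctx.currentVars` behaves elsewhere in Spark codegen: its entries are null for ordinals whose column is still read from `ctx.INPUT_ROW` rather than from an already-generated local variable, so the filter keeps only ordinals that actually have a bound variable. An illustration with plain strings standing in for `ExprCode` values:

    // null entries model columns that must still be fetched from the input row.
    val currentVars: Seq[String] = Seq("project_value_0", null, "project_value_2")

    val bound = currentVars.zipWithIndex.filter(_._1 != null)
    // bound == Seq(("project_value_0", 0), ("project_value_2", 2)); ordinal 1
    // has no bound variable, so emitted code falls back to an InternalRow access.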