[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110699779

--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
--- End diff --

Copying docs directly from the Scala docs could be confusing, since we won't support this in 2.0 and 2.1, and the changes since 2.0 don't really affect us here.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r110717717

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +108,60 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val eval = $(evaluator)
     val epm = $(estimatorParamMaps)
     val numModels = epm.length
-    val metrics = new Array[Double](epm.length)
+
+    // Create execution context, run in serial if numParallelEval is 1
+    val executionContext = $(numParallelEval) match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(executorServiceFactory(n))
+    }
     val instr = Instrumentation.create(this, dataset)
     instr.logParams(numFolds, seed)
     logTuningParams(instr)
+    // Compute metrics for each model over each split
+    logDebug(s"Running cross-validation with level of parallelism: $numParallelEval.")
     val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
-    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
+    val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
       val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
-      // multi-model training
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
-      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
-      trainingDataset.unpersist()
-      var i = 0
-      while (i < numModels) {
-        // TODO: duplicate evaluator to take extra params from input
-        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
-        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
-        metrics(i) += metric
-        i += 1
+
+      // Fit models in a Future with thread-pool size determined by '$numParallelEval'
+      val models = epm.map { paramMap =>
+        Future[Model[_]] {
+          val model = est.fit(trainingDataset, paramMap)
+          model.asInstanceOf[Model[_]]
+        } (executionContext)
       }
+
+      Future.sequence[Model[_], Iterable](models)(implicitly, executionContext).onComplete { _ =>
+        trainingDataset.unpersist()
+      } (executionContext)
+
+      // Evaluate models in a Future with thread-pool size determined by '$numParallelEval'
+      val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
+        modelFuture.flatMap { model =>
+          Future {
+            // TODO: duplicate evaluator to take extra params from input
+            val metric = eval.evaluate(model.transform(validationDataset, paramMap))
+            logDebug(s"Got metric $metric for model trained with $paramMap.")
+            metric
+          } (executionContext)
+        } (executionContext)
+      }
+
+      // Wait for metrics to be calculated before upersisting validation dataset
+      val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
--- End diff --

Sure, not a big deal either way.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692150

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala ---
@@ -220,6 +221,32 @@ class FileIndexSuite extends SharedSQLContext {
       assert(catalog.leafDirPaths.head == fs.makeQualified(dirPath))
     }
   }
+
+  test("SPARK-20280 - FileStatusCache with a partition with very many files") {
+    /* fake the size, otherwise we need to allocate 2GB of data to trigger this bug */
+    class MyFileStatus extends FileStatus with KnownSizeEstimation {
+      override def estimatedSize: Long = 1000 * 1000 * 1000
+    }
+    /* files * MyFileStatus.estimatedSize should overflow to negative integer
+     * so, make it between 2bn and 4bn
+     */
+    val files = (1 to 3).map { i =>
+      new MyFileStatus()
+    }
+    val fileStatusCache = FileStatusCache.getOrCreate(spark)
+    fileStatusCache.putLeafFiles(new Path("/tmp", "abc"), files.toArray)
+    // scalastyle:off
--- End diff --

Let's remove this comment block; the JIRA should be used for tracking these things.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692271

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala ---
@@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging {
   // Opaque object that uniquely identifies a shared cache user
   private type ClientId = Object
+  private val warnedAboutEviction = new AtomicBoolean(false)
   // we use a composite cache key in order to distinguish entries inserted by different clients
-  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder()
-    .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] {
-      override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = {
-        (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt
-      }})
-    .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() {
-      override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]])
+  private val cache: Cache[(ClientId, Path), Array[FileStatus]] = {
+    /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB
--- End diff --

NIT: Could you use java style comments `//` here?
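The overflow discussed in this PR can be illustrated with a small sketch. Python integers do not overflow, so the helper `to_int32` below is an assumption that emulates the JVM's `Long.toInt` truncation; `clamped_weight` is a hypothetical saturating alternative shown only for contrast and is not claimed to be the fix adopted in the PR. The 3 x 1,000,000,000 figure mirrors the fake file sizes in the FileIndexSuite test quoted earlier.

```python
INT_MAX = 2**31 - 1  # largest value a signed 32-bit JVM Int can hold


def to_int32(value: int) -> int:
    """Emulate JVM Long.toInt: keep the low 32 bits and read them as signed."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value


def clamped_weight(estimated_bytes: int) -> int:
    """Hypothetical saturating weigher: cap at INT_MAX instead of wrapping."""
    return min(estimated_bytes, INT_MAX)


# Three fake file statuses of 1e9 bytes each: between 2bn and 4bn in total.
total = 3 * 1000 * 1000 * 1000

print(to_int32(total))        # -1294967296: a negative weight after truncation
print(clamped_weight(total))  # 2147483647
```

Any total between 2^31 and 2^32 wraps to a negative number under truncation, which is why the test aims for a size between 2bn and 4bn.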
[GitHub] spark pull request #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17566
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17330

**[Test build #75664 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)** for PR 17330 at commit [`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6).
[GitHub] spark issue #17587: [SPARK-20274][SQL] support compatible array element type...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/17587

@cloud-fan Good to see what plan is generated in your comment. People (especially me) will forget what plan we generated in the future. When I saw the comment in `upCastToExpectedType`, I understood this PR. I agree with @viirya's comment. LGTM except this comment.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591

**[Test build #75665 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75665/testReport)** for PR 17591 at commit [`e02fc4a`](https://github.com/apache/spark/commit/e02fc4aa651f6803dc6f4844fc0570187a167e80).
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077

**[Test build #75666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)** for PR 17077 at commit [`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47).
[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17566

LGTM - merging to master. Thanks!
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/17330#discussion_r110693397

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
       children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
     case _ => false
   }
+  def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = plan.transformAllExpressions {
+      case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
--- End diff --

@cloud-fan Actually, you are right. Preserving the OuterReference would be good.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077

**[Test build #75667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)** for PR 17077 at commit [`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).
[GitHub] spark pull request #17308: [SPARK-19968][SS] Use a cached instance of `Kafka...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/17308#discussion_r110685760

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaProducer.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.{util => ju}
+
+import scala.collection.mutable
+
+import org.apache.kafka.clients.producer.KafkaProducer
+
+import org.apache.spark.internal.Logging
+
+private[kafka010] object CachedKafkaProducer extends Logging {
+
+  private val cacheMap = new mutable.HashMap[Int, KafkaProducer[Array[Byte], Array[Byte]]]()
+
+  private def createKafkaProducer(
+    producerConfiguration: ju.HashMap[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] = {
+    val kafkaProducer: KafkaProducer[Array[Byte], Array[Byte]] =
+      new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
+    cacheMap.put(producerConfiguration.hashCode(), kafkaProducer)
--- End diff --

True, my bad, I thought `KafkaSink` was a public API.
[GitHub] spark issue #17593: [SPARK-20279][WEB-UI]In web ui,'Only showing 200' should...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/17593

I agree with @srowen; we left it that way since sorting can change.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592

Merged build finished. Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17591

LGTM
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520

**[Test build #75668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75668/testReport)** for PR 17520 at commit [`b923bd5`](https://github.com/apache/spark/commit/b923bd5b2c79a84ada834a32b756ad0da80f12c6).
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110693499

--- Diff: python/pyspark/sql/tests.py ---
@@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self):
         df = self.spark.createDataFrame(data, schema=schema)
         df.collect()
 
+    def test_bucketed_write(self):
+        data = [
+            (1, "foo", 3.0), (2, "foo", 5.0),
+            (3, "bar", -1.0), (4, "bar", 6.0),
+        ]
+        df = self.spark.createDataFrame(data, ["x", "y", "z"])
+
+        # Test write with one bucketing column
+        df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name == "x" and c.isBucket]),
+            1
+        )
+        self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))
+
+        # Test write two bucketing columns
+        df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket")
+        self.assertEqual(
+            len([c for c in self.spark.catalog.listColumns("pyspark_bucket")
+                 if c.name in ("x", "y") and c.isBucket]),
--- End diff --

If you think it is better, I'll trust your judgment.
[GitHub] spark pull request #17585: [SPARK-20273] [SQL] Disallow Non-deterministic Fi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17585
[GitHub] spark issue #17364: [SPARK-20038] [SQL]: FileFormatWriter.ExecuteWriteTask.r...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/17364

@squito Is this ready to go in? As I warned, I'm not going to add tests for this, not on its own.
[GitHub] spark issue #17550: [SPARK-20240][SQL] SparkSQL support limitations of max d...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17550

We do not add new things to the 1.6 branch. Please open the PR against the master branch.
[GitHub] spark pull request #17480: [SPARK-20079][Core][yarn] Re registration of AM h...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17480#discussion_r110716862

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
    * yarn-client mode when AM re-registers after a failure.
    */
   def reset(): Unit = synchronized {
-    initializing = true
+    /**
+     * When some tasks need to be scheduled and initial executor = 0, resetting the initializing
+     * field may cause it to not be set to false in yarn.
+     * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
+     */
+    if (maxNumExecutorsNeeded() == 0) {
+      initializing = true
--- End diff --

This kinda raises the question: is it ever correct to set this to `true` here? This method is only called when the YARN client-mode AM is restarted, and at that point I'd expect initialization to have already happened (so I don't see a need to reset the field in any situation).
[GitHub] spark pull request #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpen...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17592
[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r110713910

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +108,60 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val eval = $(evaluator)
     val epm = $(estimatorParamMaps)
     val numModels = epm.length
-    val metrics = new Array[Double](epm.length)
+
+    // Create execution context, run in serial if numParallelEval is 1
+    val executionContext = $(numParallelEval) match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(executorServiceFactory(n))
+    }
     val instr = Instrumentation.create(this, dataset)
     instr.logParams(numFolds, seed)
     logTuningParams(instr)
+    // Compute metrics for each model over each split
+    logDebug(s"Running cross-validation with level of parallelism: $numParallelEval.")
     val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
-    splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
+    val metrics = splits.zipWithIndex.map { case ((training, validation), splitIndex) =>
       val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
-      // multi-model training
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
-      val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
-      trainingDataset.unpersist()
-      var i = 0
-      while (i < numModels) {
-        // TODO: duplicate evaluator to take extra params from input
-        val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
-        logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
-        metrics(i) += metric
-        i += 1
+
+      // Fit models in a Future with thread-pool size determined by '$numParallelEval'
+      val models = epm.map { paramMap =>
+        Future[Model[_]] {
+          val model = est.fit(trainingDataset, paramMap)
+          model.asInstanceOf[Model[_]]
+        } (executionContext)
       }
+
+      Future.sequence[Model[_], Iterable](models)(implicitly, executionContext).onComplete { _ =>
+        trainingDataset.unpersist()
+      } (executionContext)
+
+      // Evaluate models in a Future with thread-pool size determined by '$numParallelEval'
+      val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
+        modelFuture.flatMap { model =>
+          Future {
+            // TODO: duplicate evaluator to take extra params from input
+            val metric = eval.evaluate(model.transform(validationDataset, paramMap))
+            logDebug(s"Got metric $metric for model trained with $paramMap.")
+            metric
+          } (executionContext)
+        } (executionContext)
+      }
+
+      // Wait for metrics to be calculated before upersisting validation dataset
+      val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
--- End diff --

I thought about that, but since it's a blocking call anyway, it will still be bound by the longest running thread.
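The point about the blocking wait can be sketched outside Spark. The snippet below is a loose Python analogue using `concurrent.futures`, not the Scala code under review: `fit` and `evaluate` are hypothetical stand-ins for `est.fit` and `eval.evaluate`, and the fit and evaluate steps are fused into one task per parameter for simplicity instead of being chained as in the diff. Awaiting the futures one by one still returns only once the slowest task has finished, whatever order they are awaited in.

```python
from concurrent.futures import ThreadPoolExecutor


def fit(param):
    """Hypothetical stand-in for est.fit(trainingDataset, paramMap)."""
    return {"param": param}


def evaluate(model):
    """Hypothetical stand-in for eval.evaluate(model.transform(...))."""
    return model["param"] / 2


def fold_metrics(params, parallelism):
    """Fit and evaluate one model per parameter on a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # One task per parameter setting; at most `parallelism` run at once.
        futures = [pool.submit(lambda p=p: evaluate(fit(p))) for p in params]
        # Blocking, in-order wait: the list is complete only after the
        # slowest task finishes, so await order does not change total time.
        return [f.result() for f in futures]


print(fold_metrics([1, 2, 4], parallelism=2))
```

With `parallelism=1` the tasks run one after another on a single worker thread, which roughly corresponds to the serial case in the diff.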
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17592

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75662/ Test PASSed.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17592

**[Test build #75662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75662/testReport)** for PR 17592 at commit [`d486d60`](https://github.com/apache/spark/commit/d486d6015ae0129ce41e6683eae37243c843ba59).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077

**[Test build #75667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75667/testReport)** for PR 17077 at commit [`71c9e0f`](https://github.com/apache/spark/commit/71c9e0faf39b979eb7f61d74af8c1821d0a0bcf3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Merged build finished. Test PASSed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75667/ Test PASSed.
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r110714213 --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala --- @@ -67,6 +67,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria if (loadDefaults) { loadFromSystemProperties(false) } + // tentatively enable offHeap for test + set("spark.memory.offHeap.enabled", "true") + set("spark.memory.offHeap.size", "1gb") --- End diff -- Yes, it should be removed before merging this PR. This code was tentatively added to enable off-heap mode.
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17592 Merging to master. Thanks!
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17591#discussion_r110692833 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala --- @@ -94,27 +94,46 @@ private class SharedInMemoryCache(maxSizeInBytes: Long) extends Logging { // Opaque object that uniquely identifies a shared cache user private type ClientId = Object + private val warnedAboutEviction = new AtomicBoolean(false) // we use a composite cache key in order to distinguish entries inserted by different clients - private val cache: Cache[(ClientId, Path), Array[FileStatus]] = CacheBuilder.newBuilder() -.weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { - override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { -(SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt - }}) -.removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { - override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]]) + private val cache: Cache[(ClientId, Path), Array[FileStatus]] = { +/* [[Weigher]].weigh returns Int so we could only cache objects < 2GB + * instead, the weight is divided by this factor (which is smaller + * than the size of one [[FileStatus]]). + * so it will support objects up to 64GB in size. + */ +val weightScale = 32 +CacheBuilder.newBuilder() + .weigher(new Weigher[(ClientId, Path), Array[FileStatus]] { +override def weigh(key: (ClientId, Path), value: Array[FileStatus]): Int = { + val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale + if (estimate > Int.MaxValue) { +logWarning(s"Cached table partition metadata size is too big. Approximating to " + + s"${Int.MaxValue.toLong * weightScale}.") +Int.MaxValue + } else { +estimate.toInt + } +} + }) + .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { --- End diff -- This is kinda hard to read. 
Can we just initialize the weigher and the listener in separate variables?
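The scale-and-clamp arithmetic in the diff above can be checked in isolation. The following is a minimal Python sketch, not the Spark implementation: `scaled_weight` and `max_representable_bytes` are hypothetical helper names, and only the divisor 32 and the `Int.MaxValue` cap are taken from the diff.

```python
INT_MAX = 2**31 - 1   # largest value a JVM Int can hold
WEIGHT_SCALE = 32     # same divisor as `weightScale` in the diff


def scaled_weight(estimated_bytes, weight_scale=WEIGHT_SCALE):
    """Scale a byte estimate down, then clamp it to the Int range."""
    scaled = estimated_bytes // weight_scale
    return min(scaled, INT_MAX)


def max_representable_bytes(weight_scale=WEIGHT_SCALE):
    """Largest estimate that is still weighed without clamping (rounded down)."""
    return INT_MAX * weight_scale


if __name__ == "__main__":
    print(scaled_weight(64))             # a tiny entry
    print(scaled_weight(2**40))          # a 1 TiB estimate gets clamped
    print(max_representable_bytes())     # just under 64 GiB
```

With a divisor of 32 the cap lands just under 64 GiB (2^31 - 1 multiplied by 32), which matches the "up to 64GB" figure in the code comment.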
[GitHub] spark pull request #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for...
Github user zero323 commented on a diff in the pull request: https://github.com/apache/spark/pull/17077#discussion_r110692936 --- Diff: python/pyspark/sql/tests.py --- @@ -2167,6 +2167,61 @@ def test_BinaryType_serialization(self): df = self.spark.createDataFrame(data, schema=schema) df.collect() +def test_bucketed_write(self): +data = [ +(1, "foo", 3.0), (2, "foo", 5.0), +(3, "bar", -1.0), (4, "bar", 6.0), +] +df = self.spark.createDataFrame(data, ["x", "y", "z"]) + +# Test write with one bucketing column +df.write.bucketBy(3, "x").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write two bucketing columns +df.write.bucketBy(3, "x", "y").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort +df.write.bucketBy(2, "x").sortBy("z").mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name == "x" and c.isBucket]), +1 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with a list of columns +df.write.bucketBy(3, ["x", "y"]).mode("overwrite").saveAsTable("pyspark_bucket") +self.assertEqual( +len([c for c in self.spark.catalog.listColumns("pyspark_bucket") + if c.name in ("x", "y") and c.isBucket]), +2 +) +self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with a list of columns +(df.write.bucketBy(2, "x") +.sortBy(["y", "z"]) +.mode("overwrite").saveAsTable("pyspark_bucket")) +self.assertSetEqual(set(data), 
set(self.spark.table("pyspark_bucket").collect())) + +# Test write with bucket and sort with multiple columns +(df.write.bucketBy(2, "x") +.sortBy("y", "z") +.mode("overwrite").saveAsTable("pyspark_bucket")) --- End diff -- I don't think that dropping before is necessary. We overwrite on each write and name clashes are unlikely. We can drop the table after the tests but I am not sure how to do it right. `SQLTests` is overgrown and I am not sure if we should add `tearDown` only for this, but adding `DROP TABLE` in the test itself doesn't look right.
[GitHub] spark issue #12823: [SPARK-14985][ML] Update LinearRegression, LogisticRegre...
Github user BenFradet commented on the issue: https://github.com/apache/spark/pull/12823 ping @jkbradley if you could take a look, that'd be great. If you have the time, there is also the #17431 segue.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Merged build finished. Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17077 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75666/ Test FAILed.
[GitHub] spark issue #17077: [SPARK-16931][PYTHON][SQL] Add Python wrapper for bucket...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17077 **[Test build #75666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75666/testReport)** for PR 17077 at commit [`481416d`](https://github.com/apache/spark/commit/481416d695d804144d041a98ea929b88829ebe47). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16774: [SPARK-19357][ML] Adding parallel model evaluatio...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/16774#discussion_r110712166 --- Diff: docs/ml-tuning.md --- @@ -55,6 +55,9 @@ for multiclass problems. The default metric used to choose the best `ParamMap` c method in each of these evaluators. To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility. +Sets of parameters from the parameter grid can be evaluated in parallel by setting `numParallelEval` with a value of 2 or more (a value of 1 will evaluate in serial) before running model selection with `CrossValidator` or `TrainValidationSplit`. +The value of `numParallelEval` should be chosen carefully to maximize parallelism without exceeding cluster resources, and will be capped at the number of cores in the driver system. Generally speaking, a value up to 10 should be sufficient for most clusters. + --- End diff -- Since that API is marked as experimental, maybe it would be better not to document it right away, until we are sure this is what we need?
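The behaviour the proposed doc text describes (serial when the setting is 1, a bounded pool otherwise) can be sketched outside Spark. This is a plain-Python illustration under stated assumptions, not the `CrossValidator` code: `evaluate_grid`, `fit_and_score`, and `toy_score` are hypothetical names, and the thread pool stands in for the execution context in the PR.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_grid(param_maps, fit_and_score, num_parallel_eval=1):
    """Score every parameter map, serially or on a bounded thread pool."""
    if num_parallel_eval < 1:
        raise ValueError("num_parallel_eval must be >= 1")
    if num_parallel_eval == 1:
        # A value of 1 keeps everything on the calling thread.
        return [fit_and_score(params) for params in param_maps]
    with ThreadPoolExecutor(max_workers=num_parallel_eval) as pool:
        # Executor.map returns results in input order, so metrics[i]
        # still corresponds to param_maps[i].
        return list(pool.map(fit_and_score, param_maps))


def toy_score(params):
    """Hypothetical metric that peaks at regParam = 0.1."""
    return -(params["regParam"] - 0.1) ** 2


if __name__ == "__main__":
    grid = [{"regParam": r} for r in (0.01, 0.1, 1.0)]
    metrics = evaluate_grid(grid, toy_score, num_parallel_eval=2)
    best = grid[metrics.index(max(metrics))]
    print(best)
```

Keeping the results in grid order is what lets the caller pick the best parameter map by index, whichever mode was used.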
[GitHub] spark issue #17592: [SPARK-20243][TESTS] DebugFilesystem.assertNoOpenStreams...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17592 Should this go into branch-2.1 as well?
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17330 **[Test build #75664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75664/testReport)** for PR 17330 at commit [`f1e63c8`](https://github.com/apache/spark/commit/f1e63c8bc2a4689c15e8c11ddf5d2c7864cf96a6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17595 **[Test build #75671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75671/testReport)** for PR 17595 at commit [`263dc68`](https://github.com/apache/spark/commit/263dc688a1da8cfeef9f9e7e0168964a27e779b6).
[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...
Github user umehrot2 commented on the issue: https://github.com/apache/spark/pull/17445 @kayousterhout @mridulm @rxin @lins05 Can you take a look at this PR?
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17597 **[Test build #75673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75673/testReport)** for PR 17597 at commit [`67a1221`](https://github.com/apache/spark/commit/67a12213b8abbb9c41333ff9e0ffb3f70944a12e).
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17527 Merged to master
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/17594 LGTM, for fixing the issue with the test. We should separately decide if this is really the behavior we want for the commit log.
[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17596 **[Test build #75672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75672/testReport)** for PR 17596 at commit [`67044b3`](https://github.com/apache/spark/commit/67044b34ac60863d3fbdbe37ba89fffa85ffb2fe).
[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17596 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75672/ Test FAILed.
[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17596 **[Test build #75672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75672/testReport)** for PR 17596 at commit [`67044b3`](https://github.com/apache/spark/commit/67044b34ac60863d3fbdbe37ba89fffa85ffb2fe). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17596 Merged build finished. Test FAILed.
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17597 Merged build finished. Test PASSed.
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17597 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75673/ Test PASSed.
[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17595 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75671/ Test FAILed.
[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17595 **[Test build #75671 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75671/testReport)** for PR 17595 at commit [`263dc68`](https://github.com/apache/spark/commit/263dc688a1da8cfeef9f9e7e0168964a27e779b6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17595 Merged build finished. Test FAILed.
[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17495#discussion_r110722790 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala --- @@ -320,14 +321,15 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) .filter { entry => try { val prevFileSize = fileToAppInfo.get(entry.getPath()).map{_.fileSize}.getOrElse(0L) +fs.access(entry.getPath, FsAction.READ) --- End diff -- So, this API actually calls `fs.getFileStatus()`, which is a remote call and completely unnecessary in this context, since you already have the `FileStatus`. You could instead do the check directly against the `FsPermission` object (the same way `fs.access()` does after it performs the remote call).
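A local check of this kind can be sketched with POSIX-style mode bits. The Python below is an illustration only, assuming an owner/group/other rule in which exactly one permission class is consulted; `can_read` and its parameters are hypothetical names, not Hadoop or Spark API.

```python
import stat


def can_read(mode, owner, group, user, user_groups):
    """Decide read access from status fields that are already in hand.

    Exactly one permission class is consulted: owner bits for the owner,
    group bits for members of the file's group, other bits for the rest.
    """
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # rw-r----- owned by spark:hadoop
    print(can_read(0o640, "spark", "hadoop", "alice", ["hadoop"]))
    print(can_read(0o640, "spark", "hadoop", "bob", ["staff"]))
```

Because the inputs come from a status object the caller already holds, the decision needs no further round trip to the file system.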
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Merged build finished. Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17591 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75665/ Test PASSed.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17591 **[Test build #75665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75665/testReport)** for PR 17591 at commit [`e02fc4a`](https://github.com/apache/spark/commit/e02fc4aa651f6803dc6f4844fc0570187a167e80). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17594: Write the log first to fix a race condition in test...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/17594 Write the log first to fix a race condition in tests ## What changes were proposed in this pull request? This PR fixes the following failure: ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Assert on query failed: == Progress == AssertOnQuery(, ) StopStream AddData to MemoryStream[value#30891]: 1,2 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@35cdc93a,Map()) CheckAnswer: [6],[3] StopStream => AssertOnQuery(, ) AssertOnQuery(, ) StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@cdb247d,Map()) CheckAnswer: [6],[3] StopStream AddData to MemoryStream[value#30891]: 3 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@55394e4d,Map()) CheckLastBatch: [2] StopStream AddData to MemoryStream[value#30891]: 0 StartStream(OneTimeTrigger,org.apache.spark.util.SystemClock@749aa997,Map()) ExpectFailure[org.apache.spark.SparkException, isFatalError: false] AssertOnQuery(, ) AssertOnQuery(, incorrect start offset or end offset on exception) == Stream == Output Mode: Append Stream state: not started Thread state: dead == Sink == 0: [6] [3] == Plan == at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.sql.streaming.StreamTest$class.failTest$1(StreamTest.scala:347) at org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:318) at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:483) at org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:357) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at
org.apache.spark.sql.streaming.StreamTest$class.liftedTree1$1(StreamTest.scala:357) at org.apache.spark.sql.streaming.StreamTest$class.testStream(StreamTest.scala:356) at org.apache.spark.sql.streaming.StreamingQuerySuite.testStream(StreamingQuerySuite.scala:41) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply$mcV$sp(StreamingQuerySuite.scala:166) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) at org.apache.spark.sql.streaming.StreamingQuerySuite$$anonfun$6.apply(StreamingQuerySuite.scala:161) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:268) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:268) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQuerySuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at 
org.apache.spark.sql.streaming.StreamingQuerySuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQuerySuite.scala:41) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.sql.streaming.StreamingQuerySuite.runTest(StreamingQuerySuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110741246 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with Logging { case class Cost(card: BigInt, size: BigInt) { def +(other: Cost): Cost = Cost(this.card + other.card, this.size + other.size) } + +/** + * Implements optional filters to reduce the search space for join enumeration. + * + * 1) Star-join filters: Plan star-joins together since they are assumed + *to have an optimal execution based on their RI relationship. + * 2) Cartesian products: Defer their planning later in the graph to avoid + *large intermediate results (expanding joins, in general). + * 3) Composite inners: Don't generate "bushy tree" plans to avoid materializing + * intermediate results. + * + * Filters (2) and (3) are not implemented. + */ +case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper { + /** + * Builds join graph information to be used by the filtering strategies. + * Currently, it builds the sets of star/non-star joins. + * It can be extended with the sets of connected/unconnected joins, which + * can be used to filter Cartesian products. + */ + def buildJoinGraphInfo( + items: Seq[LogicalPlan], + conditions: Set[Expression], + planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = { + +// Compute the tables in a star-schema relationship. 
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq) +val nonStarJoin = items.filterNot(starJoin.contains(_)) + +if (starJoin.nonEmpty && nonStarJoin.nonEmpty) { + val (starInt, nonStarInt) = planIndex.collect { +case (p, i) if starJoin.contains(p) => + (Some(i), None) +case (p, i) if nonStarJoin.contains(p) => + (None, Some(i)) +case _ => + (None, None) + }.unzip + Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet)) +} else { + // Nothing interesting to return. + None +} + } + + /** + * Applies star-join filter. + * + * Given the outer/inner and the star/non-star sets, + * the following plan combinations are allowed: + * 1) (outer U inner) is a subset of star-join --- End diff -- @cloud-fan The ```outer/inner``` represents the join plan combinations generated by ```JoinReorderDP```. ```JoinReorderDP``` calls them ```oneJoinPlan/otherJoinPlan```. I will rename them to align to join DP. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17598: [SPARK-20284][CORE] Make {Des,S}erializationStream exten...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17598 Can one of the admins verify this patch?
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/17597 LGTM!
[GitHub] spark pull request #17597: [SPARK-20285][Tests]Increase the pyspark streamin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17597
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/17594 Thanks! Merging to master.
[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/17594#discussion_r110735241 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -304,8 +304,8 @@ class StreamExecution( finishTrigger(dataAvailable) if (dataAvailable) { // Update committed offsets. -committedOffsets ++= availableOffsets batchCommitLog.add(currentBatchId) --- End diff -- This is existing, but do we not write to the commit log if there is no new data in the next batch?
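The ordering this PR's title describes can be sketched in plain Python (not Spark's Scala): the batch id is appended to the commit log before the in-memory committed offsets are updated, so anything that observes the new offsets can rely on the log entry already being there. The class and field names below are hypothetical stand-ins, not Spark's `StreamExecution` API, and the sketch only mirrors the `if (dataAvailable)` branch shown in the diff.

```python
class TinyStreamExecution:
    """Toy model of the commit step; all names here are illustrative."""

    def __init__(self):
        self.commit_log = []          # stands in for the durable batch commit log
        self.committed_offsets = {}   # in-memory view that a test might poll
        self.available_offsets = {}
        self.current_batch_id = 0

    def finish_batch(self, data_available):
        if data_available:
            # 1. Record the batch in the commit log first.
            self.commit_log.append(self.current_batch_id)
            # 2. Only then publish the offsets that observers can see.
            self.committed_offsets.update(self.available_offsets)
            self.current_batch_id += 1


execution = TinyStreamExecution()
execution.available_offsets = {"memory-source": 2}
execution.finish_batch(data_available=True)
execution.finish_batch(data_available=False)  # no new data: nothing is logged
```

In this toy model a batch without new data writes nothing to the commit log, which is the behaviour the review question above is asking about.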
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110742657 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -218,28 +220,44 @@ object JoinReorderDP extends PredicateHelper with Logging { } /** - * Builds a new JoinPlan when both conditions hold: + * Builds a new JoinPlan if the following conditions hold: * - the sets of items contained in left and right sides do not overlap. * - there exists at least one join condition involving references from both sides. + * - if star-join filter is enabled, allow the following combinations: + * 1) (oneJoinPlan U otherJoinPlan) is a subset of star-join + * 2) star-join is a subset of (oneJoinPlan U otherJoinPlan) + * 3) (oneJoinPlan U otherJoinPlan) is a subset of non star-join + * * @param oneJoinPlan One side JoinPlan for building a new JoinPlan. * @param otherJoinPlan The other side JoinPlan for building a new join node. * @param conf SQLConf for statistics computation. * @param conditions The overall set of join conditions. * @param topOutput The output attributes of the final plan. + * @param filters Join graph info to be used as filters by the search algorithm. * @return Builds and returns a new JoinPlan if both conditions hold. Otherwise, returns None. */ private def buildJoin( oneJoinPlan: JoinPlan, otherJoinPlan: JoinPlan, conf: SQLConf, conditions: Set[Expression], - topOutput: AttributeSet): Option[JoinPlan] = { + topOutput: AttributeSet, + filters: Option[JoinGraphInfo]): Option[JoinPlan] = { if (oneJoinPlan.itemIds.intersect(otherJoinPlan.itemIds).nonEmpty) { // Should not join two overlapping item sets. return None } +if (conf.joinReorderDPStarFilter && filters.isDefined) { + // Apply star-join filter, which ensures that tables in a star schema relationship + // are planned together. 
+ val isValidJoinCombination = +JoinReorderDPFilters(conf).starJoinFilter(oneJoinPlan.itemIds, otherJoinPlan.itemIds, + filters.get) + if (!isValidJoinCombination) return None --- End diff -- @cloud-fan The star filter will eliminate joins between star and non-star tables until the star-joins are built. The assumption is that the tables in a star-join should be planned together. I will add more comments.
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17594 **[Test build #75670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75670/testReport)** for PR 17594 at commit [`6b63179`](https://github.com/apache/spark/commit/6b631799033f708ae61338a3fc8e5d63255fa803).
[GitHub] spark issue #17596: [SPARK-12837][SQL] reduce the serialized size of accumul...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17596 cc @rxin @davies @andrewor14
[GitHub] spark pull request #17596: [SPARK-12837][SQL] reduce the serialized size of ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17596 [SPARK-12837][SQL] reduce the serialized size of accumulator ## What changes were proposed in this pull request? When sending accumulator updates back to the driver, the network overhead is significant because there are many accumulators, e.g. `TaskMetrics` sends about 20 accumulators every time, and there may be many `SQLMetric`s if the query plan is complicated. Therefore, it's critical to reduce the size of the serialized accumulators. This PR proposes two changes: 1. Do not send the accumulator name to the executor side, as it's unnecessary. When an executor sends accumulator updates back to the driver, we can easily look up the accumulator name in `AccumulatorContext`. 2. Introduce `InternalLongAccumulator` which can slightly reduce the size, but can also take over the `setValue` functionality from `LongAccumulator`, which doesn't fit well. Tried on the example in https://issues.apache.org/jira/browse/SPARK-12837, the size of the serialized accumulator was cut down by around 50%. ## How was this patch tested? existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark oom Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17596.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17596 commit 67044b34ac60863d3fbdbe37ba89fffa85ffb2fe Author: Wenchen Fan Date: 2017-04-10T12:31:47Z reduce the serialized size of accumulator
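A rough illustration of the first idea, in plain Python with made-up names: if the driver keeps an id-to-name registry, the update sent back by an executor only needs the id and the value. The `registry` dict is a hypothetical stand-in for `AccumulatorContext`, the metric names are invented, and JSON is used only to make the size difference visible; Spark's actual serialization is different.

```python
import json

# Driver-side registry: accumulator id -> name (hypothetical stand-in).
registry = {
    1: "example.metric.runTime",
    2: "example.metric.resultSize",
}


def encode_with_name(acc_id, value):
    """Update payload that carries the name along with every update."""
    return json.dumps({"id": acc_id, "name": registry[acc_id], "value": value})


def encode_without_name(acc_id, value):
    """Update payload that carries only the id and the value."""
    return json.dumps({"id": acc_id, "value": value})


def decode_on_driver(payload):
    """The driver restores the name from its own registry using the id."""
    update = json.loads(payload)
    return registry[update["id"]], update["value"]


with_name = encode_with_name(1, 1234)
without_name = encode_without_name(1, 1234)
name, value = decode_on_driver(without_name)
```

The shorter payload still resolves to the same name on the driver, which is why the name does not have to travel with each update.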
[GitHub] spark pull request #17598: [SPARK-20284][CORE] Make {Des,S}erializationStrea...
GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/17598 [SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable ## What changes were proposed in this pull request? This PR allows `SerializationStream` and `DeserializationStream` to be used in try-with-resources. ## How was this patch tested? `core` unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/criteo-forks/spark compression-stream-closeable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17598.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17598 commit 75ba026db26171e0ed59d48d0ab2855f2a2af757 Author: Sergei Lebedev Date: 2017-04-10T19:32:12Z [SPARK-20284][CORE] Make {Des,S}erializationStream extend Closeable This change allows these streams to be used in try-with-resources.
[GitHub] spark issue #17591: [SPARK-20280][CORE] FileStatusCache Weigher integer over...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17591 Merging to master. Thanks!
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110740755 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -54,8 +54,6 @@ case class CostBasedJoinReorder(conf: SQLConf) extends Rule[LogicalPlan] with Pr private def reorder(plan: LogicalPlan, output: Seq[Attribute]): LogicalPlan = { val (items, conditions) = extractInnerJoins(plan) -// TODO: Compute the set of star-joins and use them in the join enumeration -// algorithm to prune un-optimal plan choices. --- End diff -- @cloud-fan Star-schema detection is called first to compute the set of tables connected by a star-schema relationship, e.g. {F1, D1, D2} in our code example. This call does not do any join reordering among the tables. It simply computes the set of tables in a star-schema relationship. Then, DP join enumeration generates all possible plan combinations among the entire set of tables in the join, e.g. {F1, D1}, {F1, T1}, {T2, T3}, etc. The star-filter, if called, will eliminate plan combinations that mix star and non-star tables until the star-join combinations are built. For example, the {F1, D1} combination will be retained since it involves only tables in a star schema, but {F1, T1} will be eliminated since it mixes star and non-star tables. The star-filter simply decides which combinations to retain; it does not decide the order of execution of those tables. The order of the joins within a star-join and for the overall plan is decided by the DP join enumeration. The star-filter only ensures that tables in a star-join are planned together.
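The retain/eliminate decisions described in this comment can be sketched with plain Python sets. This mirrors the three allowed combinations listed in the diff comments of this PR (subset of star, superset of star, subset of non-star); it is an illustration with a hypothetical function name, not the actual `starJoinFilter` implementation, and the non-star set {T1, T2, T3} is assumed from the table names used in the example above.

```python
def star_join_filter(one_side, other_side, star, non_star):
    """Return True if joining the two sides is an allowed combination."""
    combined = one_side | other_side
    return (
        combined <= star          # 1) still entirely inside the star-join
        or star <= combined       # 2) the whole star-join is already included
        or combined <= non_star   # 3) only non-star tables are involved
    )


star = {"F1", "D1", "D2"}
non_star = {"T1", "T2", "T3"}

keep_star_pair = star_join_filter({"F1"}, {"D1"}, star, non_star)
drop_mixed_pair = star_join_filter({"F1"}, {"T1"}, star, non_star)
keep_non_star_pair = star_join_filter({"T2"}, {"T3"}, star, non_star)
keep_after_star_built = star_join_filter({"F1", "D1", "D2"}, {"T1"}, star, non_star)
```

Note that the filter only accepts or rejects a combination; as the comment says, the join order itself is still chosen by the DP enumeration.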
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75668/ Test PASSed.
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17520 Merged build finished. Test PASSed.
[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17568 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75669/ Test PASSed.
[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17568 Merged build finished. Test PASSed.
[GitHub] spark pull request #17495: [SPARK-20172][Core] Add file permission check whe...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/17495#discussion_r110721933 --- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala --- @@ -571,6 +572,34 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc } } + test("log without read permission should be filtered out before actual reading") { +class TestFsHistoryProvider extends FsHistoryProvider(createTestConf()) { --- End diff -- I think you'll need this from another test in this same file: ``` // setReadable(...) does not work on Windows. Please refer JDK-6728842. assume(!Utils.isWindows) ``` In fact shouldn't this test be merged with the test for SPARK-3697?
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17330 Merged build finished. Test PASSed.
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75664/ Test PASSed.
[GitHub] spark pull request #17546: [SPARK-20233] [SQL] Apply star-join filter heuris...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/17546#discussion_r110742361 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -327,3 +345,104 @@ object JoinReorderDP extends PredicateHelper with Logging { case class Cost(card: BigInt, size: BigInt) { def +(other: Cost): Cost = Cost(this.card + other.card, this.size + other.size) } + +/** + * Implements optional filters to reduce the search space for join enumeration. + * + * 1) Star-join filters: Plan star-joins together since they are assumed + *to have an optimal execution based on their RI relationship. + * 2) Cartesian products: Defer their planning later in the graph to avoid + *large intermediate results (expanding joins, in general). + * 3) Composite inners: Don't generate "bushy tree" plans to avoid materializing + * intermediate results. + * + * Filters (2) and (3) are not implemented. + */ +case class JoinReorderDPFilters(conf: SQLConf) extends PredicateHelper { + /** + * Builds join graph information to be used by the filtering strategies. + * Currently, it builds the sets of star/non-star joins. + * It can be extended with the sets of connected/unconnected joins, which + * can be used to filter Cartesian products. + */ + def buildJoinGraphInfo( + items: Seq[LogicalPlan], + conditions: Set[Expression], + planIndex: Seq[(LogicalPlan, Int)]): Option[JoinGraphInfo] = { + +// Compute the tables in a star-schema relationship. 
+val starJoin = StarSchemaDetection(conf).findStarJoins(items, conditions.toSeq) +val nonStarJoin = items.filterNot(starJoin.contains(_)) + +if (starJoin.nonEmpty && nonStarJoin.nonEmpty) { + val (starInt, nonStarInt) = planIndex.collect { +case (p, i) if starJoin.contains(p) => + (Some(i), None) +case (p, i) if nonStarJoin.contains(p) => + (None, Some(i)) +case _ => + (None, None) + }.unzip + Some(JoinGraphInfo(starInt.flatten.toSet, nonStarInt.flatten.toSet)) +} else { + // Nothing interesting to return. + None +} + } + + /** + * Applies star-join filter. + * + * Given the outer/inner and the star/non-star sets, + * the following plan combinations are allowed: + * 1) (outer U inner) is a subset of star-join + * 2) star-join is a subset of (outer U inner) + * 3) (outer U inner) is a subset of non star-join + * + * It assumes the sets are disjoint. + * + * Example query graph: + * + * t1 d1 - t2 - t3 + * \ / + * f1 + * | + * d2 + * + * star: {d1, f1, d2} + * non-star: {t2, t1, t3} --- End diff -- @cloud-fan ```outer/inner``` i.e. ```oneJoinPlan/otherJoinPlan``` represent all the plan permutations generated by ```JoinReorderDP```. For example, at level 1, join enumeration will combine plans from level 0 e.g. ```oneJoinPlan = (f1)``` and ```otherJoinPlan = (d1)```. At level 2, it will generate plans from plan combinations at level 0 and level 1 e.g. ```oneJoinPlan = (d2)``` and ```otherJoinPlan = {f1, d1}```, and so on. I will clarify the comment with more details.
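The level-by-level enumeration described in this comment can be sketched in plain Python. This is a deliberately simplified model with a hypothetical function name: it only tracks which item sets appear at each level and pairs non-overlapping sets from earlier levels, ignoring join conditions, cost comparison and the star filter that the real `JoinReorderDP` applies.

```python
from itertools import product


def enumerate_levels(items):
    """Build join candidates level by level; level n holds sets of n + 1 items."""
    levels = [[frozenset([item]) for item in items]]
    while len(levels) < len(items):
        top = len(levels) - 1
        found = set()
        # Pair level k with level (top - k) so the sizes add up to the new level.
        for k in range(top // 2 + 1):
            for one, other in product(levels[k], levels[top - k]):
                if not (one & other):  # the two sides must not overlap
                    found.add(one | other)
        levels.append(sorted(found, key=sorted))
    return levels


levels = enumerate_levels(["f1", "d1", "d2"])
```

With the three star tables from the example, level 1 contains the pairs such as {f1, d1}, and level 2 is produced by combining a level 0 plan such as (d2) with a level 1 plan such as {f1, d1}.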
[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...
Github user nihavend commented on the issue: https://github.com/apache/spark/pull/17527 Thank you all very much for your efforts. I sometimes face the same issue on different platforms and look for a way to set the JVM options for the locale explicitly, but often there is no way to do so. Most platform owners do not treat this as a serious problem or take as much care as you do. For this reason, giving users a way to define the locale manually may solve most locale-related issues. I hope this helps in the future.
[GitHub] spark issue #17566: [SPARK-19518][SQL] IGNORE NULLS in first / last in SQL
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17566 Thank you @hvanhovell.
[GitHub] spark pull request #17597: [SPARK-20285][Tests]Increase the pyspark streamin...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/17597

[SPARK-20285][Tests]Increase the pyspark streaming test timeout to 30 seconds

## What changes were proposed in this pull request?

Saw the following failure locally:

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
    self._test_func(input, func, expected, sort=True, input2=input2)
  File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
    self.assertEqual(expected, result)
AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []

First list contains 3 additional elements.
First extra element 0:
[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]

+ []
- [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
-  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
-  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
```

It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120

It's because when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds.

## How was this patch tested?

Jenkins

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-20285

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17597.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #17597

commit 67a12213b8abbb9c41333ff9e0ffb3f70944a12e
Author: Shixiong Zhu
Date: 2017-04-10T19:47:21Z

Increase the pyspark streaming test timeout to 30 seconds
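The failure in PR #17597 (an empty result list compared against the expected batches) is the typical symptom of a test that stops waiting before a slow machine has produced any output. The actual change to `pyspark/streaming/tests.py` is not shown in this message, so the following is only a minimal sketch of the general pattern, assuming a hypothetical helper named `wait_for` with hypothetical `timeout` and `poll_interval` parameters: poll a condition until it holds or a generous deadline passes.

```python
import time


def wait_for(condition, timeout=30.0, poll_interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One final check so a result that arrives right at the deadline still counts.
    return condition()
```

A test would call something like `wait_for(lambda: len(result) == len(expected))` before asserting. Raising the timeout costs nothing when the machine is fast, because the loop returns as soon as the condition holds; it only matters on an overloaded host.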
[GitHub] spark issue #17595: [SPARK-20283][SQL] Add preOptimizationBatches
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/17595 LGTM - pending jenkins.
[GitHub] spark pull request #17591: [SPARK-20280][CORE] FileStatusCache Weigher integ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17591
[GitHub] spark issue #17520: [WIP][SPARK-19712][SQL] Move PullupCorrelatedPredicates ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17520

**[Test build #75668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75668/testReport)** for PR 17520 at commit [`b923bd5`](https://github.com/apache/spark/commit/b923bd5b2c79a84ada834a32b756ad0da80f12c6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerBlockManagerAdded(`
  * `class StorageStatus(`
  * `public final class JavaStructuredSessionization `
  * ` public static class LineWithTimestamp implements Serializable `
  * ` public static class Event implements Serializable `
  * ` public static class SessionInfo implements Serializable `
  * ` public static class SessionUpdate implements Serializable `
  * `case class Event(sessionId: String, timestamp: Timestamp)`
  * `case class SessionInfo(`
  * `case class SessionUpdate(`
  * `class Correlation(object):`
  * `case class UnresolvedMapObjects(`
  * `case class AssertNotNull(child: Expression, walkedTypePath: Seq[String] = Nil)`
  * `case class StarSchemaDetection(conf: SQLConf) extends PredicateHelper `
  * ` * Helper case class to hold (plan, rowCount) pairs.`
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/17597 Thanks! Merging to master, 2.1 and 2.0.
[GitHub] spark issue #17582: [SPARK-20239][Core] Improve HistoryServer's ACL mechanis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17582

> user configured with "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started SHS could list all the applications, otherwise none of them can be listed

So to me this is the only bug, which means that maybe ACLs shouldn't ever be applied to the listing itself, and this PR should be a lot simpler, right?

Most of it seems to deal with filtering the list of apps so that only the applications the user can see are shown. I wonder if that's necessary, since the only thing the listing reveals is the existence of the application, not any data about it that could be considered sensitive.

There's also the minor concern that a listing that differs between users might cause confusion; if there's a good reason for filtering, that concern can be overridden. I'm just not sure there is a good reason for it.
[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17568 **[Test build #75669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75669/testReport)** for PR 17568 at commit [`3be368c`](https://github.com/apache/spark/commit/3be368c84f82f65baf475805b3f56441cece3128).
[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/17594#discussion_r110735710

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -304,8 +304,8 @@ class StreamExecution(
 finishTrigger(dataAvailable)
 if (dataAvailable) {
 // Update committed offsets.
-committedOffsets ++= availableOffsets
 batchCommitLog.add(currentBatchId)
--- End diff --

I think so. I think it's because we only update `currentBatchId` when data is available. cc @tdas
[GitHub] spark pull request #17595: [SPARK-20283][SQL] Add preOptimizationBatches
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/17595

[SPARK-20283][SQL] Add preOptimizationBatches

## What changes were proposed in this pull request?

We currently have postHocOptimizationBatches, but not preOptimizationBatches. This patch adds preOptimizationBatches so the optimizer debugging extensions are symmetric.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20283

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17595.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #17595

commit 263dc688a1da8cfeef9f9e7e0168964a27e779b6
Author: Reynold Xin
Date: 2017-04-10T19:00:24Z

[SPARK-20283][SQL] Add preOptimizationBatches
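The symmetry described in PR #17595 can be pictured with a toy model. The sketch below is not Spark's optimizer (which is written in Scala and whose code is not shown in this message); `ToyOptimizer`, its attribute names, and the representation of a plan as a list of strings are all hypothetical. It also assumes, as the names suggest, that pre-optimization batches run before the built-in batches and post-hoc batches run after them.

```python
class ToyOptimizer:
    """Toy rule executor with extension points before and after the defaults."""

    def __init__(self, default_batches):
        # Each batch is a (name, rule) pair; a rule maps a plan to a new plan.
        self.default_batches = list(default_batches)
        self.pre_optimization_batches = []       # hypothetical extension point
        self.post_hoc_optimization_batches = []  # hypothetical extension point

    def batches(self):
        # Assumed ordering: pre batches, then defaults, then post-hoc batches.
        return (self.pre_optimization_batches
                + self.default_batches
                + self.post_hoc_optimization_batches)

    def optimize(self, plan):
        for _name, rule in self.batches():
            plan = rule(plan)
        return plan
```

With both hooks available, a debugging extension can observe or rewrite the plan on either side of the built-in batches without touching the batches themselves.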
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16347 Is this still a problem?
[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17527
[GitHub] spark issue #17597: [SPARK-20285][Tests]Increase the pyspark streaming test ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17597 **[Test build #75673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75673/testReport)** for PR 17597 at commit [`67a1221`](https://github.com/apache/spark/commit/67a12213b8abbb9c41333ff9e0ffb3f70944a12e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17546: [SPARK-20233] [SQL] Apply star-join filter heuristics to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17546 **[Test build #75674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75674/testReport)** for PR 17546 at commit [`7a5d1d0`](https://github.com/apache/spark/commit/7a5d1d0615a98fbee3c58934f92314bb92a97354).
[GitHub] spark pull request #17594: [SPARK-20282][SS][Tests]Write the commit log firs...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/17594#discussion_r110760826

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala ---
@@ -304,8 +304,8 @@ class StreamExecution(
 finishTrigger(dataAvailable)
 if (dataAvailable) {
 // Update committed offsets.
-committedOffsets ++= availableOffsets
 batchCommitLog.add(currentBatchId)
--- End diff --

Technically, when there is a batch with data, it finishes executing and then increments the batch id for the next batch. But if the next batch has no data, the batch id is not incremented further, and it eventually gets used by a future batch that has data. So I think it is correct to update the commit log as soon as this batch is done, and before the batch id is incremented for the next batch. As far as I can tell, this change maintains that invariant.
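The invariant described in this comment can be checked with a toy model. The sketch below is not the `StreamExecution` code; `run_triggers` and its variables are hypothetical names, and each trigger is reduced to a boolean saying whether data was available. It only illustrates the ordering: the batch id is written to the commit log when a batch with data finishes, before the id is incremented, and the id stays put across triggers without data.

```python
def run_triggers(triggers):
    """Simulate a trigger loop.

    `triggers` is a list of booleans: True if data was available for that
    trigger. Returns the commit log and the batch id the next batch would use.
    """
    current_batch_id = 0
    commit_log = []
    for data_available in triggers:
        if data_available:
            # The batch with this id has finished executing at this point.
            commit_log.append(current_batch_id)  # commit first ...
            current_batch_id += 1                # ... then move to the next id
        # No data: the id is left unchanged and reused by a later batch.
    return commit_log, current_batch_id
```

For the trigger sequence `[True, False, False, True]`, the two idle triggers leave the id at 1, so the final batch with data commits id 1 and the log holds consecutive ids with no gaps.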
[GitHub] spark issue #17568: [SPARK-20254][SQL] Remove unnecessary data conversion fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17568 **[Test build #75669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75669/testReport)** for PR 17568 at commit [`3be368c`](https://github.com/apache/spark/commit/3be368c84f82f65baf475805b3f56441cece3128). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17594 **[Test build #75670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75670/testReport)** for PR 17594 at commit [`6b63179`](https://github.com/apache/spark/commit/6b631799033f708ae61338a3fc8e5d63255fa803). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17594 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75670/ Test PASSed.
[GitHub] spark issue #17594: [SPARK-20282][SS][Tests]Write the commit log first to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17594 Merged build finished. Test PASSed.