[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21556

**[Test build #93014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93014/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class JavaSummarizerExample `
  * ` class SerializableConfiguration(@transient var value: Configuration)`
  * ` class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)`
  * ` case class SchemaType(dataType: DataType, nullable: Boolean)`
  * ` implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) `
  * ` implicit class AvroDataFrameReader(reader: DataFrameReader) `
  * `class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],`
  * `trait ComplexTypeMergingExpression extends Expression `
  * `case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes `
  * `abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast `
  * `case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike `

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93014/ Test FAILed.
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Merged build finished. Test FAILed.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/961/ Test PASSed.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Merged build finished. Test PASSed.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21657 **[Test build #93015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93015/testReport)** for PR 21657 at commit [`81b3971`](https://github.com/apache/spark/commit/81b397140486fab7f7c2f7dcb15d5a9a62c99845).
[GitHub] spark issue #21754: [SPARK-24705][SQL] Cannot reuse an exchange operator wit...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21754 ping
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/962/ Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Merged build finished. Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20629 **[Test build #93016 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93016/testReport)** for PR 20629 at commit [`926c353`](https://github.com/apache/spark/commit/926c35309e39b9137f6637a79f64bd22f6da84e0).
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21740#discussion_r202534384

    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModelSuite.scala ---
    @@ -72,6 +72,22 @@ class MatrixFactorizationModelSuite extends SparkFunSuite with MLlibTestSparkCon
         }
       }

    +  test("invalid user and product") {
    +    val model = new MatrixFactorizationModel(rank, userFeatures, prodFeatures)
    +    assert(intercept[IllegalArgumentException] {
    --- End diff --

Thanks for the review. Done.
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21740#discussion_r202534469

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
    @@ -75,10 +75,22 @@ class MatrixFactorizationModel @Since("0.8.0") (
         }
       }

    +  /** Check for the invalid user. */
    +  private def validateUser(user: Int): Unit = {
    +    require(userFeatures.lookup(user).nonEmpty, s"userId: $user not found in the model")
    --- End diff --

I have renamed the method to `validateAndGetUser`; it checks whether the user exists and returns the corresponding user features. The same applies to the product. Please let me know if any more changes are required.
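The validate-and-return pattern described in this comment can be sketched without Spark. This is only an illustration: the object name `ValidateAndGetSketch` and the use of a plain `Map` in place of the model's feature RDD are assumptions, not the code in the pull request.

```scala
object ValidateAndGetSketch {
  // Hypothetical stand-in: a Map replaces the RDD lookup used by the real model.
  def validateAndGetUser(userFeatures: Map[Int, Array[Double]], user: Int): Array[Double] = {
    val features = userFeatures.get(user)
    // require throws IllegalArgumentException when the condition is false.
    require(features.nonEmpty, s"userId: $user not found in the model")
    features.get
  }

  def main(args: Array[String]): Unit = {
    val features = Map(1 -> Array(0.5, 0.25))
    println(validateAndGetUser(features, 1).mkString(", "))
  }
}
```

Doing the check and the lookup in one method means the caller needs only a single lookup per id, which is presumably the motivation for the rename.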
[GitHub] spark pull request #21770: [SPARK-24806][SQL] Brush up generated code so tha...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21770#discussion_r202534515

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -318,7 +318,8 @@ case class SampleExec(
             v => s"""
               | $v = new $samplerClass($lowerBound, $upperBound, false);
               | $v.setSeed(${seed}L + partitionIndex);
    -         """.stripMargin.trim)
    +         """.stripMargin.trim,
    +        forceInline = true)
    --- End diff --

why do we need this?
[GitHub] spark pull request #21770: [SPARK-24806][SQL] Brush up generated code so tha...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21770#discussion_r202534499

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
    @@ -3758,7 +3758,10 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike
         } else {
           val arrayUnion = classOf[ArrayUnion].getName
           val et = ctx.addReferenceObj("elementTypeUnion", elementType)
    -      val order = ctx.addReferenceObj("orderingUnion", ordering)
    +      // Some data types (e.g., `BinaryType`) have anonymous classes for ordering and
    +      // `getCanonicalName` returns null in these classes. Therefore, we need to
    +      // explicitly set `className` here.
    --- End diff --

nit: as we are adding this comment, shall we also mention that Janino works anyway, but JDK complains here?
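The JVM behaviour the code comment refers to can be observed in isolation. The sketch below is not the patch: `ClassNameSketch` and `referenceClassName` are hypothetical names, and the fallback to `getName` is only one possible way to obtain a usable class name when `getCanonicalName` reports none.

```scala
object ClassNameSketch {
  // Hypothetical helper: prefer the canonical name and fall back to the binary
  // name when the JVM reports no canonical name for the object's class.
  def referenceClassName(obj: AnyRef): String =
    Option(obj.getClass.getCanonicalName).getOrElse(obj.getClass.getName)

  def main(args: Array[String]): Unit = {
    // An ordering defined inline, similar in spirit to the anonymous orderings
    // mentioned in the diff comment.
    val anonymousOrdering = new Ordering[Array[Byte]] {
      override def compare(x: Array[Byte], y: Array[Byte]): Int =
        java.lang.Integer.compare(x.length, y.length)
    }
    println(s"canonical: ${anonymousOrdering.getClass.getCanonicalName}")
    println(s"fallback:  ${referenceClassName(anonymousOrdering)}")
  }
}
```

Generated Java source that declares a variable of the referenced type needs a non-null, resolvable type name, which is why the diff sets `className` explicitly rather than relying on reflection.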
[GitHub] spark pull request #21770: [SPARK-24806][SQL] Brush up generated code so tha...
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21770#discussion_r202534510

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ---
    @@ -1585,6 +1585,9 @@ case class InitializeJavaBean(beanInstance: Expression, setters: Map[String, Exp
       }

       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    +    // Resolves setters before compilation
    +    require(resolvedSetters.nonEmpty)
    --- End diff --

why do we need to add this?
[GitHub] spark pull request #21748: [SPARK-23146][K8S] Support client mode.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21748#discussion_r202534636

    --- Diff: docs/running-on-kubernetes.md ---
    @@ -399,18 +426,18 @@ specific to Spark on Kubernetes.
     Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting
     executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod.
    -Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
    +Specify this as a path as opposed to a URI (i.e. do not provide a scheme). In client mode, use
    +spark.kubernetes.authenticate.caCertFile instead.
     spark.kubernetes.authenticate.driver.clientKeyFile
     (none)
     Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting
    -executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod.
    -Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this is specified, it is highly
    -recommended to set up TLS for the driver submission server, as this value is sensitive information that would be
    -passed to the driver pod in plaintext otherwise.
    --- End diff --

why remove ```If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.```?
[GitHub] spark pull request #21748: [SPARK-23146][K8S] Support client mode.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21748#discussion_r202535016

    --- Diff: docs/running-on-kubernetes.md ---
    @@ -117,6 +117,37 @@ If the local proxy is running at localhost:8001, `--master k8s://http://127.0.0.
     spark-submit. Finally, notice that in the above example we specify a jar with a specific URI with a scheme of `local://`.
     This URI is the location of the example jar that is already in the Docker image.

    +## Client Mode
    +
    +Starting with Spark 2.4.0, it is possible to run Spark applications on Kubernetes in client mode. When running a Spark
    +application in client mode, a separate pod is not deployed to run the driver. When running an application in
    --- End diff --

could we add a bit here after `a separate pod is not deployed to run the driver` to say that the client/driver could be outside k8s or in k8s/in a pod?
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21758#discussion_r202533903

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -359,17 +368,49 @@ private[spark] class TaskSchedulerImpl(
         // of locality levels so that it gets a chance to launch local tasks on all of them.
         // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
         for (taskSet <- sortedTaskSets) {
    -      var launchedAnyTask = false
    -      var launchedTaskAtCurrentMaxLocality = false
    -      for (currentMaxLocality <- taskSet.myLocalityLevels) {
    -        do {
    -          launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
    -            taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
    -          launchedAnyTask |= launchedTaskAtCurrentMaxLocality
    -        } while (launchedTaskAtCurrentMaxLocality)
    -      }
    -      if (!launchedAnyTask) {
    -        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
    +      // Skip the barrier taskSet if the available slots are less than the number of pending tasks.
    +      if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
    +        // Skip the launch process.
    +        logInfo(s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
    +          s"because the barrier taskSet requires ${taskSet.numTasks} slots, while the total " +
    +          s"number of available slots is ${availableSlots}.")
    +      } else {
    +        var launchedAnyTask = false
    +        var launchedTaskAtCurrentMaxLocality = false
    +        // Record all the executor IDs assigned barrier tasks on.
    +        val hosts = ArrayBuffer[String]()
    +        val taskDescs = ArrayBuffer[TaskDescription]()
    +        for (currentMaxLocality <- taskSet.myLocalityLevels) {
    +          do {
    +            launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
    +              currentMaxLocality, shuffledOffers, availableCpus, tasks, hosts, taskDescs)
    +            launchedAnyTask |= launchedTaskAtCurrentMaxLocality
    +          } while (launchedTaskAtCurrentMaxLocality)
    +        }
    +        if (!launchedAnyTask) {
    +          taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
    +        }
    +        if (launchedAnyTask && taskSet.isBarrier) {
    +          // Check whether the barrier tasks are partially launched.
    +          // TODO handle the assert failure case (that can happen when some locality requirements
    +          // are not fulfilled, and we should revert the launched tasks)
    +          require(taskDescs.size == taskSet.numTasks,
    +            s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
    +              s"because only ${taskDescs.size} out of a total number of ${taskSet.numTasks} " +
    +              "tasks got resource offers. The resource offers may have been blacklisted or " +
    +              "cannot fulfill task locality requirements.")
    --- End diff --

how many attempts - would it fail continuously if some hosts are blacklisted?
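The all-or-nothing gate at the top of this diff can be reduced to a small predicate. The following is a minimal sketch under stated assumptions: `TaskSetInfo` and `shouldOffer` are hypothetical names, and the real scheduler tracks far more state than this.

```scala
object BarrierOfferSketch {
  // Hypothetical, simplified view of a task set; not Spark's TaskSetManager.
  final case class TaskSetInfo(stageId: Int, numTasks: Int, isBarrier: Boolean)

  /** Returns true when the task set should receive resource offers in this round. */
  def shouldOffer(taskSet: TaskSetInfo, availableSlots: Int): Boolean =
    // A barrier task set needs a slot for every task at once; other task sets
    // can be launched incrementally, so they are never skipped here.
    !(taskSet.isBarrier && availableSlots < taskSet.numTasks)

  def main(args: Array[String]): Unit = {
    val barrierStage = TaskSetInfo(stageId = 3, numTasks = 4, isBarrier = true)
    println(shouldOffer(barrierStage, availableSlots = 2))
    println(shouldOffer(barrierStage, availableSlots = 4))
  }
}
```

Because the predicate is evaluated again on every offer round, a barrier task set that needs more slots than the round ever provides is skipped each time, which is the scenario the review questions on this diff raise.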
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21758#discussion_r202533477

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1386,29 +1418,90 @@ class DAGScheduler(
                   )
                 }
               }
    -          // Mark the map whose fetch failed as broken in the map stage
    -          if (mapId != -1) {
    -            mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    -          }
    +        }

    -          // TODO: mark the executor as failed only if there were lots of fetch failures on it
    -          if (bmAddress != null) {
    -            val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
    -              unRegisterOutputOnHostOnFetchFailure) {
    -              // We had a fetch failure with the external shuffle service, so we
    -              // assume all shuffle data on the node is bad.
    -              Some(bmAddress.host)
    -            } else {
    -              // Unregister shuffle data just for one executor (we don't have any
    -              // reason to believe shuffle data has been lost for the entire host).
    -              None
    +      case failure: TaskFailedReason if task.isBarrier =>
    +        // Also handle the task failed reasons here.
    +        failure match {
    +          case Resubmitted =>
    +            logInfo("Resubmitted " + task + ", so marking it as still running")
    +            stage match {
    +              case sms: ShuffleMapStage =>
    +                sms.pendingPartitions += task.partitionId
    +
    +              case _ =>
    +                assert(false, "TaskSetManagers should only send Resubmitted task statuses for " +
    +                  "tasks in ShuffleMapStages.")
                 }
    -            removeExecutorAndUnregisterOutputs(
    -              execId = bmAddress.executorId,
    -              fileLost = true,
    -              hostToUnregisterOutputs = hostToUnregisterOutputs,
    -              maybeEpoch = Some(task.epoch))
    +
    +          case _ => // Do nothing.
    +        }
    +
    +        // Always fail the current stage and retry all the tasks when a barrier task fail.
    +        val failedStage = stageIdToStage(task.stageId)
    +        logInfo(s"Marking $failedStage (${failedStage.name}) as failed due to a barrier task " +
    +          "failed.")
    +        val message = "Stage failed because a barrier task finished unsuccessfully. " +
    +          s"${failure.toErrorString}"
    +        try {
    +          // cancelTasks will fail if a SchedulerBackend does not implement killTask
    +          taskScheduler.cancelTasks(stageId, interruptThread = false)
    +        } catch {
    +          case e: UnsupportedOperationException =>
    +            // Cannot continue with barrier stage if failed to cancel zombie barrier tasks.
    +            logInfo(s"Could not cancel tasks for stage $stageId", e)
    --- End diff --

logWarn?
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21758#discussion_r202533650

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -359,17 +368,49 @@ private[spark] class TaskSchedulerImpl(
         // of locality levels so that it gets a chance to launch local tasks on all of them.
         // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
         for (taskSet <- sortedTaskSets) {
    -      var launchedAnyTask = false
    -      var launchedTaskAtCurrentMaxLocality = false
    -      for (currentMaxLocality <- taskSet.myLocalityLevels) {
    -        do {
    -          launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
    -            taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
    -          launchedAnyTask |= launchedTaskAtCurrentMaxLocality
    -        } while (launchedTaskAtCurrentMaxLocality)
    -      }
    -      if (!launchedAnyTask) {
    -        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
    +      // Skip the barrier taskSet if the available slots are less than the number of pending tasks.
    +      if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
    +        // Skip the launch process.
    +        logInfo(s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
    +          s"because the barrier taskSet requires ${taskSet.numTasks} slots, while the total " +
    +          s"number of available slots is ${availableSlots}.")
    --- End diff --

this could get starved forever?
[GitHub] spark pull request #21758: [SPARK-24795][CORE] Implement barrier execution m...
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21758#discussion_r202535313

    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1386,29 +1418,90 @@ class DAGScheduler(
                   )
                 }
               }
    -          // Mark the map whose fetch failed as broken in the map stage
    -          if (mapId != -1) {
    -            mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    -          }
    +        }

    -          // TODO: mark the executor as failed only if there were lots of fetch failures on it
    -          if (bmAddress != null) {
    -            val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
    -              unRegisterOutputOnHostOnFetchFailure) {
    -              // We had a fetch failure with the external shuffle service, so we
    -              // assume all shuffle data on the node is bad.
    -              Some(bmAddress.host)
    -            } else {
    -              // Unregister shuffle data just for one executor (we don't have any
    -              // reason to believe shuffle data has been lost for the entire host).
    -              None
    +      case failure: TaskFailedReason if task.isBarrier =>
    +        // Also handle the task failed reasons here.
    +        failure match {
    +          case Resubmitted =>
    +            logInfo("Resubmitted " + task + ", so marking it as still running")
    +            stage match {
    +              case sms: ShuffleMapStage =>
    +                sms.pendingPartitions += task.partitionId
    +
    +              case _ =>
    +                assert(false, "TaskSetManagers should only send Resubmitted task statuses for " +
    +                  "tasks in ShuffleMapStages.")
                 }
    -            removeExecutorAndUnregisterOutputs(
    -              execId = bmAddress.executorId,
    -              fileLost = true,
    -              hostToUnregisterOutputs = hostToUnregisterOutputs,
    -              maybeEpoch = Some(task.epoch))
    +
    +          case _ => // Do nothing.
    +        }
    +
    +        // Always fail the current stage and retry all the tasks when a barrier task fail.
    +        val failedStage = stageIdToStage(task.stageId)
    +        logInfo(s"Marking $failedStage (${failedStage.name}) as failed due to a barrier task " +
    +          "failed.")
    +        val message = "Stage failed because a barrier task finished unsuccessfully. " +
    +          s"${failure.toErrorString}"
    --- End diff --

add task id of the failed barrier task? it would make it easier to root cause/find the error
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 retest this please
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Merged build finished. Test PASSed.
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/963/ Test PASSed.
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21556 **[Test build #93017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93017/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20629

**[Test build #93016 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93016/testReport)** for PR 20629 at commit [`926c353`](https://github.com/apache/spark/commit/926c35309e39b9137f6637a79f64bd22f6da84e0).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21771#discussion_r202536141

    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
           Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
             env.securityManager, hadoopConfiguration, timestamp, useCache = false)
           postEnvironmentUpdate()
    +    } else {
    +      logWarning(s"The path $path has been added already. Overwriting of added paths " +
    --- End diff --

@HyukjinKwon Our support team receives a few "bug" reports per month. For now we can at least provide a link to the note. The warning itself is needed so that our support engineers can detect this kind of problem from the logs of already finished jobs. Customers do not actually say in their bug reports that files/jars weren't overwritten (that would be easier). They report problems such as a call to a library method crashing because of an incompatible method signature, or a class that doesn't exist. Or the final result of a Spark job is incorrect because a config/resource file added via `addFile()` is not up to date. With the warning I can detect the situation from the logs and provide a link to the docs for `addFile()`/`addJar()`.
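The add-once-then-warn behaviour discussed here can be sketched without a SparkContext. This is an illustration under assumptions: `AddedPathRegistry` is a hypothetical class, the warning text is invented for the sketch (the real message is truncated in the diff above), and writing to `Console.err` stands in for `logWarning`.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical registry of added paths; not SparkContext's actual bookkeeping.
final class AddedPathRegistry {
  private val addedPaths = TrieMap.empty[String, Long]

  /** Registers the path and returns true only the first time it is added. */
  def add(path: String, timestamp: Long): Boolean = {
    // putIfAbsent returns None when no previous entry existed for the key.
    val isNew = addedPaths.putIfAbsent(path, timestamp).isEmpty
    if (!isNew) {
      // Stand-in for logWarning; the wording here is illustrative only.
      Console.err.println(s"WARN: The path $path has been added already and is not overwritten.")
    }
    isNew
  }
}
```

A second `add` with the same path leaves the first entry in place and emits the warning, which is the log trace the comment wants support engineers to be able to find after a job has finished.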
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Merged build finished. Test PASSed.
[GitHub] spark issue #20629: [SPARK-23451][ML] Deprecate KMeans.computeCost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20629 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93016/ Test PASSed.
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21711#discussion_r202536655

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -138,17 +138,35 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }

       /**
    -   * Checks the validity of data column names. Hive metastore disallows the table to use comma in
    -   * data column names. Partition columns do not have such a restriction. Views do not have such
    -   * a restriction.
    +   * Checks the validity of data column names. Hive metastore disallows the table to use some
    +   * special characters (',', ':', and ';') in data column names. Partition columns do not have
    --- End diff --

ok
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21711#discussion_r202536673

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
    @@ -2005,6 +2005,24 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
         }
       }

    +  test("SPARK-24681 checks if nested column names do not include ',', ':', and ';'") {
    --- End diff --

ok
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21711#discussion_r202536743

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -138,17 +138,35 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       }

       /**
    -   * Checks the validity of data column names. Hive metastore disallows the table to use comma in
    -   * data column names. Partition columns do not have such a restriction. Views do not have such
    -   * a restriction.
    +   * Checks the validity of data column names. Hive metastore disallows the table to use some
    +   * special characters (',', ':', and ';') in data column names. Partition columns do not have
    +   * such a restriction. Views do not have such a restriction.
        */
       private def verifyDataSchema(
           tableName: TableIdentifier, tableType: CatalogTableType, dataSchema: StructType): Unit = {
         if (tableType != VIEW) {
    -      dataSchema.map(_.name).foreach { colName =>
    -        if (colName.contains(",")) {
    -          throw new AnalysisException("Cannot create a table having a column whose name contains " +
    -            s"commas in Hive metastore. Table: $tableName; Column: $colName")
    +      val invalidChars = Seq(",", ":", ";")
    +      def verifyNestedColumnNames(schema: StructType): Unit = schema.foreach { f =>
    +        f.dataType match {
    +          case st: StructType => verifyNestedColumnNames(st)
    +          case _ if invalidChars.exists(f.name.contains) =>
    +            throw new AnalysisException("Cannot create a table having a nested column whose name " +
    +              s"contains invalid characters (${invalidChars.map(c => s"'$c'").mkString(", ")}) " +
    --- End diff --

oh..
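The recursive check in this diff can be tried out without Spark. The sketch below mirrors only the visible part of the diff and collects offending names instead of throwing; the tiny `DataType`, `StructField`, and `StructType` definitions are hypothetical stand-ins for Spark's types, and `NestedNameCheck` is an invented name.

```scala
// Hypothetical, minimal stand-ins for Spark's schema types.
sealed trait DataType
case object StringType extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType

object NestedNameCheck {
  val invalidChars: Seq[String] = Seq(",", ":", ";")

  /** Collects leaf column names, at any nesting depth, that contain an invalid character. */
  def invalidNestedNames(schema: StructType): Seq[String] = schema.fields.flatMap { f =>
    f.dataType match {
      // Recurse into nested structs, as verifyNestedColumnNames does in the diff.
      case st: StructType => invalidNestedNames(st)
      case _ if invalidChars.exists(c => f.name.contains(c)) => Seq(f.name)
      case _ => Seq.empty
    }
  }
}
```

As in the visible part of the diff, a field whose type is itself a struct is only descended into; its own name is not tested in this sketch.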
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93018 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93018/testReport)** for PR 21711 at commit [`fa0233e`](https://github.com/apache/spark/commit/fa0233e78b48aae0caac80d74e7e6dfd061d4c5f).
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/964/ Test PASSed.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93019 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93019/testReport)** for PR 21711 at commit [`9fabeef`](https://github.com/apache/spark/commit/9fabeeff2aba46ea512ad28464b1140cd59f361b).
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/965/ Test PASSed.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93020/testReport)** for PR 21711 at commit [`482a0c0`](https://github.com/apache/spark/commit/482a0c0b15027c6986070c94c0bf3a967206f792). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93021 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93021/testReport)** for PR 21711 at commit [`37c9ce3`](https://github.com/apache/spark/commit/37c9ce325cc5a654b98dba72fd62eaee0539ab5a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/966/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93022/testReport)** for PR 21711 at commit [`424ecba`](https://github.com/apache/spark/commit/424ecba1ea051a254491872e28e30479a48256cb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/967/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769 **[Test build #93023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93023/testReport)** for PR 21769 at commit [`bb1098f`](https://github.com/apache/spark/commit/bb1098f0143d3552d51ab5343e36819850330b81). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93024/testReport)** for PR 21711 at commit [`8a6465b`](https://github.com/apache/spark/commit/8a6465b2a62d8404820872a452682cc464cc37ad). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/968/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21706: [SPARK-24702] Fix Unable to cast to calendar interval in...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21706 hey @gatorsmile, sorry to bother, could you just clarify the above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21711#discussion_r202537869

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -138,17 +138,36 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }

   /**
-   * Checks the validity of data column names. Hive metastore disallows the table to use comma in
-   * data column names. Partition columns do not have such a restriction. Views do not have such
-   * a restriction.
+   * Checks the validity of data column names. Hive metastore disallows the table to use some
+   * special characters (',', ':', and ';') in data column names, including nested column names.
+   * Partition columns do not have such a restriction. Views do not have such a restriction.
    */
   private def verifyDataSchema(
       tableName: TableIdentifier, tableType: CatalogTableType, dataSchema: StructType): Unit = {
     if (tableType != VIEW) {
-      dataSchema.map(_.name).foreach { colName =>
-        if (colName.contains(",")) {
-          throw new AnalysisException("Cannot create a table having a column whose name contains " +
-            s"commas in Hive metastore. Table: $tableName; Column: $colName")
+      val invalidChars = Seq(",", ":", ";")
+      def verifyNestedColumnNames(schema: StructType): Unit = schema.foreach { f =>
+        f.dataType match {
+          case st: StructType => verifyNestedColumnNames(st)
+          case _ if invalidChars.exists(f.name.contains) =>
+            val errMsg = "Cannot create a table having a nested column whose name contains " +
+              s"invalid characters (${invalidChars.map(c => s"'$c'").mkString(", ")}) " +
--- End diff --

This is a weird red highlight...the syntax seems to be correct to me (also, the test passed). Anything you know? @gatorsmile
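For readers following this diff, the recursive check on nested column names can be sketched in isolation. This is a minimal sketch, not the PR's code: `DataType`, `StructField`, and `StructType` below are simplified, hypothetical stand-ins for the Spark classes, and `invalidColumnNames` is an illustrative helper that returns the offending names instead of throwing. Unlike the hunk quoted in this thread, the sketch also checks the names of struct-typed fields themselves.

```scala
object NestedColumnCheck {
  // Hypothetical, simplified stand-ins for Spark's schema types (not the real API).
  sealed trait DataType
  case object StringType extends DataType
  case class StructField(name: String, dataType: DataType)
  case class StructType(fields: Seq[StructField]) extends DataType

  // Characters named in the diff as disallowed in data column names.
  val invalidChars: Seq[String] = Seq(",", ":", ";")

  // Collects every field name, at any nesting depth, that contains an invalid character.
  def invalidColumnNames(schema: StructType): Seq[String] =
    schema.fields.flatMap { f =>
      val self = if (invalidChars.exists(c => f.name.contains(c))) Seq(f.name) else Nil
      val nested = f.dataType match {
        case st: StructType => invalidColumnNames(st) // recurse into nested structs
        case _ => Nil
      }
      self ++ nested
    }
}
```

Returning the list of offending names, rather than failing on the first one, makes it easy to report all problems in a single error message.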
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93025/testReport)** for PR 21711 at commit [`b298522`](https://github.com/apache/spark/commit/b298522947fc70337131cdb6b8d0c1e6299eedd3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/969/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769 **[Test build #93023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93023/testReport)** for PR 21769 at commit [`bb1098f`](https://github.com/apache/spark/commit/bb1098f0143d3552d51ab5343e36819850330b81). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93023/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21657 **[Test build #93015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93015/testReport)** for PR 21657 at commit [`81b3971`](https://github.com/apache/spark/commit/81b397140486fab7f7c2f7dcb15d5a9a62c99845). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93015/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/21766

Just checked out the PR,

```scala
scala> spark.sql("SELECT CAST(1 as NUMERIC)")
res0: org.apache.spark.sql.DataFrame = [CAST(1 AS DECIMAL(10,0)): decimal(10,0)]

scala> spark.sql("SELECT NUMERIC(1)")
org.apache.spark.sql.AnalysisException: Undefined function: 'NUMERIC'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

I imagine some tests could be added here:
- `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DataTypeParserSuite.scala`
- `sql/core/src/test/resources/sql-tests/inputs/`

Do you think it's worth having a separate DataType or just have it as an alias?
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202538924

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -46,7 +47,23 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
   protected def fixedPoint = FixedPoint(SQLConf.get.optimizerMaxIterations)

-  def batches: Seq[Batch] = {
+  protected def postAnalysisBatches: Seq[Batch] = {
+    Batch("Eliminate Distinct", Once, EliminateDistinct) ::
+    // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
+    // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
+    // However, because we also use the analyzer to canonicalized queries (for view definition),
--- End diff --

"to canonicalized" -> "to canonicalize" ?
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539342

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -175,6 +179,35 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
    * Override to provide additional rules for the operator optimization batch.
    */
   def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] = Nil
+
+  override def batches: Seq[Batch] = {
+    val excludedRules =
+      SQLConf.get.optimizerExcludedRules.toSeq.flatMap(_.split(",").map(_.trim).filter(!_.isEmpty))
+    val filteredOptimizationBatches = if (excludedRules.isEmpty) {
+      optimizationBatches
+    } else {
+      optimizationBatches.flatMap { batch =>
+        val filteredRules =
+          batch.rules.filter { rule =>
+            val exclude = excludedRules.contains(rule.ruleName)
+            if (exclude) {
+              logInfo(s"Optimization rule '${rule.ruleName}' is excluded from the optimizer.")
+            }
+            !exclude
+          }
+        if (batch.rules == filteredRules) {
--- End diff --

My understanding is that it is written that way to allow for logging
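The exclusion logic in this hunk (parse a comma-separated config value, then filter each batch's rules by name) can be sketched outside Spark as follows. This is only an illustration under stated assumptions: `Rule` and `Batch` are hypothetical, simplified stand-ins for the optimizer types, and dropping a batch whose rules are all excluded is an assumption, since the quoted diff is cut off before showing how that case is handled.

```scala
object RuleExclusionSketch {
  // Hypothetical, simplified stand-ins for the optimizer's Rule and Batch types.
  case class Rule(ruleName: String)
  case class Batch(name: String, rules: Seq[Rule])

  // Splits a comma-separated config value into trimmed, non-empty rule names.
  def parseExcluded(conf: Option[String]): Seq[String] =
    conf.toSeq.flatMap(_.split(",").map(_.trim).filter(_.nonEmpty))

  // Removes excluded rules from every batch.
  // Assumption: a batch left with no rules is dropped entirely.
  def filterBatches(batches: Seq[Batch], excluded: Seq[String]): Seq[Batch] =
    batches.flatMap { batch =>
      val kept = batch.rules.filterNot(r => excluded.contains(r.ruleName))
      if (kept.isEmpty) None else Some(batch.copy(rules = kept))
    }
}
```

Keeping the parsing separate from the filtering makes each step easy to test on its own; the version in the diff interleaves a log call with the filter, which is the point of the review comment.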
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539784

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -127,6 +127,14 @@ object SQLConf {
     }
   }

+  val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
+    .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
+      "specified by their rule names and separated by comma. It is not guaranteed that all the " +
+      "rules in this configuration will eventually be excluded, as some rules are necessary " +
--- End diff --

I don't understand the optimizer at a low level (I'd be one of those users for whom it is a black box). Do you think it would be feasible to enumerate the rules that cannot be excluded? Maybe even log a WARNING when validating the config parameters if they contain required rules.
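The suggestion in this comment, warning when the config names a rule that cannot be excluded, could look roughly like the sketch below. Everything here is hypothetical: the `nonExcludable` set, the `validate` and `warnings` helpers, and the message text are illustrative only, and the single rule name is taken from the code comment quoted elsewhere in this review (`ComputeCurrentTime` is described there as needed for correctness), not from any actual list in Spark.

```scala
object ExclusionValidationSketch {
  // Assumption: a fixed set of rule names that must always run. Illustrative only.
  val nonExcludable: Set[String] = Set("ComputeCurrentTime")

  // Splits the requested exclusions into (effective, ignored-because-required).
  def validate(requested: Seq[String]): (Seq[String], Seq[String]) =
    requested.partition(name => !nonExcludable.contains(name))

  // One warning message per requested rule that cannot be excluded.
  def warnings(requested: Seq[String]): Seq[String] =
    validate(requested)._2.map(n => s"WARNING: rule '$n' is required and cannot be excluded.")
}
```

A test for the case raised in this review would then assert that a required rule appears in the ignored list and still runs.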
[GitHub] spark pull request #21764: [SPARK-24802] Optimization Rule Exclusion
Github user dmateusp commented on a diff in the pull request: https://github.com/apache/spark/pull/21764#discussion_r202539843

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerRuleExclusionSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf.OPTIMIZER_EXCLUDED_RULES
+
+
+class OptimizerRuleExclusionSuite extends PlanTest {
--- End diff --

Any test case for when a required rule is being passed as a "to be excluded" rule ?
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93019 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93019/testReport)** for PR 21711 at commit [`9fabeef`](https://github.com/apache/spark/commit/9fabeeff2aba46ea512ad28464b1140cd59f361b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93019/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93018 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93018/testReport)** for PR 21711 at commit [`fa0233e`](https://github.com/apache/spark/commit/fa0233e78b48aae0caac80d74e7e6dfd061d4c5f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93018/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93022/testReport)** for PR 21711 at commit [`424ecba`](https://github.com/apache/spark/commit/424ecba1ea051a254491872e28e30479a48256cb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93022/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21589

> AFAIK, we always have num of executor ...

Not in all cases. Databricks clients can create auto-scaling clusters: https://docs.databricks.com/user-guide/clusters/sizing.html#cluster-size-and-autoscaling . For such clusters, we cannot get the size of the cluster in terms of cores via config parameters. We need methods that return the current state of the cluster. Static configs don't work here because they lead to overloaded or underloaded clusters.

> ... and then num of core per executor right?

In general, the number of cores per executor could be different. I don't think it is a good idea to force users to perform complex calculations to get the number of cores available in a cluster.

> maybe we should have the getter factored the same way and probably named and described/documented similarly

@felixcheung I am not sure that our users are so interested in getting a list of cores per executor and calculating the total number of cores by summing the list. It would just complicate the API and the implementation, from my point of view.
[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...
GitHub user liutang123 opened a pull request: https://github.com/apache/spark/pull/21772

[SPARK-24809] [SQL] Serializing LongHashedRelation in executor may result in data error

When the join key is long or int in a broadcast join, Spark uses LongHashedRelation as the broadcast value. For details, see SPARK-14419. However, if the broadcast value is abnormally big, the executor will serialize it to disk, and data will be lost when serializing. A flow chart: [see](http://oi67.tinypic.com/2z5pzs7.jpg)

## What changes were proposed in this pull request?

Write cursor instead when serializing and set the cursor value when deserializing.

## How was this patch tested?

Manual test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liutang123/spark SPARK-24809

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21772.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21772

commit a72fe61863e119c0e902cef3054d9140b6d04f77
Author: liulijia
Date: 2018-07-15T11:24:55Z

    [SPARK-24809] [SQL] Serializing LongHashedRelation in executor may result in data error
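To make the idea in this description concrete (persist the cursor on write, restore it on read), here is a minimal, self-contained sketch. It is not the PR's code and does not use Spark's classes: `Buffer`, `read`, and `roundTrip` are hypothetical names, and the assumption is simply that the used region of a page is delimited by a cursor, so an object deserialized without its cursor would write nothing if it were serialized a second time.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CursorRoundTripSketch {
  // Hypothetical append-only buffer: `cursor` marks the end of the used region of `page`.
  final class Buffer(val page: Array[Byte], val cursor: Int) {
    def write(out: DataOutputStream): Unit = {
      out.writeInt(cursor)        // persist the cursor itself
      out.write(page, 0, cursor)  // then only the used bytes
    }
  }

  // Reads the cursor first, then the used bytes, and restores the cursor on the new
  // object so that a later write() still sees the data.
  def read(in: DataInputStream): Buffer = {
    val cursor = in.readInt()
    val page = new Array[Byte](cursor)
    in.readFully(page)
    new Buffer(page, cursor)
  }

  // Serializes to an in-memory byte array and deserializes again.
  def roundTrip(b: Buffer): Buffer = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    b.write(out)
    out.flush()
    read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
  }
}
```

Applying `roundTrip` twice mimics a value that is deserialized on an executor and then serialized again (for example, when spilled to disk); with the cursor restored, the second pass keeps the data.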
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93025/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93025/testReport)** for PR 21711 at commit [`b298522`](https://github.com/apache/spark/commit/b298522947fc70337131cdb6b8d0c1e6299eedd3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21772 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21769#discussion_r202541358

--- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -680,12 +689,22 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
       Files.createFile(new File(tempSaveDir, "non-avro").toPath)

-      val newDf = spark
-        .read
-        .option(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
-        .avro(tempSaveDir)
+      val count = try {
--- End diff --

Nit: consider writing the `try...finally` like this:

```
val hadoopConf = spark.sqlContext.sparkContext.hadoopConfiguration
try {
  hadoopConf.set(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, "true")
  val count = spark.read.avro(tempSaveDir).count()
  assert(count == 8)
} finally {
  hadoopConf.unset(AvroFileFormat.IgnoreFilesWithoutExtensionProperty)
}
```
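The set-then-unset pattern in this suggestion generalizes to a small helper that scopes a config change to a block. The sketch below is an assumption-laden illustration, not code from the PR: `withConf` is a hypothetical helper, and a `mutable.Map` stands in for the Hadoop configuration so the example runs without Spark. Unlike the snippet in the review, it restores the previous value if one existed instead of always unsetting the key.

```scala
import scala.collection.mutable

object TempConfSketch {
  // Sets `key` to `value` while `body` runs, then restores the previous state,
  // even if `body` throws.
  def withConf[T](conf: mutable.Map[String, String], key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf.update(key, value)
    try {
      body
    } finally {
      previous match {
        case Some(old) => conf.update(key, old) // key existed before: put the old value back
        case None => conf.remove(key)           // key did not exist: unset it again
      }
    }
  }
}
```

A test would wrap only the read and the assertion in the block, so the setting cannot leak into later tests that share the same configuration object.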
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93020/testReport)** for PR 21711 at commit [`482a0c0`](https://github.com/apache/spark/commit/482a0c0b15027c6986070c94c0bf3a967206f792). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93020/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93021 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93021/testReport)** for PR 21711 at commit [`37c9ce3`](https://github.com/apache/spark/commit/37c9ce325cc5a654b98dba72fd62eaee0539ab5a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93021/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed.
[GitHub] spark issue #21762: [SPARK-24800][SQL] Refactor Avro Serializer and Deserial...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21762 **[Test build #93026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93026/testReport)** for PR 21762 at commit [`aa5e79e`](https://github.com/apache/spark/commit/aa5e79ec67fbe0be54678a75258cb1b02cf5c9c1).
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21556 **[Test build #93017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93017/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaSummarizerExample ` * ` class SerializableConfiguration(@transient var value: Configuration)` * ` class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)` * ` case class SchemaType(dataType: DataType, nullable: Boolean)` * ` implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) ` * ` implicit class AvroDataFrameReader(reader: DataFrameReader) ` * `class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],` * `trait ComplexTypeMergingExpression extends Expression ` * `case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes ` * `abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast ` * `case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike `
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Merged build finished. Test PASSed.
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93017/ Test PASSed.
[GitHub] spark issue #21762: [SPARK-24800][SQL] Refactor Avro Serializer and Deserial...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21762 Merged build finished. Test PASSed.
[GitHub] spark issue #21762: [SPARK-24800][SQL] Refactor Avro Serializer and Deserial...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21762 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/970/ Test PASSed.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21711 **[Test build #93024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93024/testReport)** for PR 21711 at commit [`8a6465b`](https://github.com/apache/spark/commit/8a6465b2a62d8404820872a452682cc464cc37ad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Merged build finished. Test PASSed.
[GitHub] spark issue #21711: [SPARK-24681][SQL] Verify nested column names in Hive me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93024/ Test PASSed.
[GitHub] spark issue #21762: [SPARK-24800][SQL] Refactor Avro Serializer and Deserial...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21762 **[Test build #93026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93026/testReport)** for PR 21762 at commit [`aa5e79e`](https://github.com/apache/spark/commit/aa5e79ec67fbe0be54678a75258cb1b02cf5c9c1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21762: [SPARK-24800][SQL] Refactor Avro Serializer and Deserial...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21762 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93026/ Test PASSed.