[GitHub] spark pull request #20631: [SPARK-23454][SS][DOCS] Added trigger information...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20631#discussion_r168941119 --- Diff: docs/structured-streaming-programming-guide.md --- @@ -1979,6 +2004,157 @@ which has methods that get called whenever there is a sequence of rows generated - Whenever `open` is called, `close` will also be called (unless the JVM exits due to some error). This is true even if `open` returns false. If there is any error in processing and writing the data, `close` will be called with the error. It is your responsibility to clean up state (e.g. connections, transactions, etc.) that have been created in `open` such that there are no resource leaks. + Triggers +The trigger settings of a streaming query define the timing of streaming data processing, whether +the query is going to be executed as a micro-batch query with a fixed batch interval or as a continuous processing query. +Here are the different kinds of triggers that are supported. + + + +Trigger Type +Description + + +unspecified (default) + +If no trigger setting is explicitly specified, then by default, the query will be +executed in micro-batch mode, where micro-batches will be generated as soon as +the previous micro-batch has completed processing. + + + +Fixed interval micro-batches + +The query will be executed in micro-batch mode, where micro-batches will be kicked off +at the user-specified intervals. + + If the previous micro-batch completes within the interval, then the engine will wait until + the interval is over before kicking off the next micro-batch. + + If the previous micro-batch takes longer than the interval to complete (i.e. if an + interval boundary is missed), then the next micro-batch will start as soon as the + previous one completes (i.e., it will not wait for the next interval boundary). + + If no new data is available, then no micro-batch will be kicked off. 
+ + + + +One-time micro-batch + +The query will execute *only one* micro-batch to process all the available data and then +stop on its own. This is useful in scenarios where you want to periodically spin up a cluster, +process everything that is available since the last period, and then shut down the +cluster. In some cases, this may lead to significant cost savings. + + + +Continuous with fixed checkpoint interval (experimental) + +The query will be executed in the new low-latency, continuous processing mode. Read more +about this in the Continuous Processing section below. + + + + +Here are a few code examples. + + + + +{% highlight scala %} +import org.apache.spark.sql.streaming.Trigger + +// Default trigger (runs micro-batch as soon as it can) +df.writeStream + .format("console") + .start() + +// ProcessingTime trigger with two-second micro-batch interval +df.writeStream + .format("console") + .trigger(Trigger.ProcessingTime("2 seconds")) + .start() + +// One-time trigger +df.writeStream + .format("console") + .trigger(Trigger.Once()) + .start() + +// Continuous trigger with one-second checkpointing interval +df.writeStream + .format("console") + .trigger(Trigger.Continuous("1 second")) + .start() + +{% endhighlight %} + + + + + +{% highlight java %} +import org.apache.spark.sql.streaming.Trigger; + +// Default trigger (runs micro-batch as soon as it can) +df.writeStream() + .format("console") + .start(); + +// ProcessingTime trigger with two-second micro-batch interval +df.writeStream() + .format("console") + .trigger(Trigger.ProcessingTime("2 seconds")) + .start(); + +// One-time trigger +df.writeStream() + .format("console") + .trigger(Trigger.Once()) + .start(); + +// Continuous trigger with one-second checkpointing interval +df.writeStream() + .format("console") + .trigger(Trigger.Continuous("1 second")) + .start(); + +{% endhighlight %} + + + + +{% highlight python %} + +# Default trigger (runs micro-batch as soon as it can) +df.writeStream \ + .format("console") \ + .start() + +# 
ProcessingTime trigger with two-second micro-batch interval +df.writeStream \ + .format("console") \ + .trigger(processingTime='2 seconds') \ + .start() + +# One-time trigger +df.writeStream \ + .format("console") \ + .trigge
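The fixed-interval rules quoted in the diff above (wait for the interval boundary if the previous micro-batch finished in time, otherwise start as soon as it completes) can be sketched as a small timing function. This is only a model of the documented behavior, not Spark's scheduler; `FixedIntervalModel` and `nextStartMs` are hypothetical names, and the sketch assumes new data is always available.

```scala
// Hypothetical model of the documented fixed-interval rules; not Spark code.
object FixedIntervalModel {
  /** Time (ms) at which the next micro-batch would start.
    *
    * @param prevStartMs when the previous micro-batch started
    * @param prevEndMs   when the previous micro-batch finished
    * @param intervalMs  the user-specified trigger interval
    */
  def nextStartMs(prevStartMs: Long, prevEndMs: Long, intervalMs: Long): Long = {
    require(intervalMs > 0, "interval must be positive")
    require(prevEndMs >= prevStartMs, "a batch cannot end before it starts")
    val boundary = prevStartMs + intervalMs
    // Finished within the interval: wait for the boundary.
    // Overran the interval: start immediately after the previous batch.
    math.max(boundary, prevEndMs)
  }
}
```

With a two-second interval, a batch that takes 500 ms is followed by one at the 2000 ms mark, while a batch that takes 3500 ms is followed by one at 3500 ms.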
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87534/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/953/ Test PASSed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test PASSed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20619 retest this please
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test FAILed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87533/ Test FAILed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87533/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/20442 Thanks for the comments. I am in China now for Chinese New Year. Will address the comments when I get back to work on 2/21.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20387 Merged build finished. Test PASSed.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87532/ Test PASSed.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20387 **[Test build #87532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87532/testReport)** for PR 20387 at commit [`1a603db`](https://github.com/apache/spark/commit/1a603dbe5528b447bff371d2e00abdbdee664a75). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87533/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369).
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test PASSed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/952/ Test PASSed.
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r168936300 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -640,4 +689,96 @@ private object RandomForestSuite { val (indices, values) = map.toSeq.sortBy(_._1).unzip Vectors.sparse(size, indices.toArray, values.toArray) } + + /** Generate a label. */ + private def generateLabel(rnd: Random, numClasses: Int): Double = { +rnd.nextInt(numClasses) + } + + /** Generate a numeric value in the range [numericMin, numericMax]. */ + private def generateNumericValue(rnd: Random, numericMin: Double, numericMax: Double) : Double = { +rnd.nextDouble() * (Math.abs(numericMax) + Math.abs(numericMin)) + numericMin + } + + /** Generate a binary value. */ + private def generateBinaryValue(rnd: Random) : Double = if (rnd.nextBoolean()) 1 else 0 + + /** Generate an array of binary values of length numBinary. */ + private def generateBinaryArray(rnd: Random, numBinary: Int): Array[Double] = { +Range.apply(0, numBinary).map(_ => generateBinaryValue(rnd)).toArray + } + + /** Generate an array of numeric values of length numNumeric in the range +* [numericMin, numericMax]. */ + private def generateNumericArray(rnd: Random, + numNumeric: Int, + numericMin: Double, + numericMax: Double) : Array[Double] = { +Range.apply(0, numNumeric).map(_ => generateNumericValue(rnd, numericMin, numericMax)).toArray + } + + /** Generate a LabeledPoint with numNumeric numeric values followed by numBinary binary values. 
*/ + private def generatePoint(rnd: Random, +numClasses: Int, +numNumeric: Int = 10, +numericMin: Double = -100, +numericMax: Double = 100, +numBinary: Int = 10): LabeledPoint = { +val label = generateLabel(rnd, numClasses) +val numericArray = generateNumericArray(rnd, numNumeric, numericMin, numericMax) +val binaryArray = generateBinaryArray(rnd, numBinary) +val vector = Vectors.dense(numericArray ++ binaryArray) + +new LabeledPoint(label, vector) + } + + /** Data for tree redundancy tests which produces a non-trivial tree. */ + private def generateRedundantPoints(sc: SparkContext, + numClasses: Int, + numPoints: Int, + rnd: Random): RDD[LabeledPoint] = { +sc.parallelize(Range.apply(1, numPoints) + .map(_ => generatePoint(rnd, numClasses = numClasses, numNumeric = 0, numBinary = 3))) + } + + /** +* Returns true iff the decision tree has at least one subtree that can be pruned +* (i.e., all its leaves share the same prediction). +* @param tree the tree to be tested +* @return true iff the decision tree has at least one subtree that can be pruned. +*/ + private def isRedundant(tree: DecisionTreeModel): Boolean = _isRedundant(tree.rootNode) + + /** +* Returns true iff the node has at least one subtree that can be pruned +* (i.e., all its leaves share the same prediction). +* @param n the node to be tested +* @return true iff the node has at least one subtree that can be pruned. +*/ + private def _isRedundant(n: Node): Boolean = n match { +case n: InternalNode => + _isRedundant(n.leftChild) || _isRedundant(n.rightChild) || canBePruned(n) +case _ => false + } + + /** +* Returns true iff the subtree rooted at the given node can be pruned +* (i.e., all its leaves share the same prediction). +* @param n the node to be tested +* @return returns true iff the subtree rooted at the given node can be pruned. 
+*/ + private def canBePruned(n: Node): Boolean = n match { +case n: InternalNode => + (leafPredictions(n.leftChild) ++ leafPredictions(n.rightChild)).size == 1 +case _ => false + } + + /** +* Given a node, the method returns the set of predictions appearing in the subtree rooted at it. +* @return the set of predictions appearing in the subtree rooted at the given node. +*/ + private def leafPredictions(n: Node): Set[Double] = n match { --- End diff -- I suppose I haven't read this code thoroughly, but rather than write similar logic differently in the test, to test the original logic, why not just construct cases with known correct answers and comp
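The reviewer's suggestion above, building inputs whose correct output is known in advance, might look roughly like the following. This is a sketch only: `Tree`, `Leaf`, `Internal`, and `prune` are hypothetical stand-ins defined here for illustration, not the `Node` classes or the pruning logic from the PR.

```scala
// Hypothetical stand-in types; the PR's real classes are not used here.
object PruningCases {
  sealed trait Tree
  final case class Leaf(prediction: Double) extends Tree
  final case class Internal(left: Tree, right: Tree) extends Tree

  /** Collapses every subtree whose leaves all share one prediction into a single leaf. */
  def prune(t: Tree): Tree = t match {
    case Internal(l, r) =>
      (prune(l), prune(r)) match {
        // Both children reduced to leaves with the same prediction: merge them.
        case (Leaf(a), Leaf(b)) if a == b => Leaf(a)
        case (pl, pr) => Internal(pl, pr)
      }
    case leaf => leaf
  }
}
```

Each test case is then a hand-written tree paired with the tree it should become, for example `Internal(Internal(Leaf(0.0), Leaf(0.0)), Leaf(1.0))` pruning to `Internal(Leaf(0.0), Leaf(1.0))`, so the test does not need to re-implement the redundancy check.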
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r168936292 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala --- @@ -287,6 +292,41 @@ private[tree] class LearningNode( } } + /** + * Method testing whether a node is a leaf. + * @return true iff a node is a leaf. + */ + private def isLeafNode(): Boolean = leftChild.isEmpty && rightChild.isEmpty + + /** True iff the node should be a leaf. */ + private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1 + + /** + * Returns the set of (leaf) predictions appearing in the subtree rooted at the considered node. + * @return the set of (leaf) predictions appearing in the subtree rooted at the given node. + */ + private def leafPredictions: Set[Double] = { --- End diff -- Yeah, and short-circuiting involves an `if` statement anyway, so maybe no value in my previous suggestion about `foreach`. Now this saves the whole `Set[Double]` of predictions when it's never used again. It's probably worth not hogging that memory. The method here doesn't actually need to form the set of all predictions, just decide whether there's more than 1. For example, in the case that there are no distinct values after evaluating children, no point in adding the leaf's prediction to it just to decide it's got 1 unique value.
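One way to decide whether a subtree has a single prediction without materializing the full set is to return an `Option` and stop descending once two leaves disagree. The sketch below uses hypothetical `Tree`, `Leaf`, and `Internal` types rather than the PR's `LearningNode`, so it illustrates the idea only.

```scala
// Hypothetical stand-in types; not the LearningNode class from the PR.
object SinglePrediction {
  sealed trait Tree
  final case class Leaf(prediction: Double) extends Tree
  final case class Internal(left: Tree, right: Tree) extends Tree

  /** Some(p) if every leaf under `t` predicts p; None once two leaves disagree. */
  def uniquePrediction(t: Tree): Option[Double] = t match {
    case Leaf(p) => Some(p)
    case Internal(l, r) =>
      uniquePrediction(l) match {
        // Left subtree is already mixed, so the right subtree is never visited.
        case None => None
        case some @ Some(p) =>
          if (uniquePrediction(r).contains(p)) some else None
      }
  }
}
```

Only one `Double` per recursion level is kept, instead of a `Set[Double]` per node.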
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 The final failure is unrelated to this PR. ``` org.apache.spark.sql.sources.CreateTableAsSelectSuite.(It is not a test it is a sbt.testing.SuiteSelector) ```
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Retest this please.
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r168936195 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -640,4 +689,96 @@ private object RandomForestSuite { val (indices, values) = map.toSeq.sortBy(_._1).unzip Vectors.sparse(size, indices.toArray, values.toArray) } + + /** Generate a label. */ + private def generateLabel(rnd: Random, numClasses: Int): Double = { +rnd.nextInt(numClasses) + } + + /** Generate a numeric value in the range [numericMin, numericMax]. */ + private def generateNumericValue(rnd: Random, numericMin: Double, numericMax: Double) : Double = { +rnd.nextDouble() * (Math.abs(numericMax) + Math.abs(numericMin)) + numericMin --- End diff -- Oh right, disregard that. I had in mind that they were ints.
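For reference, the expression in the diff scales by `|numericMax| + |numericMin|`, which equals the width of the range only when `numericMin <= 0 <= numericMax` (as with the defaults of -100 and 100 used later in the same diff). With, say, a minimum of 5 and a maximum of 10 it would yield values in [5, 20). A sketch that holds for any ordered pair of bounds is shown below; `NumericRange` and `uniform` are hypothetical names, not part of the PR.

```scala
import scala.util.Random

// Hypothetical helper for illustration; not code from the PR.
object NumericRange {
  /** Draws a value between `min` and `max` by scaling with the true width (max - min). */
  def uniform(rnd: Random, min: Double, max: Double): Double = {
    require(min <= max, "min must not exceed max")
    min + rnd.nextDouble() * (max - min)
  }
}
```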
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r168936124 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala --- @@ -287,6 +292,41 @@ private[tree] class LearningNode( } } + /** + * Method testing whether a node is a leaf. + * @return true iff a node is a leaf. + */ + private def isLeafNode(): Boolean = leftChild.isEmpty && rightChild.isEmpty + + /** True iff the node should be a leaf. */ + private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1 + + /** + * Returns the set of (leaf) predictions appearing in the subtree rooted at the considered node. + * @return the set of (leaf) predictions appearing in the subtree rooted at the given node. + */ + private def leafPredictions: Set[Double] = { + +val predBuffer = new scala.collection.mutable.HashSet[Double] + +// collect the (leaf) predictions in the left subtree, if any +if (leftChild.isDefined) { --- End diff -- No, I'm just thinking you use `Option.foreach(...)` twice. It's a matter of taste I guess. I tend to prefer avoiding the `isDefined` and `get` calls for simple one-liners.
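The two styles being compared can be put side by side. The sketch below simplifies the children to `Option[Double]` values (in the PR they are optional nodes), and `OptionStyles` with its two methods is a hypothetical name used only for illustration.

```scala
import scala.collection.mutable

// Simplified illustration: children are modeled as Option[Double], not nodes.
object OptionStyles {
  /** Style used in the diff: explicit isDefined checks followed by get. */
  def collectWithIsDefined(left: Option[Double], right: Option[Double]): Set[Double] = {
    val buf = mutable.HashSet.empty[Double]
    if (left.isDefined) {
      buf += left.get
    }
    if (right.isDefined) {
      buf += right.get
    }
    buf.toSet
  }

  /** Style suggested in the review: Option.foreach runs only when a value is present. */
  def collectWithForeach(left: Option[Double], right: Option[Double]): Set[Double] = {
    val buf = mutable.HashSet.empty[Double]
    left.foreach(buf += _)
    right.foreach(buf += _)
    buf.toSet
  }
}
```

Both methods return the same set; the second avoids any chance of calling `get` on an empty `Option`.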
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168935309 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -407,6 +407,34 @@ object PartitioningUtils { Literal(bigDecimal) } +val dateTry = Try { + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // DateType + DateTimeUtils.getThreadLocalDateFormat.parse(raw) + // SPARK-23436: Casting the string to date may still return null if a bad Date is provided. + // This can happen since DateFormat.parse may not use the entire text of the given string: + // so if there are extra-characters after the date, it returns correctly. + // We need to check that we can cast the raw string since we later can use Cast to get + // the partition values with the right DataType (see + // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning) + val dateValue = Cast(Literal(raw), DateType).eval() + // Disallow DateType if the cast returned null + require(dateValue != null) + Literal.create(dateValue, DateType) +} + +val timestampTry = Try { + val unescapedRaw = unescapePathName(raw) + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // TimestampType + DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw) --- End diff -- `inferPartitioning` will use `PartitioningUtils.parsePartitions` to infer the partition spec if there is no `userPartitionSchema`. It is used by `DataSource.sourceSchema`. It seems this change makes partition directories that could previously be parsed now fail to parse. Will it change the behavior of other code paths?
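The comment in the diff notes that `DateFormat.parse` may not use the entire text of the given string. The sketch below shows that behavior with plain `java.text.SimpleDateFormat` and one way to require that the whole string is consumed. It is independent of Spark: the `yyyy-MM-dd` pattern, the `StrictDateParse` object, and both method names are assumptions made for illustration, and the PR itself relies on `Cast` rather than this check.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import java.util.Locale
import scala.util.Try

// Illustration only; pattern and names are assumptions, not Spark internals.
object StrictDateParse {
  // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
  private def newFormat(): SimpleDateFormat = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
    fmt.setLenient(false)
    fmt
  }

  /** True if parse(String) succeeds, even when trailing characters are ignored. */
  def prefixParses(raw: String): Boolean =
    Try(newFormat().parse(raw)).isSuccess

  /** True only if the date pattern consumes the whole input string. */
  def fullyParses(raw: String): Boolean = {
    val pos = new ParsePosition(0)
    val parsed = newFormat().parse(raw, pos)
    parsed != null && pos.getIndex == raw.length
  }
}
```

For an input such as `2018-01-01-04`, `prefixParses` accepts the leading date and ignores the `-04` suffix, while `fullyParses` rejects it.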
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87531/ Test PASSed.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Merged build finished. Test PASSed.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87531/testReport)** for PR 20057 at commit [`0e646b2`](https://github.com/apache/spark/commit/0e646b20f0be1580cfb8565908c98b57ec248941). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20387 **[Test build #87532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87532/testReport)** for PR 20387 at commit [`1a603db`](https://github.com/apache/spark/commit/1a603dbe5528b447bff371d2e00abdbdee664a75).
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 Thanks for the update! Enjoy your vacation, and thanks for letting me know.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20387 Merged build finished. Test PASSed.
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/951/ Test PASSed.
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r168933227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala --- @@ -17,130 +17,55 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeMap, AttributeSet, Expression, NamedExpression, PredicateHelper} -import org.apache.spark.sql.catalyst.optimizer.RemoveRedundantProject +import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet} +import org.apache.spark.sql.catalyst.planning.PhysicalOperation import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} import org.apache.spark.sql.catalyst.rules.Rule -import org.apache.spark.sql.execution.datasources.DataSourceStrategy -import org.apache.spark.sql.sources -import org.apache.spark.sql.sources.v2.reader._ -/** - * Pushes down various operators to the underlying data source for better performance. Operators are - * being pushed down with a specific order. As an example, given a LIMIT has a FILTER child, you - * can't push down LIMIT if FILTER is not completely pushed down. When both are pushed down, the - * data source should execute FILTER before LIMIT. And required columns are calculated at the end, - * because when more operators are pushed down, we may need less columns at Spark side. - */ -object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHelper { - override def apply(plan: LogicalPlan): LogicalPlan = { -// Note that, we need to collect the target operator along with PROJECT node, as PROJECT may -// appear in many places for column pruning. -// TODO: Ideally column pruning should be implemented via a plan property that is propagated -// top-down, then we can simplify the logic here and only collect target operators. 
-val filterPushed = plan transformUp { - case FilterAndProject(fields, condition, r @ DataSourceV2Relation(_, reader)) => -val (candidates, nonDeterministic) = - splitConjunctivePredicates(condition).partition(_.deterministic) - -val stayUpFilters: Seq[Expression] = reader match { - case r: SupportsPushDownCatalystFilters => -r.pushCatalystFilters(candidates.toArray) - - case r: SupportsPushDownFilters => -// A map from original Catalyst expressions to corresponding translated data source -// filters. If a predicate is not in this map, it means it cannot be pushed down. -val translatedMap: Map[Expression, sources.Filter] = candidates.flatMap { p => - DataSourceStrategy.translateFilter(p).map(f => p -> f) -}.toMap - -// Catalyst predicate expressions that cannot be converted to data source filters. -val nonConvertiblePredicates = candidates.filterNot(translatedMap.contains) - -// Data source filters that cannot be pushed down. An unhandled filter means -// the data source cannot guarantee the rows returned can pass the filter. -// As a result we must return it so Spark can plan an extra filter operator. -val unhandledFilters = r.pushFilters(translatedMap.values.toArray).toSet -val unhandledPredicates = translatedMap.filter { case (_, f) => - unhandledFilters.contains(f) -}.keys - -nonConvertiblePredicates ++ unhandledPredicates - - case _ => candidates -} - -val filterCondition = (stayUpFilters ++ nonDeterministic).reduceLeftOption(And) -val withFilter = filterCondition.map(Filter(_, r)).getOrElse(r) -if (withFilter.output == fields) { - withFilter -} else { - Project(fields, withFilter) -} -} - -// TODO: add more push down rules. - -val columnPruned = pushDownRequiredColumns(filterPushed, filterPushed.outputSet) -// After column pruning, we may have redundant PROJECT nodes in the query plan, remove them. 
-RemoveRedundantProject(columnPruned) - } - - // TODO: nested fields pruning - private def pushDownRequiredColumns( - plan: LogicalPlan, requiredByParent: AttributeSet): LogicalPlan = { -plan match { - case p @ Project(projectList, child) => -val required = projectList.flatMap(_.references) -p.copy(child = pushDownRequiredColumns(child, AttributeSet(required))) - - case f @ Filter(condition, child) => -val required = requiredByParent
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Merged build finished. Test PASSed.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87526/ Test PASSed.
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20621 **[Test build #87526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test FAILed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87528/ Test FAILed.
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87528/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87525/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20634 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20634 **[Test build #87525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87525/testReport)** for PR 20634 at commit [`bde6818`](https://github.com/apache/spark/commit/bde681816bc8251010aa14fd0f4fb27b732fd061). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87527/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87527/testReport)** for PR 20619 at commit [`e08d06c`](https://github.com/apache/spark/commit/e08d06c0e6c0cf23178d12baaa5eb00d55f9b456). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87531/testReport)** for PR 20057 at commit [`0e646b2`](https://github.com/apache/spark/commit/0e646b20f0be1580cfb8565908c98b57ec248941). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168930195 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala --- @@ -85,15 +85,24 @@ private object PostgresDialect extends JdbcDialect { s"SELECT 1 FROM $table LIMIT 1" } + override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + /** - * The SQL query used to truncate a table. For Postgres, the default behaviour is to - * also truncate any descendant tables. As this is a (possibly unwanted) side-effect, - * the Postgres dialect adds 'ONLY' to truncate only the table in question - * @param table The name of the table. - * @return The SQL query to use for truncating a table - */ - override def getTruncateQuery(table: String): String = { -s"TRUNCATE TABLE ONLY $table" + * The SQL query used to truncate a table. For Postgres, the default behaviour is to + * also truncate any descendant tables. As this is a (possibly unwanted) side-effect, + * the Postgres dialect adds 'ONLY' to truncate only the table in question + * @param table The table to truncate + * @param cascade Whether or not to cascade the truncation. Default value is the value of + *isCascadingTruncateTable() + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery( + table: String, + cascade: Option[Boolean] = isCascadingTruncateTable): String = { +cascade match { + case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE" --- End diff -- I believe you already tested this in PostgreSQL. To make sure that `ONLY` and `CASCADE` can be used at the same time, could you share the result of a manual test with PostgreSQL tables, descendant tables, and foreign-key referenced tables here?
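For readers following the thread, here is a minimal sketch of the query construction under discussion, written as a standalone function rather than the real `PostgresDialect` override. The object name `PostgresTruncateSketch` and the function `truncateQuery` are hypothetical, and the fallback branch is an assumption because the quoted diff is cut off after the `Some(true)` case. PostgreSQL's `TRUNCATE` grammar accepts `ONLY` before the table name and `CASCADE` after it in a single statement; how the combination behaves with descendant and foreign-key referenced tables is what the requested manual test would confirm.

```scala
// Hypothetical standalone sketch; not the actual PostgresDialect code.
object PostgresTruncateSketch {
  // Build a PostgreSQL TRUNCATE statement.
  // ONLY limits truncation to the named table (descendant tables are skipped);
  // CASCADE also truncates tables that reference it through foreign keys.
  def truncateQuery(table: String, cascade: Option[Boolean]): String =
    cascade match {
      case Some(true) => s"TRUNCATE TABLE ONLY $table CASCADE"
      // Assumption: anything other than an explicit `true` yields the non-cascading form.
      case _ => s"TRUNCATE TABLE ONLY $table"
    }

  def main(args: Array[String]): Unit = {
    println(truncateQuery("orders", Some(true)))
    println(truncateQuery("orders", None))
  }
}
```

Keeping the string construction in one small pure function like this makes it easy to unit-test the generated SQL separately from any database connection.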
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87530 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87530/testReport)** for PR 20057 at commit [`7e09f90`](https://github.com/apache/spark/commit/7e09f90ab5625d95d060949a371f03afb50b15cb). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87530/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87530/testReport)** for PR 20057 at commit [`7e09f90`](https://github.com/apache/spark/commit/7e09f90ab5625d95d060949a371f03afb50b15cb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 Thanks again @dongjoon-hyun, much appreciated. I've made changes accordingly, and added comments in this PR where appropriate.
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user danielvdende commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168929874 --- Diff: docs/sql-programming-guide.md --- @@ -1372,6 +1372,13 @@ the following case-insensitive options: This is a JDBC writer related option. When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to false. This option applies only to writing. + + +cascadeTruncate + +This is a JDBC writer related option. If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this options allows execution of a TRUNCATE TABLE t CASCADE. This will affect other tables, and thus should be used with care. This option applies only to writing. It defaults to the default cascading truncate behaviour of the JDBC database in question, specified in the isCascadeTruncate in each JDBCDialect. --- End diff -- Yeah, so the "problem" there is that PostgreSQL has inheritance across tables (as we discovered in [SPARK-22729](https://github.com/apache/spark/pull/19911)). To be completely transparent to users, I think the `ONLY` part of the query for PostgreSQL should also be configurable, but I think that's outside the scope of this PR. I've added a comment saying that for PostgreSQL an `ONLY` is added to prevent descendant tables from being truncated.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87529/testReport)** for PR 20057 at commit [`b79becd`](https://github.com/apache/spark/commit/b79becd188f67202dd23796456c512e544cd). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87529/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user danielvdende commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168929798 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala --- @@ -94,5 +94,21 @@ private case object OracleDialect extends JdbcDialect { case _ => value } + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable() + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery( --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user danielvdende commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168929792 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala --- @@ -42,4 +42,17 @@ private object MsSqlServerDialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for MsSql as it is unsupported. --- End diff -- Hmm, I disagree with this. The reason I added it was to show users that even if they set `cascadeTruncate` to `true`, it doesn't take effect for some databases (like MsSQL). The `isCascadingTruncateTable` only shows the default behaviour, which for all databases at the moment is false.
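The distinction drawn in this comment, between a dialect's default cascading behaviour and a per-call cascade request that some databases cannot honour, can be sketched as follows. All names here (`SketchDialect`, `NoCascadeDialect`, `CascadeCapableDialect`, `DialectSketchDemo`) are hypothetical stand-ins and not Spark's `JdbcDialect` API; the generated SQL strings are illustrative assumptions.

```scala
// Hypothetical stand-ins for JDBC dialects; not Spark's JdbcDialect API.
trait SketchDialect {
  // The database's default cascading behaviour, independent of any user option.
  def isCascadingTruncateTable: Option[Boolean] = Some(false)

  // Build the truncate statement; `cascade` is the user's explicit request, if any.
  def getTruncateQuery(table: String, cascade: Option[Boolean]): String
}

// A database without TRUNCATE ... CASCADE support: the flag is accepted but ignored.
object NoCascadeDialect extends SketchDialect {
  def getTruncateQuery(table: String, cascade: Option[Boolean]): String =
    s"TRUNCATE TABLE $table"
}

// A database that supports cascading: the flag changes the generated SQL.
object CascadeCapableDialect extends SketchDialect {
  def getTruncateQuery(table: String, cascade: Option[Boolean]): String =
    cascade match {
      case Some(true) => s"TRUNCATE TABLE $table CASCADE"
      case _ => s"TRUNCATE TABLE $table"
    }
}

object DialectSketchDemo {
  def main(args: Array[String]): Unit = {
    val requested: Option[Boolean] = Some(true)
    // Same request, different outcome depending on what the database supports.
    println(NoCascadeDialect.getTruncateQuery("t", requested))
    println(CascadeCapableDialect.getTruncateQuery("t", requested))
  }
}
```

Both sketch dialects report `Some(false)` as their default, yet only one of them reacts to the explicit request, which is the behaviour the doc comment under review was trying to surface.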
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87529/testReport)** for PR 20057 at commit [`b79becd`](https://github.com/apache/spark/commit/b79becd188f67202dd23796456c512e544cd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87528/testReport)** for PR 20619 at commit [`8bd02d8`](https://github.com/apache/spark/commit/8bd02d8692708ab58e31e19a3682af3a94550369). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/950/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20619: [SPARK-23457][SQL] Register task completion liste...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20619#discussion_r168929352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -395,16 +395,21 @@ class ParquetFileFormat ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get) } val taskContext = Option(TaskContext.get()) - val parquetReader = if (enableVectorizedReader) { + if (enableVectorizedReader) { val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) +val iter = new RecordReaderIterator(vectorizedReader) +// SPARK-23457 Register a task completion lister before `initialization`. --- End diff -- Now, `SPARK-23457` is added. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Oh, @kiszk. The following really meant a `comment` in the code. Sorry, I misunderstood. > Would it be worth to add this JIRA number in a comment as we did for ORC?
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20619 **[Test build #87527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87527/testReport)** for PR 20619 at commit [`e08d06c`](https://github.com/apache/spark/commit/e08d06c0e6c0cf23178d12baaa5eb00d55f9b456). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/949/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20619 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Thank you, @kiszk . I added SPARK-23390 in the PR description. > Would it be worth to add this JIRA number in a comment as we did for ORC? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/948/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20621 **[Test build #87526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87526/testReport)** for PR 20621 at commit [`8698f4d`](https://github.com/apache/spark/commit/8698f4de7b6e1b9453a12bc949ce8666d6322a87). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20621: [SPARK-23436][SQL] Infer partition as Date only if it ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20621 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20619 LGTM with one minor comment --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Thank you for the review, @mgaido91.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20057 I finished my second round. Could you update once more? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168928044 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala --- @@ -120,11 +121,13 @@ abstract class JdbcDialect extends Serializable { * The SQL query that should be used to truncate a table. Dialects can override this method to * return a query that is suitable for a particular database. For PostgreSQL, for instance, * a different query is used to prevent "TRUNCATE" affecting other tables. - * @param table The name of the table. + * @param table The table to truncate + * @param cascade Whether or not to cascade the truncation * @return The SQL query to use for truncating a table */ @Since("2.3.0") - def getTruncateQuery(table: String): String = { + def getTruncateQuery(table: String, + cascade: Option[Boolean] = isCascadingTruncateTable): String = { --- End diff -- For this one, do we need to keep the original method for backward compatibility, like the following?
```
@Since("2.3.0")
def getTruncateQuery(table: String): String = {
  getTruncateQuery(table, isCascadingTruncateTable)
}

@Since("2.4.0")
def getTruncateQuery(table: String, cascade: Option[Boolean] = isCascadingTruncateTable)
```
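The backward-compatibility suggestion in this comment can be sketched as a pair of overloads, where the original single-argument method stays in place and delegates to the new one. This is a simplified, hypothetical illustration: `SketchJdbcDialect` and `OverloadDemo` are invented names, the `@Since` annotations are omitted, the default argument from the reviewer's snippet is dropped to keep overload resolution obvious, and the base-class SQL strings are assumptions.

```scala
// Hypothetical sketch of the overload pattern; not Spark's actual JdbcDialect.
abstract class SketchJdbcDialect {
  // Default cascading behaviour; None means "unknown" in this sketch.
  def isCascadingTruncateTable(): Option[Boolean] = None

  // Original single-argument signature, kept so existing callers still compile.
  def getTruncateQuery(table: String): String =
    getTruncateQuery(table, isCascadingTruncateTable())

  // New two-argument signature carrying the explicit cascade choice.
  def getTruncateQuery(table: String, cascade: Option[Boolean]): String =
    cascade match {
      case Some(true) => s"TRUNCATE TABLE $table CASCADE"
      case _ => s"TRUNCATE TABLE $table"
    }
}

object OverloadDemo {
  def main(args: Array[String]): Unit = {
    val dialect = new SketchJdbcDialect {}
    println(dialect.getTruncateQuery("t"))             // old call site, unchanged
    println(dialect.getTruncateQuery("t", Some(true))) // new call site
  }
}
```

The design choice is that existing callers and third-party dialects written against the single-argument method keep working, while new code can pass the cascade option explicitly.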
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927868 --- Diff: docs/sql-programming-guide.md --- @@ -1372,6 +1372,13 @@ the following case-insensitive options: This is a JDBC writer related option. When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to false. This option applies only to writing. + + +cascadeTruncate + +This is a JDBC writer related option. If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this options allows execution of a TRUNCATE TABLE t CASCADE. This will affect other tables, and thus should be used with care. This option applies only to writing. It defaults to the default cascading truncate behaviour of the JDBC database in question, specified in the isCascadeTruncate in each JDBCDialect. --- End diff -- Ur, for PostgreSQL, we are generating `TRUNCATE TABLE ONLY $table CASCADE`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927785 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala --- @@ -64,7 +64,16 @@ private class AggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect } } - override def getTruncateQuery(table: String): String = { -dialects.head.getTruncateQuery(table) + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. --- End diff -- ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927699 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala --- @@ -94,5 +94,21 @@ private case object OracleDialect extends JdbcDialect { case _ => value } + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable() + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery( --- End diff -- Could you move this function after `isCascadingTruncateTable` like the other files? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927651 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala --- @@ -94,5 +94,21 @@ private case object OracleDialect extends JdbcDialect { case _ => value } + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927631 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala --- @@ -42,4 +42,17 @@ private object MsSqlServerDialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for MsSql as it is unsupported. --- End diff -- And, for all dialects, please remove it or move it consistently. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927633 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala --- @@ -46,4 +46,17 @@ private case object MySQLDialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927608 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala --- @@ -42,4 +42,17 @@ private object MsSqlServerDialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for MsSql as it is unsupported. --- End diff -- I think we can ignore `Ignored for MsSql as it is unsupported.` here. If you want, you can make a function comment of `override def isCascadingTruncateTable(): Option[Boolean] = Some(false)`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927597 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala --- @@ -42,4 +42,17 @@ private object MsSqlServerDialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. --- End diff -- `The JDBCOptions`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20619: [SPARK-23457][SQL] Register task completion listeners fi...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20619 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927546 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/DB2Dialect.scala --- @@ -49,4 +49,17 @@ private object DB2Dialect extends JdbcDialect { } override def isCascadingTruncateTable(): Option[Boolean] = Some(false) + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. --- End diff -- `The JDBCOptions`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927495 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala --- @@ -120,11 +121,13 @@ abstract class JdbcDialect extends Serializable { * The SQL query that should be used to truncate a table. Dialects can override this method to * return a query that is suitable for a particular database. For PostgreSQL, for instance, * a different query is used to prevent "TRUNCATE" affecting other tables. - * @param table The name of the table. + * @param table The table to truncate + * @param cascade Whether or not to cascade the truncation * @return The SQL query to use for truncating a table */ @Since("2.3.0") - def getTruncateQuery(table: String): String = { + def getTruncateQuery(table: String, + cascade: Option[Boolean] = isCascadingTruncateTable): String = { --- End diff -- @danielvdende, could you take a look at the guide? This guide is helpful in the Apache Spark community. - https://github.com/databricks/scala-style-guide
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927408 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/DerbyDialect.scala --- @@ -41,4 +41,16 @@ private object DerbyDialect extends JdbcDialect { Option(JdbcType("DECIMAL(31,5)", java.sql.Types.DECIMAL)) case _ => None } + + /** + * The SQL query used to truncate a table. + * @param table The JDBCOptions. + * @param cascade Whether or not to cascade the truncation. Default value is the + *value of isCascadingTruncateTable(). Ignored for Derby as it is unsupported. + * @return The SQL query to use for truncating a table + */ + override def getTruncateQuery(table: String, +cascade: Option[Boolean] = isCascadingTruncateTable): String = { --- End diff -- indentation.
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927400 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -102,7 +102,12 @@ object JdbcUtils extends Logging { val dialect = JdbcDialects.get(options.url) val statement = conn.createStatement try { - statement.executeUpdate(dialect.getTruncateQuery(options.table)) + if (options.isCascadeTruncate.isDefined) { +statement.executeUpdate(dialect.getTruncateQuery(options.table, + options.isCascadeTruncate)) + } else { +statement.executeUpdate(dialect.getTruncateQuery(options.table)) + } --- End diff -- I see.
[GitHub] spark pull request #20057: [SPARK-22880][SQL] Add cascadeTruncate option to ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20057#discussion_r168927406 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/DerbyDialect.scala --- @@ -41,4 +41,16 @@ private object DerbyDialect extends JdbcDialect { Option(JdbcType("DECIMAL(31,5)", java.sql.Types.DECIMAL)) case _ => None } + + /** + * The SQL query used to truncate a table. --- End diff -- Indentation.
[GitHub] spark issue #19775: [SPARK-22343][core] Add support for publishing Spark met...
Github user stoader commented on the issue: https://github.com/apache/spark/pull/19775 @erikerlandson we tested this on Kubernetes using https://github.com/prometheus/pushgateway/tree/v0.3.1 and https://github.com/prometheus/pushgateway/tree/v0.4.0
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20634 **[Test build #87525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87525/testReport)** for PR 20634 at commit [`bde6818`](https://github.com/apache/spark/commit/bde681816bc8251010aa14fd0f4fb27b732fd061).
[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 Yep, @kiszk. @mgaido91 also reported that, so I'm investigating it further. However, that doesn't mean this approach is wrong. You can see the manual test case example in the previous ORC-related PR and in this PR. This approach definitely reduces the number of points of failure. For the remaining issue, I think we may need a different approach in a different code path.
[GitHub] spark issue #20619: [SPARK-23390][SQL] Register task completion listeners fi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20619 For the following, I'll do it. > Would it be worth to add this JIRA number in a comment as we did for ORC?
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/947/ Test PASSed.
[GitHub] spark issue #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20634 Merged build finished. Test PASSed.
[GitHub] spark pull request #20634: [SPARK-23456][SPARK-21783] Turn on `native` ORC i...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/20634 [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and PPD by default ## What changes were proposed in this pull request? Apache Spark 2.3 introduced `native` ORC support with vectorization and many fixes. However, it's shipped as a non-default option. This PR enables the `native` ORC implementation and predicate pushdown by default for Apache Spark 2.4. We will improve and stabilize the ORC data source before Apache Spark 2.4. And, eventually, Apache Spark will drop the old Hive-based ORC code. ## How was this patch tested? Pass Jenkins with the existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-23456 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20634.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20634 commit bde681816bc8251010aa14fd0f4fb27b732fd061 Author: Dongjoon Hyun Date: 2018-02-17T16:56:31Z [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and PPD by default
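Until such a default change lands, the same behaviour can be requested per session. The snippet below is a configuration sketch, assuming the two Spark 2.3 options involved are `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` and that `spark-sql` is on the classpath; the application name and master are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Opt in to the native ORC reader and ORC predicate pushdown explicitly,
// rather than relying on the defaults of the running Spark version.
val spark = SparkSession.builder()
  .appName("native-orc-sketch") // placeholder name
  .master("local[*]") // placeholder master for a local run
  .config("spark.sql.orc.impl", "native")
  .config("spark.sql.orc.filterPushdown", "true")
  .getOrCreate()
```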
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168925882 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -407,6 +407,34 @@ object PartitioningUtils { Literal(bigDecimal) } +val dateTry = Try { + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // DateType + DateTimeUtils.getThreadLocalDateFormat.parse(raw) + // SPARK-23436: Casting the string to date may still return null if a bad Date is provided. + // This can happen since DateFormat.parse may not use the entire text of the given string: + // so if there are extra-characters after the date, it returns correctly. + // We need to check that we can cast the raw string since we later can use Cast to get + // the partition values with the right DataType (see + // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning) + val dateValue = Cast(Literal(raw), DateType).eval() + // Disallow DateType if the cast returned null + require(dateValue != null) + Literal.create(dateValue, DateType) +} + +val timestampTry = Try { + val unescapedRaw = unescapePathName(raw) + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // TimestampType + DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw) --- End diff -- Sorry, I am not sure I fully understood your question. Could you elaborate a bit more, please?
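The code comment in the diff relies on the fact that `DateFormat.parse(String)` may stop before the end of the input. The sketch below illustrates that behaviour with plain `java.text` classes; the `yyyy-MM-dd` pattern and the object name `LenientDateParseSketch` are assumptions for the example and are not taken from `DateTimeUtils`.

```scala
import java.text.{ParsePosition, SimpleDateFormat}
import scala.util.Try

object LenientDateParseSketch {
  // A fresh formatter per call, because SimpleDateFormat is not thread-safe.
  private def newFormat(): SimpleDateFormat = {
    val format = new SimpleDateFormat("yyyy-MM-dd")
    format.setLenient(false)
    format
  }

  // parse(String) succeeds as soon as a valid date prefix is found, so
  // trailing characters after the date do not raise an exception.
  def parsesPrefix(raw: String): Boolean = Try(newFormat().parse(raw)).isSuccess

  // Stricter check: accept only if the parser consumed the whole string.
  def parsesWhole(raw: String): Boolean = {
    val position = new ParsePosition(0)
    val parsed = newFormat().parse(raw, position)
    parsed != null && position.getIndex == raw.length
  }
}
```

With this sketch, a value such as `2018-01-01-extra` passes `parsesPrefix` but fails `parsesWhole`, which is the kind of partition value the extra `Cast` check in the diff is meant to reject.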
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Merged build finished. Test FAILed.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20057 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87524/ Test FAILed.
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20057 **[Test build #87524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87524/testReport)** for PR 20057 at commit [`3921368`](https://github.com/apache/spark/commit/39213688b9389e1a2aa4763a8a09c3d1ca402e6e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20621: [SPARK-23436][SQL] Infer partition as Date only i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20621#discussion_r168925339 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -407,6 +407,34 @@ object PartitioningUtils { Literal(bigDecimal) } +val dateTry = Try { + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // DateType + DateTimeUtils.getThreadLocalDateFormat.parse(raw) + // SPARK-23436: Casting the string to date may still return null if a bad Date is provided. + // This can happen since DateFormat.parse may not use the entire text of the given string: + // so if there are extra-characters after the date, it returns correctly. + // We need to check that we can cast the raw string since we later can use Cast to get + // the partition values with the right DataType (see + // org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning) + val dateValue = Cast(Literal(raw), DateType).eval() + // Disallow DateType if the cast returned null + require(dateValue != null) + Literal.create(dateValue, DateType) +} + +val timestampTry = Try { + val unescapedRaw = unescapePathName(raw) + // try and parse the date, if no exception occurs this is a candidate to be resolved as + // TimestampType + DateTimeUtils.getThreadLocalTimestampFormat(timeZone).parse(unescapedRaw) --- End diff -- Since this changes the behavior of `PartitioningUtils.parsePartitions`, doesn't it change the result of another path in `inferPartitioning`?
[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...
Github user asolimando commented on a diff in the pull request: https://github.com/apache/spark/pull/20632#discussion_r168925267 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -640,4 +689,96 @@ private object RandomForestSuite { val (indices, values) = map.toSeq.sortBy(_._1).unzip Vectors.sparse(size, indices.toArray, values.toArray) } + + /** Generate a label. */ + private def generateLabel(rnd: Random, numClasses: Int): Double = { +rnd.nextInt(numClasses) + } + + /** Generate a numeric value in the range [numericMin, numericMax]. */ + private def generateNumericValue(rnd: Random, numericMin: Double, numericMax: Double) : Double = { +rnd.nextDouble() * (Math.abs(numericMax) + Math.abs(numericMin)) + numericMin + } + + /** Generate a binary value. */ + private def generateBinaryValue(rnd: Random) : Double = if (rnd.nextBoolean()) 1 else 0 + + /** Generate an array of binary values of length numBinary. */ + private def generateBinaryArray(rnd: Random, numBinary: Int): Array[Double] = { +Range.apply(0, numBinary).map(_ => generateBinaryValue(rnd)).toArray + } + + /** Generate an array of binary values of length numNumeric in the range +* [numericMin, numericMax]. */ + private def generateNumericArray(rnd: Random, + numNumeric: Int, + numericMin: Double, + numericMax: Double) : Array[Double] = { +Range.apply(0, numNumeric).map(_ => generateNumericValue(rnd, numericMin, numericMax)).toArray + } + + /** Generate a LabeledPoint with numNumeric numeric values followed by numBinary binary values. 
*/ + private def generatePoint(rnd: Random, +numClasses: Int, +numNumeric: Int = 10, +numericMin: Double = -100, +numericMax: Double = 100, +numBinary: Int = 10): LabeledPoint = { +val label = generateLabel(rnd, numClasses) +val numericArray = generateNumericArray(rnd, numNumeric, numericMin, numericMax) +val binaryArray = generateBinaryArray(rnd, numBinary) +val vector = Vectors.dense(numericArray ++ binaryArray) + +new LabeledPoint(label, vector) + } + + /** Data for tree redundancy tests which produces a non-trivial tree. */ + private def generateRedundantPoints(sc: SparkContext, + numClasses: Int, + numPoints: Int, + rnd: Random): RDD[LabeledPoint] = { +sc.parallelize(Range.apply(1, numPoints) + .map(_ => generatePoint(rnd, numClasses = numClasses, numNumeric = 0, numBinary = 3))) + } + + /** +* Returns true iff the decision tree has at least one subtree that can be pruned +* (i.e., all its leaves share the same prediction). +* @param tree the tree to be tested +* @return true iff the decision tree has at least one subtree that can be pruned. +*/ + private def isRedundant(tree: DecisionTreeModel): Boolean = _isRedundant(tree.rootNode) --- End diff -- I have removed the documentation (except the `@return`) for all the auxiliary methods here.
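A side note on the quoted `generateNumericValue` helper, which was not raised in the review itself: it scales by `Math.abs(numericMax) + Math.abs(numericMin)`, which equals the interval width `numericMax - numericMin` only when the interval straddles zero, as the defaults -100 and 100 do. The sketch below shows one possible general form; `UniformRangeSketch` and `uniform` are hypothetical names introduced for the example.

```scala
import scala.util.Random

object UniformRangeSketch {
  // Scale a draw from [0, 1) by the interval width, then shift it by the
  // lower bound, so the result lies in [min, max) for any min <= max.
  def uniform(rnd: Random, min: Double, max: Double): Double = {
    require(min <= max, "min must not exceed max")
    rnd.nextDouble() * (max - min) + min
  }
}
```

For an interval such as [10, 20], this keeps every draw inside the interval, whereas the absolute-value form would scale by 30 and could return values up to 40.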