[GitHub] [spark] AmplabJenkins removed a comment on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
AmplabJenkins removed a comment on pull request #28766: URL: https://github.com/apache/spark/pull/28766#issuecomment-641771608 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
AmplabJenkins commented on pull request #28766: URL: https://github.com/apache/spark/pull/28766#issuecomment-641771608
[GitHub] [spark] dilipbiswal edited a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
dilipbiswal edited a comment on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766979

@maropu
> Have you checked my last comment? #28750 (comment) The PR itself looks okay.

Sorry, I missed that. I have added the comment now.

@viirya
> Looks okay although I think making it a long might be also good and simpler.

We could make it a long. The only catch is that it may still overflow, although it would probably take a long time to hit. I can reproduce the StringIndexOutOfBoundsException after changing the length to long: I made a minor tweak so that the append method fakes the input's length as Int.MaxValue, and adjusted the test to increase the loop count.
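The overflow concern in this comment can be illustrated with a small sketch. Python integers never overflow, so `wrap_int32` below emulates a JVM `Int`. All names here are hypothetical; this is not Spark's `StringConcat` code, only an illustration of why a wrapping length counter can produce a bogus "remaining space" value while a clamped one cannot.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (JVM Int)


def wrap_int32(n):
    """Emulate signed 32-bit two's-complement wraparound."""
    return (n + 2**31) % 2**32 - 2**31


def add_wrapping(length, n):
    """Update a running length the way a plain 32-bit counter would."""
    return wrap_int32(length + n)


def add_saturating(length, n):
    """Update a running length but clamp it at INT_MAX instead of wrapping."""
    return min(length + n, INT_MAX)


def available(max_length, length):
    """Characters still allowed before the limit (never negative)."""
    return max(0, max_length - length)


# A wrapped counter turns negative, so the remaining space looks larger
# than the limit itself; a clamped counter reports no space left.
wrapped = add_wrapping(INT_MAX, 1)
clamped = add_saturating(INT_MAX, 1)
print(wrapped, available(100, wrapped))
print(clamped, available(100, clamped))
```

Widening the counter only postpones the wraparound, which matches the remark above that a long "may still overflow"; clamping is one possible way to remove it regardless of width.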
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
AmplabJenkins removed a comment on pull request #28733: URL: https://github.com/apache/spark/pull/28733#issuecomment-641770390
[GitHub] [spark] SparkQA removed a comment on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
SparkQA removed a comment on pull request #28766: URL: https://github.com/apache/spark/pull/28766#issuecomment-641679788 **[Test build #123715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123715/testReport)** for PR 28766 at commit [`a11a049`](https://github.com/apache/spark/commit/a11a049ad735ea4375e1b742c2fd9ba0093248c8).
[GitHub] [spark] SparkQA commented on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
SparkQA commented on pull request #28766: URL: https://github.com/apache/spark/pull/28766#issuecomment-641770461 **[Test build #123715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123715/testReport)** for PR 28766 at commit [`a11a049`](https://github.com/apache/spark/commit/a11a049ad735ea4375e1b742c2fd9ba0093248c8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
AmplabJenkins commented on pull request #28733: URL: https://github.com/apache/spark/pull/28733#issuecomment-641770390
[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
cloud-fan commented on a change in pull request #28733: URL: https://github.com/apache/spark/pull/28733#discussion_r437898264

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ##

@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be conversed into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for {x <- left; y <- right} yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(

Review comment: `groupExprsByQualifier`
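For readers unfamiliar with the conversion under review, here is a minimal sketch of CNF conversion with a size cap. It is a simplification and not the patch's code: it is recursive rather than stack-based, it skips the grouping of predicates by qualifier, and the tuple encoding (`("and", l, r)`, `("or", l, r)`, anything else a leaf) and the name `to_cnf` are assumptions made for illustration.

```python
def to_cnf(expr, max_count):
    """Return a list of clauses whose conjunction is equivalent to expr.

    Each clause is a leaf or a nested ("or", x, y) tuple. An empty list
    means the conversion was abandoned because distributing an Or would
    produce more than max_count clauses.
    """
    if isinstance(expr, tuple) and expr[0] == "and":
        left = to_cnf(expr[1], max_count)
        right = to_cnf(expr[2], max_count)
        if not left or not right:
            return []
        # CNF of (A and B) is the clauses of A followed by the clauses of B.
        return left + right
    if isinstance(expr, tuple) and expr[0] == "or":
        left = to_cnf(expr[1], max_count)
        right = to_cnf(expr[2], max_count)
        if not left or not right:
            return []
        # Distributing Or over And yields len(left) * len(right) clauses,
        # which is where the exponential growth comes from.
        if len(left) * len(right) > max_count:
            return []
        return [("or", x, y) for x in left for y in right]
    # Any other node is treated as an opaque leaf predicate.
    return [expr]


# (a and b) or c  becomes  (a or c) and (b or c)
print(to_cnf(("or", ("and", "a", "b"), "c"), max_count=10))
```

The cap mirrors the role the diff gives to `maxCnfNodeCount`: the product of the two sides is checked before the cross product is built, so an oversized expansion is never materialized.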
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"
AmplabJenkins removed a comment on pull request #28775: URL: https://github.com/apache/spark/pull/28775#issuecomment-641770136
[GitHub] [spark] xuanyuanking commented on pull request #28737: [SPARK-31913][SQL] Fix StackOverflowError in FileScanRDD
xuanyuanking commented on pull request #28737: URL: https://github.com/apache/spark/pull/28737#issuecomment-641769868

Let me clarify. The issue is that the recursive calls in FileScanRDD cause a StackOverflowError when there are too many empty files. Could you please quantify the number of empty files in your environment?
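The failure mode described here, one stack frame per skipped empty file, can be reproduced in miniature. The sketch below is not FileScanRDD code; the function names are hypothetical, and each "file" is modeled as a string whose emptiness stands for an empty file.

```python
def next_nonempty_recursive(files, i=0):
    """Skip empty files by recursing; stack depth grows with each empty file."""
    if i >= len(files):
        return None
    if files[i]:
        return i
    return next_nonempty_recursive(files, i + 1)


def next_nonempty_iterative(files, i=0):
    """Skip empty files with a loop; stack depth stays constant."""
    while i < len(files):
        if files[i]:
            return i
        i += 1
    return None


# 100,000 empty files followed by one file with data.
files = [""] * 100_000 + ["data"]
print(next_nonempty_iterative(files))
try:
    next_nonempty_recursive(files)
except RecursionError:
    print("recursive version exhausted the stack")
```

In Python the recursive variant fails with `RecursionError` once the interpreter's recursion limit is reached, which plays the same role as the JVM's `StackOverflowError` in the report above.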
[GitHub] [spark] AmplabJenkins commented on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"
AmplabJenkins commented on pull request #28775: URL: https://github.com/apache/spark/pull/28775#issuecomment-641770136
[GitHub] [spark] SparkQA commented on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
SparkQA commented on pull request #28733: URL: https://github.com/apache/spark/pull/28733#issuecomment-641769586 **[Test build #123729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123729/testReport)** for PR 28733 at commit [`15437b3`](https://github.com/apache/spark/commit/15437b325402b4743a323c6c08f5b72990934547).
[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
viirya commented on a change in pull request #28761: URL: https://github.com/apache/spark/pull/28761#discussion_r437897971

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1FilterSuite.scala ##

@@ -19,7 +19,7 @@ package org.apache.spark.sql.execution.datasources.orc
 import scala.collection.JavaConverters._
 import org.apache.spark.SparkConf
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Column, DataFrame, Row}

Review comment: Thanks. Removed the unnecessary change.
[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
viirya commented on a change in pull request #28761: URL: https://github.com/apache/spark/pull/28761#discussion_r437897740

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala ##

@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }

   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Array[String] = Array.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields.toSeq, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).toSeq.quoted, f.dataType))

Review comment: Okay, changed to `Seq`. It was actually following a similar method in `ParquetFilters`.
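The doc comment in this hunk describes flattening a nested schema into dot-separated keys, quoting any name part that itself contains a dot. A rough sketch of that idea follows. It is not the Scala method: the schema is modeled as nested dicts with type names as plain strings, and the backtick quoting (with embedded backticks doubled) is an assumption about what "quoted" means here.

```python
def quote(part):
    """Wrap a name part in backticks if it contains a dot (assumed quoting style)."""
    if "." in part:
        return "`" + part.replace("`", "``") + "`"
    return part


def flatten_fields(schema, parents=()):
    """Map each primitive leaf column to its type, keyed by its dotted path.

    schema is a dict of name -> type, where a nested dict stands for a
    struct and "binary" columns are skipped, as in the hunk above.
    """
    result = {}
    for name, dtype in schema.items():
        path = parents + (name,)
        if isinstance(dtype, dict):
            # Struct: recurse with the struct's name added to the path.
            result.update(flatten_fields(dtype, path))
        elif dtype != "binary":
            result[".".join(quote(p) for p in path)] = dtype
    return result


schema = {"a": {"b": "int", "c.d": "string"}, "e": "binary", "f": "long"}
print(flatten_fields(schema))
```

Quoting matters because without it the nested column `a` -> `c.d` and a three-level column `a` -> `c` -> `d` would both produce the key `a.c.d`.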
[GitHub] [spark] SparkQA removed a comment on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"
SparkQA removed a comment on pull request #28775: URL: https://github.com/apache/spark/pull/28775#issuecomment-641706549 **[Test build #123721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123721/testReport)** for PR 28775 at commit [`3d55ef8`](https://github.com/apache/spark/commit/3d55ef8a63f1d6a698a63882d5421f4eb385240b).
[GitHub] [spark] SparkQA commented on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"
SparkQA commented on pull request #28775: URL: https://github.com/apache/spark/pull/28775#issuecomment-641768855 **[Test build #123721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123721/testReport)** for PR 28775 at commit [`3d55ef8`](https://github.com/apache/spark/commit/3d55ef8a63f1d6a698a63882d5421f4eb385240b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
viirya commented on a change in pull request #28761: URL: https://github.com/apache/spark/pull/28761#discussion_r437897212

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala ##

@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }

   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Array[String] = Array.empty): Seq[(String, DataType)] = {

Review comment: Using `Seq` now.
[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
viirya commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437896443

## File path: python/pyspark/sql/pandas/serializers.py ##

@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks from pandas array types

Review comment: Is there any test case for this?
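The dispatch added in this hunk follows a duck-typing pattern: if the backing container exposes an `__arrow_array__(type=...)` hook, let it build the array itself, otherwise fall back to the generic path. The sketch below shows only that control flow using stand-in classes. `FakeArrowArray`, `ProtocolBacked`, and `generic_convert` are hypothetical and are not pyarrow or pandas types.

```python
class FakeArrowArray:
    """Hypothetical stand-in for an Arrow array; NOT a pyarrow type."""

    def __init__(self, values, type=None, source="generic"):
        self.values = list(values)
        self.type = type
        self.source = source  # records which conversion path produced it


class ProtocolBacked:
    """Hypothetical container that implements the __arrow_array__ hook."""

    def __init__(self, values):
        self._values = values

    def __arrow_array__(self, type=None):
        return FakeArrowArray(self._values, type, source="protocol")


def generic_convert(values, type=None):
    """Stand-in for the generic conversion used when no hook is present."""
    return FakeArrowArray(values, type, source="generic")


def create_array(values, type=None):
    # Prefer the container's own conversion when it provides one.
    if hasattr(values, "__arrow_array__"):
        return values.__arrow_array__(type=type)
    return generic_convert(values, type)


print(create_array(ProtocolBacked([1, 2]), "int64").source)
print(create_array([1, 2], "int64").source)
```

A test along these lines, one input with the hook and one without, is roughly what the review question above is asking for.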
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
AmplabJenkins removed a comment on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766412 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123728/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
AmplabJenkins removed a comment on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766399 Merged build finished. Test FAILed.
[GitHub] [spark] dilipbiswal commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
dilipbiswal commented on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766979

@maropu
> Have you checked my last comment? #28750 (comment) The PR itself looks okay.

Sorry, I missed that. I have added the comment now.

@viirya We could make it a long. The only catch is that it may still overflow, although it would probably take a long time to hit. I can reproduce the StringIndexOutOfBoundsException after changing the length to long: I made a minor tweak so that the append method fakes the input's length as Int.MaxValue, and adjusted the test to increase the loop count.
[GitHub] [spark] SparkQA removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
SparkQA removed a comment on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641765460 **[Test build #123728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
AmplabJenkins removed a comment on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766107
[GitHub] [spark] AmplabJenkins commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-641766138
[GitHub] [spark] SparkQA commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
SparkQA commented on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766373 **[Test build #123728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins removed a comment on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-641766138
[GitHub] [spark] AmplabJenkins commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
AmplabJenkins commented on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766399
[GitHub] [spark] AmplabJenkins commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
AmplabJenkins commented on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641766107
[GitHub] [spark] AmplabJenkins commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
AmplabJenkins commented on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641765619
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
AmplabJenkins removed a comment on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641765619
[GitHub] [spark] SparkQA commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
SparkQA commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-641765400 **[Test build #123727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123727/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662).
[GitHub] [spark] SparkQA commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
SparkQA commented on pull request #28750: URL: https://github.com/apache/spark/pull/28750#issuecomment-641765460 **[Test build #123728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).
[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
viirya commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437893088

## File path: python/pyspark/sql/pandas/conversion.py ##

@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment: So without this change, `pa.Schema.from_pandas` cannot handle pandas extension types and `pd.NA` values?
[GitHub] [spark] SparkQA commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
SparkQA commented on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641764410 **[Test build #123720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123720/testReport)** for PR 28412 at commit [`d6c7d98`](https://github.com/apache/spark/commit/d6c7d988bd9e39caebb9a33f8c01ee230b6c2a39).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
SparkQA removed a comment on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641701635 **[Test build #123720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123720/testReport)** for PR 28412 at commit [`d6c7d98`](https://github.com/apache/spark/commit/d6c7d988bd9e39caebb9a33f8c01ee230b6c2a39).
[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
zhli1142015 edited a comment on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768
> Then how about capture the exception and ask the user to increase the related configuration or try loading the page again?
Because disk space is limited, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
> What is the benefit of this PR to users?
I think the cause of the issue is a resource leak (a file handle on Windows, which prevents `HistoryServerDiskManager` from releasing space) caused by a race condition; my PR tries to fix this. We actually run the Spark history server as a long-running service to provide a diagnostic experience to others. The benefit to us is that we no longer need to stop the service and manually restart HS after some period.
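As an illustration of the idea in the PR title (closing outstanding iterators when the store itself is closed), here is a minimal Python sketch. It is not the Spark or LevelDB code: `Store`, `TrackedIterator`, and every method name are hypothetical, and the sketch only assumes that an iterator left open can keep a resource (such as a file handle) alive after the store is closed.

```python
import threading


class TrackedIterator:
    """Hypothetical iterator that reports back to its store when closed."""

    def __init__(self, store, items):
        self._store = store
        self._items = iter(items)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        # A closed iterator yields nothing more instead of touching the store.
        if self.closed:
            raise StopIteration
        return next(self._items)

    def close(self):
        if not self.closed:
            self.closed = True
            self._store._forget(self)


class Store:
    """Hypothetical key-value store that tracks the iterators it hands out."""

    def __init__(self, data):
        self._data = dict(data)
        self._lock = threading.Lock()
        self._open_iterators = set()
        self.closed = False

    def iterator(self):
        with self._lock:
            if self.closed:
                raise RuntimeError("store is closed")
            it = TrackedIterator(self, sorted(self._data.items()))
            self._open_iterators.add(it)
            return it

    def _forget(self, it):
        with self._lock:
            self._open_iterators.discard(it)

    def close(self):
        # Mark the store closed and take a snapshot under the lock, then close
        # the iterators outside it so close() and _forget() cannot deadlock.
        with self._lock:
            self.closed = True
            pending = list(self._open_iterators)
            self._open_iterators.clear()
        for it in pending:
            it.close()
```

With this shape, a page request that still holds an iterator while the store is being evicted simply sees the iteration end, and no iterator outlives the store.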
[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
viirya commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437887602
## File path: python/pyspark/sql/pandas/conversion.py ##
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)
Review comment: Let's add a comment here to explain it?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils
AmplabJenkins removed a comment on pull request #27617: URL: https://github.com/apache/spark/pull/27617#issuecomment-641754416
[GitHub] [spark] AmplabJenkins commented on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils
AmplabJenkins commented on pull request #27617: URL: https://github.com/apache/spark/pull/27617#issuecomment-641754416
[GitHub] [spark] SparkQA commented on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils
SparkQA commented on pull request #27617: URL: https://github.com/apache/spark/pull/27617#issuecomment-641753870 **[Test build #123726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123726/testReport)** for PR 27617 at commit [`311a47e`](https://github.com/apache/spark/commit/311a47e433a66a932d67bf02ff587ec3f383653a).
[GitHub] [spark] zhli1142015 commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
zhli1142015 commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768
> Then how about capture the exception and ask the user to increase the related configuration or try loading the page again?
Because disk space is limited, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
> What is the benefit of this PR to users?
I think the cause of the issue is a resource leak caused by a race condition (a file handle on Windows that prevents `HistoryServerDiskManager` from releasing space); my PR tries to fix this.
[GitHub] [spark] gengliangwang commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
gengliangwang commented on a change in pull request #28733: URL: https://github.com/apache/spark/pull/28733#discussion_r437884631
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be conversed into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for {x <- left; y <- right} yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
Review comment: Well this should never happen. But yes let's return Nil
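For readers following the review, the bottom-up conversion in the diff above can be sketched in a few lines of Python. This is a simplified illustration, not the Spark implementation: the `Atom`, `And`, and `Or` classes and the `post_order` and `to_cnf` helpers are hypothetical names, the per-qualifier grouping step (`aggregateExpressionsOfSameQualifiers`) is left out, and the threshold is passed as a plain argument instead of being read from `SQLConf`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Atom, And, Or]


def post_order(root):
    """Return the nodes with children before parents, without recursion."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        if isinstance(node, (And, Or)):
            stack.append(node.left)
            stack.append(node.right)
    out.reverse()  # reversing (root, right, left) gives (left, right, root)
    return out


def to_cnf(expr, max_nodes):
    """Return clauses whose conjunction equals expr, or [] if over budget."""
    results = []  # stack of clause lists, one per processed sub-expression
    for node in post_order(expr):
        if isinstance(node, And):
            right = results.pop()
            left = results.pop()
            cnf = left + right  # And: concatenate the clause lists
        elif isinstance(node, Or):
            right = results.pop()
            left = results.pop()
            # Or distributes over And, so the clause count multiplies.
            if len(left) * len(right) > max_nodes:
                return []
            cnf = [Or(x, y) for x in left for y in right]
        else:
            cnf = [node]
        results.append(cnf)
    return results[0]
```

For example, `(a AND b) OR c` becomes the two clauses `a OR c` and `b OR c`, while `(a AND b) OR (c AND d)` needs four clauses and is abandoned when the budget is three.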
[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
cloud-fan commented on a change in pull request #28733: URL: https://github.com/apache/spark/pull/28733#discussion_r437882318
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be conversed into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for {x <- left; y <- right} yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
Review comment: shall we return Nil from here?
[GitHub] [spark] gengliangwang commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
gengliangwang commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641747325 Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again? What is the benefit of this PR to users?
[GitHub] [spark] agrawaldevesh commented on pull request #27636: [SPARK-30873][CORE][YARN]Handling Node Decommissioning for Yarn cluster manger in Spark
agrawaldevesh commented on pull request #27636: URL: https://github.com/apache/spark/pull/27636#issuecomment-641747304
@SaurabhChawla100, can you briefly update the PR description to reflect how this work relates to the recently merged https://github.com/apache/spark/pull/27864? Perhaps you can leverage or enhance the abstractions added in that PR a bit? I also don't fully understand the relationship between this PR and the original decommissioning PR https://github.com/apache/spark/pull/26440. I am trying to get a sense of the end state with all these decommissioning PRs trying to stretch the framework in different ways.
@holdenk or @prakharjain09, you recently (greatly) enhanced Spark's decommissioning story, and I am curious about your thoughts on this PR and how you see it fitting in with the work that you have done.
From what I can tell:
* https://github.com/apache/spark/pull/26440 improved decommissioning for compute by not scheduling work on the executors that will be removed soon. It seemed to be a bit k8s oriented.
* https://github.com/apache/spark/pull/27864 improves this further by eagerly replicating the cached blocks but not like regular shuffle blocks.
* This PR, https://github.com/apache/spark/pull/27636, has a bit of a YARN focus: it clears the shuffle state to force an eager re-computation and has special handling for ignoring the fetch failures. But it does not seem to be building on top of the previous two PRs.
Thank you for working on this.
[GitHub] [spark] gengliangwang commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
gengliangwang commented on a change in pull request #28733: URL: https://github.com/apache/spark/pull/28733#discussion_r437880281
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be conversed into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for {x <- left; y <- right} yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(
Review comment: Hmm, then the name contains two `By`
[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table
karuppayya commented on pull request #28662: URL: https://github.com/apache/spark/pull/28662#issuecomment-641746170 @viirya @maropu Can you please help review this PR?
[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
cloud-fan commented on a change in pull request #28733: URL: https://github.com/apache/spark/pull/28733#discussion_r437879694
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be conversed into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for {x <- left; y <- right} yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(
Review comment: nit: `groupByExprsByQualifier`
[GitHub] [spark] zhli1142015 commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
zhli1142015 commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641744773
> could you describe an end-to-end use case that can reproduce the error page in PR description? Does it only happen when leveldb is evicted or UI server is
Sure. In our case, we host many big event files (>200; the average LevelDB size is 60~70 MB) in the history server, so when we switch between pages for different applications, it triggers `HistoryServerDiskManager` to release disk space, and then the error happens. To reproduce it on a dev machine, you can specify the configuration below, run HS with two applications, open the first application's job page, and then open the second one.
spark.history.retainedApplications 1
spark.history.store.maxDiskUsage 10k
spark.history.store.path d://cache
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
AmplabJenkins removed a comment on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976
[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
AmplabJenkins commented on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976
[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
SparkQA removed a comment on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923 **[Test build #123725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).
[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
moskvax commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437872692
## File path: python/pyspark/sql/pandas/serializers.py ##
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)

         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
Review comment: By the way, this change was made as `CategoricalDtype` is only imported into the root pandas namespace after pandas 0.24.0, which was causing `AttributeError` when testing with earlier versions.
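The dispatch in the diff above (prefer the backing array's own `__arrow_array__` hook, otherwise fall back to the generic path) can be shown without pandas or pyarrow. The sketch below uses stand-ins only: `PlainValues`, `ArrowAwareValues`, and `FakeSeries` are hypothetical classes, and tuples take the place of real `pyarrow.Array` results.

```python
class PlainValues:
    """Hypothetical backing array with no Arrow support."""

    def __init__(self, data):
        self.data = list(data)


class ArrowAwareValues(PlainValues):
    """Hypothetical backing array implementing the __arrow_array__ protocol."""

    def __arrow_array__(self, type=None):
        # A real implementation would return a pyarrow.Array; a tuple stands in.
        return ("arrow", type, tuple(self.data))


class FakeSeries:
    """Hypothetical stand-in for a pandas Series that exposes `.array`."""

    def __init__(self, values):
        self.array = values


def create_array(series, arrow_type=None):
    # Ask the backing array first: extension arrays know how to convert
    # themselves, including their own missing-value representation.
    backing = getattr(series, "array", None)
    if hasattr(backing, "__arrow_array__"):
        return backing.__arrow_array__(type=arrow_type)
    # Generic fallback (stands in for the mask-based conversion path).
    return ("fallback", arrow_type, tuple(backing.data))
```

Checking for the hook with `hasattr` keeps the fallback path unchanged for ordinary columns, so only arrays that opt in to the protocol take the new branch.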
[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
SparkQA commented on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641736450 **[Test build #123725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] siknezevic commented on pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on pull request #27246: URL: https://github.com/apache/spark/pull/27246#issuecomment-641736377
> Also, could you add some benchmark classes in https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark ?
Hello @maropu, I checked the available benchmarks and I can see that there is already a benchmark that can be used: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
I am using the Databricks toolkit for all my testing (10TB and 100TB datasets), and TPCDSQueryBenchmark is based on that toolkit. I was able to test spilling with the TPCDSQueryBenchmark benchmark by running it as follows:
/opt/spark/bin/spark-submit --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --conf 'spark.sql.sortMergeJoinExec.buffer.spill.threshold=6000' --conf 'spark.sql.sortMergeJoinExec.buffer.in.memory.threshold=1000' '/tmp/spark-sql_2.11-2.4.6-SNAPSHOT-tests.jar' --data-location '/user/testusera1/tpcds/datasets-1g/sf1-parquet/useDecimal=true,useDate=true,filterNull=false' --query-filter 'q14a'
It runs fine, and I am able to pass it the Spark config parameters that trigger spilling. I believe that we do not need a new benchmark. Do you agree?
[GitHub] [spark] gengliangwang commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
gengliangwang commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641735784 @zhli1142015 Sorry, I left comments in the code before I read the discussion in the PR. So, before you update the related code, could you describe an end-to-end use case that reproduces the error page shown in the PR description? Does it only happen when the LevelDB instance is evicted or the UI server is shut down?
[GitHub] [spark] yaooqinn commented on a change in pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
yaooqinn commented on a change in pull request #28766: URL: https://github.com/apache/spark/pull/28766#discussion_r437871769 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala ## @@ -433,4 +433,35 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(formatter.format(date(1970, 4, 10)) == "100") } } + + test("SPARK-31939: Fix Parsing day of year when year field pattern is missing") { +// resolved to queryable LocaleDate or fail directly +val f0 = TimestampFormatter("-dd-DD", UTC, isParsing = true) Review comment: Sounds good.
[GitHub] [spark] yaooqinn commented on a change in pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing
yaooqinn commented on a change in pull request #28766: URL: https://github.com/apache/spark/pull/28766#discussion_r437871990 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala ## @@ -39,6 +39,18 @@ trait DateTimeFormatterHelper { } } + private def verifyLocalDate( + accessor: TemporalAccessor, field: ChronoField, candidate: LocalDate): Unit = { +if (accessor.isSupported(field) && candidate.isSupported(field)) { Review comment: For the time being, yes. I can remove this condition ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala ## @@ -39,6 +39,18 @@ trait DateTimeFormatterHelper { } } + private def verifyLocalDate( + accessor: TemporalAccessor, field: ChronoField, candidate: LocalDate): Unit = { +if (accessor.isSupported(field) && candidate.isSupported(field)) { + val actual = accessor.get(field) + val expected = candidate.get(field) + if (actual != expected) { +throw new DateTimeException(s"Conflict found: Field $field $actual differs from" + + s" $field $expected derived from $candidate") Review comment: OK
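The conflict check in the `verifyLocalDate` diff above can be illustrated with a small stand-alone sketch. This is not the Spark code: it is a Python approximation in which a plain dict stands in for the `TemporalAccessor`, the `FIELD_GETTERS` table stands in for `ChronoField`, and the field names are hypothetical. It only shows the idea of comparing a parsed field against the same field derived from the candidate date.

```python
from datetime import date

# Hypothetical stand-ins for ChronoField: each maps a field name to a
# function that derives that field from a candidate date.
FIELD_GETTERS = {
    "day_of_month": lambda d: d.day,
    "month_of_year": lambda d: d.month,
    "day_of_year": lambda d: d.timetuple().tm_yday,
}


def verify_local_date(parsed, field, candidate):
    """Raise ValueError if a parsed field disagrees with the candidate date.

    `parsed` is a dict of fields read from the input string (an assumption
    of this sketch; the original uses a TemporalAccessor).
    """
    # Only check fields that were parsed and that the candidate can provide.
    if field in parsed and field in FIELD_GETTERS:
        actual = parsed[field]
        expected = FIELD_GETTERS[field](candidate)
        if actual != expected:
            raise ValueError(
                f"Conflict found: Field {field} {actual} differs from "
                f"{field} {expected} derived from {candidate}"
            )
```

For example, a candidate of `date(1970, 4, 10)` is day 100 of the year, so a parsed `day_of_month` of 10 is accepted while 11 raises a conflict.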
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
AmplabJenkins removed a comment on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641732806
[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
gengliangwang commented on a change in pull request #28769: URL: https://github.com/apache/spark/pull/28769#discussion_r437870387 ## File path: common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java ## @@ -276,6 +276,41 @@ public void testNegativeIndexValues() throws Exception { assertEquals(expected, results); } + @Test + public void testCloseLevelDBIterator() throws Exception { +// SPARK-31929: test when LevelDB.close() is called, related LevelDBIterators +// are closed. And files opened by iterators are also closed. +File dbpathForCloseTest = File +.createTempFile( Review comment: Please change the indents to two spaces in this file.
[GitHub] [spark] AmplabJenkins commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
AmplabJenkins commented on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641732806
[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
gengliangwang commented on a change in pull request #28769: URL: https://github.com/apache/spark/pull/28769#discussion_r437869878 ## File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java ## @@ -189,7 +198,12 @@ public void delete(Class type, Object naturalKey) throws Exception { @Override public Iterator iterator() { try { - return new LevelDBIterator<>(type, LevelDB.this, this); + LevelDBIterator iterator = new LevelDBIterator<>( + type, Review comment: Nit: put all the parameters on one line? ``` LevelDBIterator iterator = new LevelDBIterator<>(type, LevelDB.this, this);```
[GitHub] [spark] dilipbiswal commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list
dilipbiswal commented on pull request #28773: URL: https://github.com/apache/spark/pull/28773#issuecomment-641732487 LGTM
[GitHub] [spark] SparkQA removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
SparkQA removed a comment on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641689848 **[Test build #123719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123719/testReport)** for PR 28412 at commit [`1e514b9`](https://github.com/apache/spark/commit/1e514b910a56b719a08d6f7a7689a2a53dcc06a5).
[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
gengliangwang commented on a change in pull request #28769: URL: https://github.com/apache/spark/pull/28769#discussion_r437869753 ## File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java ## @@ -256,6 +275,7 @@ void closeIterator(LevelDBIterator it) throws IOException { DB _db = this._db.get(); if (_db != null) { it.close(); +iteratorTracker.remove(it); Review comment: Shall we remove it from the tracker even when `_db` is null?
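The question above, whether an iterator should be untracked even when `_db` is null, can be sketched outside of Spark. The following is a minimal Python illustration of the reviewer's suggestion, not the code in the PR: `Store`, `TrackedIterator`, and `_tracker` are hypothetical names, and a plain object stands in for the LevelDB handle.

```python
class TrackedIterator:
    """Hypothetical stand-in for LevelDBIterator."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Store:
    """Hypothetical stand-in for the LevelDB wrapper with an iterator tracker."""

    def __init__(self):
        self._db = object()    # placeholder for the open database handle
        self._tracker = set()  # iterators that are still open

    def iterator(self):
        it = TrackedIterator()
        self._tracker.add(it)
        return it

    def close_iterator(self, it):
        # Untrack unconditionally, so the tracker cannot keep a stale
        # reference once the store itself has been closed.
        self._tracker.discard(it)
        if self._db is not None:
            it.close()

    def close(self):
        # Close every iterator that is still open, then drop the handle.
        for it in list(self._tracker):
            it.close()
        self._tracker.clear()
        self._db = None
```

With removal done outside the `_db` check, calling `close_iterator` after `close()` is a harmless no-op and the tracker never holds references to iterators of a closed store.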
[GitHub] [spark] SparkQA commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
SparkQA commented on pull request #28412: URL: https://github.com/apache/spark/pull/28412#issuecomment-641732044 **[Test build #123719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123719/testReport)** for PR 28412 at commit [`1e514b9`](https://github.com/apache/spark/commit/1e514b910a56b719a08d6f7a7689a2a53dcc06a5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list
AmplabJenkins commented on pull request #28773: URL: https://github.com/apache/spark/pull/28773#issuecomment-641725711
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list
AmplabJenkins removed a comment on pull request #28773: URL: https://github.com/apache/spark/pull/28773#issuecomment-641725711
[GitHub] [spark] SparkQA removed a comment on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list
SparkQA removed a comment on pull request #28773: URL: https://github.com/apache/spark/pull/28773#issuecomment-641651098 **[Test build #123711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123711/testReport)** for PR 28773 at commit [`1013ac8`](https://github.com/apache/spark/commit/1013ac8064c1e380f4be4c297c165fae1a20602e).
[GitHub] [spark] SparkQA commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list
SparkQA commented on pull request #28773: URL: https://github.com/apache/spark/pull/28773#issuecomment-641724830 **[Test build #123711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123711/testReport)** for PR 28773 at commit [`1013ac8`](https://github.com/apache/spark/commit/1013ac8064c1e380f4be4c297c165fae1a20602e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
AmplabJenkins removed a comment on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641722284 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28349/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
AmplabJenkins removed a comment on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
AmplabJenkins commented on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278
[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
SparkQA commented on pull request #28743: URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923 **[Test build #123725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)** for PR 28743 at commit [`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
AmplabJenkins removed a comment on pull request #27507: URL: https://github.com/apache/spark/pull/27507#issuecomment-641719947 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123716/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
AmplabJenkins removed a comment on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641720102
[GitHub] [spark] AmplabJenkins commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
AmplabJenkins commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641720102
[GitHub] [spark] SparkQA removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
SparkQA removed a comment on pull request #27507: URL: https://github.com/apache/spark/pull/27507#issuecomment-641679815 **[Test build #123716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123716/testReport)** for PR 27507 at commit [`ca6c1c5`](https://github.com/apache/spark/commit/ca6c1c5eef73ae1a3d33f17acebcfcc3d77d9d63).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
AmplabJenkins removed a comment on pull request #27507: URL: https://github.com/apache/spark/pull/27507#issuecomment-641719937 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options
SparkQA commented on pull request #28776: URL: https://github.com/apache/spark/pull/28776#issuecomment-641719758 **[Test build #123723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123723/testReport)** for PR 28776 at commit [`f6cca6b`](https://github.com/apache/spark/commit/f6cca6b5163acef655d0c0e3d6cd4848b00314e0).
[GitHub] [spark] SparkQA commented on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
SparkQA commented on pull request #27507: URL: https://github.com/apache/spark/pull/27507#issuecomment-641719783 **[Test build #123716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123716/testReport)** for PR 27507 at commit [`ca6c1c5`](https://github.com/apache/spark/commit/ca6c1c5eef73ae1a3d33f17acebcfcc3d77d9d63). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
SparkQA commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641719773 **[Test build #123724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123724/testReport)** for PR 28769 at commit [`84e9012`](https://github.com/apache/spark/commit/84e9012b49af708ca1b4e5f22f495d8ef38f3122).
[GitHub] [spark] AmplabJenkins commented on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all
AmplabJenkins commented on pull request #27507: URL: https://github.com/apache/spark/pull/27507#issuecomment-641719937
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options
AmplabJenkins removed a comment on pull request #28776: URL: https://github.com/apache/spark/pull/28776#issuecomment-641717889
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
AmplabJenkins removed a comment on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641186609 Can one of the admins verify this patch?
[GitHub] [spark] cloud-fan commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
cloud-fan commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641719161 ok to test
[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
moskvax commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437858625 ## File path: python/pyspark/sql/tests/test_arrow.py ## @@ -30,10 +30,14 @@ pandas_requirement_message, pyarrow_requirement_message from pyspark.testing.utils import QuietTest from pyspark.util import _exception_message +from distutils.version import LooseVersion if have_pandas: import pandas as pd from pandas.util.testing import assert_frame_equal +pandas_version = LooseVersion(pd.__version__) +else: +pandas_version = LooseVersion("0") Review comment: Nice, will update
[GitHub] [spark] cloud-fan commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
cloud-fan commented on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641718840 cc @gengliangwang
[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns
moskvax commented on a change in pull request #28743: URL: https://github.com/apache/spark/pull/28743#discussion_r437858389 ## File path: python/pyspark/sql/pandas/conversion.py ## @@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone): # Create the Spark schema from list of names passed in with Arrow types if isinstance(schema, (list, tuple)): -arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) +inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True) + for s in (pdf[c] for c in pdf)] struct = StructType() -for name, field in zip(schema, arrow_schema): -struct.add(name, from_arrow_type(field.type), nullable=field.nullable) +for name, t in zip(schema, inferred_types): +struct.add(name, from_arrow_type(t), nullable=True) Review comment: `infer_type` only returns a type, not a `field`, so it carries no nullability information. However, it appears that the implementation of `Schema.from_pandas` ([link](https://github.com/apache/arrow/blob/b058cf0d1c26ad7984c104bb84322cc7dcc66f00/python/pyarrow/types.pxi#L1328)) never actually inferred nullability and always returned the default `nullable=True`. So this change just follows the existing behaviour of `Schema.from_pandas`.
[GitHub] [spark] AmplabJenkins commented on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.
AmplabJenkins commented on pull request #28774: URL: https://github.com/apache/spark/pull/28774#issuecomment-641718511
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.
AmplabJenkins removed a comment on pull request #28774: URL: https://github.com/apache/spark/pull/28774#issuecomment-641718511
[GitHub] [spark] AmplabJenkins commented on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options
AmplabJenkins commented on pull request #28776: URL: https://github.com/apache/spark/pull/28776#issuecomment-641717889
[GitHub] [spark] SparkQA removed a comment on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.
SparkQA removed a comment on pull request #28774: URL: https://github.com/apache/spark/pull/28774#issuecomment-641673972 **[Test build #123714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123714/testReport)** for PR 28774 at commit [`c2b6b86`](https://github.com/apache/spark/commit/c2b6b86d2c450d35d9451929eab71eaeed9801c1).
[GitHub] [spark] gengliangwang opened a new pull request #28776: [SPARK-31935][SQL] Hadoop file system config should be effective in data source options
gengliangwang opened a new pull request #28776: URL: https://github.com/apache/spark/pull/28776

### What changes were proposed in this pull request?

Make Hadoop file system config effective in data source options.

From `org.apache.hadoop.fs.FileSystem.java`:

```
public static FileSystem get(URI uri, Configuration conf) throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();

  if (scheme == null && authority == null) {     // use default FS
    return get(conf);
  }

  if (scheme != null && authority == null) {     // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
        && defaultUri.getAuthority() != null) {  // & default has authority
      return get(defaultUri, conf);              // return default
    }
  }

  String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }

  return CACHE.get(uri, conf);
}
```

Before this change, the file system configurations in data source options were not propagated in `DataSource.scala`. After this change, we can specify authority and URI scheme related configurations for scanning file systems.

This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.

### Why are the changes needed?

Allow users to specify authority and URI scheme related Hadoop configurations for file source reading.

### Does this PR introduce _any_ user-facing change?

Yes, the file system related Hadoop configuration in data source options will be effective on reading.

### How was this patch tested?

Unit test
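The PR description above says that file system settings passed as data source options should reach the configuration consulted by `FileSystem.get`. The following is a minimal, self-contained sketch of that idea using plain maps, not the actual Spark or Hadoop code. The class `OptionsOverlaySketch`, its methods, and the example URI are hypothetical names introduced for illustration; only the key format `fs.%s.impl.disable.cache` comes from the quoted excerpt.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: models a configuration as a plain map
// instead of using Hadoop's Configuration or Spark's DataSource classes.
public class OptionsOverlaySketch {

    // Copy the base settings, then let the per-read options override them.
    static Map<String, String> withOptions(Map<String, String> base,
                                           Map<String, String> options) {
        Map<String, String> merged = new HashMap<>(base);
        merged.putAll(options);
        return merged;
    }

    // Same key format as in the FileSystem.get excerpt quoted in the PR.
    static String disableCacheKey(URI uri) {
        return String.format("fs.%s.impl.disable.cache", uri.getScheme());
    }

    // Roughly mirrors conf.getBoolean(key, false) for a plain map.
    static boolean cacheDisabled(Map<String, String> conf, URI uri) {
        return Boolean.parseBoolean(conf.getOrDefault(disableCacheKey(uri), "false"));
    }

    public static void main(String[] args) {
        URI uri = URI.create("hdfs://namenode:8020/data/events");

        Map<String, String> sessionConf = new HashMap<>();
        Map<String, String> readOptions = new HashMap<>();
        readOptions.put("fs.hdfs.impl.disable.cache", "true");

        // Options not propagated: the setting is invisible to the lookup.
        System.out.println(cacheDisabled(sessionConf, uri));
        // Options overlaid on the base configuration: the setting takes effect.
        System.out.println(cacheDisabled(withOptions(sessionConf, readOptions), uri));
    }
}
```

In this sketch the first lookup sees only the session-level map, which corresponds to the V1 behavior the PR describes as the problem; the second lookup sees the merged map, which is the effect the PR aims for.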
[GitHub] [spark] gengliangwang commented on pull request #28776: [SPARK-31935][SQL] Hadoop file system config should be effective in data source options
gengliangwang commented on pull request #28776: URL: https://github.com/apache/spark/pull/28776#issuecomment-641717785 This PR backports https://github.com/apache/spark/pull/28760 to branch-3.0
[GitHub] [spark] SparkQA commented on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.
SparkQA commented on pull request #28774: URL: https://github.com/apache/spark/pull/28774#issuecomment-641717749

**[Test build #123714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123714/testReport)** for PR 28774 at commit [`c2b6b86`](https://github.com/apache/spark/commit/c2b6b86d2c450d35d9451929eab71eaeed9801c1).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] xccui commented on pull request #28768: [SPARK-31941][CORE] Replace SparkException to NoSuchElementException for applicationInfo in AppStatusStore
xccui commented on pull request #28768: URL: https://github.com/apache/spark/pull/28768#issuecomment-641714353 Sorry that I didn't realize the potential impact of using `SparkException` or `NoSuchElementException`. +1 to this change.
[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close
zhli1142015 edited a comment on pull request #28769: URL: https://github.com/apache/spark/pull/28769#issuecomment-641708063

> Of course relying on finalize is wrong, but I don't think the intent was to rely on finalize. Not closing these iterators is a bug. I see one case it clearly isn't; there may be others but haven't spotted them. It'd be nice to fix them all instead of the change in this patch but we may want to fix what we can see and also make the change in this patch for now.

Thanks for your comments, I get your point.