[GitHub] [spark] AmplabJenkins removed a comment on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28766:
URL: https://github.com/apache/spark/pull/28766#issuecomment-641771608










[GitHub] [spark] AmplabJenkins commented on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28766:
URL: https://github.com/apache/spark/pull/28766#issuecomment-641771608










[GitHub] [spark] dilipbiswal edited a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


dilipbiswal edited a comment on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766979


   @maropu 
   > Have you checked my last comment? #28750 (comment) The PR itself looks okay.
   
   Sorry, I missed that. I have added the comment now.
   
   @viirya 
   > Looks okay although I think making it a long might be also good and simpler.
   
   We could make it a long. The only concern is that we may still overflow; it would just take much longer to hit. I can reproduce the StringIndexOutOfBoundsException even after changing the length to a long. I made a minor tweak that changes the append method to fake the input's length as Int.MaxValue and adjusts the test to increase the loop count.
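
   For readers following the thread, here is a minimal sketch of the failure mode, assuming a StringConcat-like appender that tracks its total appended length in an `Int`; the names and structure are illustrative, not Spark's actual `StringConcat` implementation:

```scala
object IntLengthOverflow {
  // Illustrative appender with no overflow check on its Int length counter.
  final class NaiveConcat {
    private val parts = scala.collection.mutable.ArrayBuffer.empty[String]
    private var length: Int = 0

    // fakeLength stands in for the test tweak described above, which pretends
    // each input is Int.MaxValue characters long to hit the overflow quickly.
    def append(s: String, fakeLength: Int): Unit = {
      parts += s
      length += fakeLength // silently wraps past Int.MaxValue to a negative value
    }

    // Capacity math computed from a wrapped length is garbage; in the real
    // code this is what surfaces as StringIndexOutOfBoundsException.
    def availableCapacity(max: Int): Int = max - length
  }

  def main(args: Array[String]): Unit = {
    val concat = new NaiveConcat
    concat.append("a", Int.MaxValue)
    concat.append("b", Int.MaxValue)
    println(concat.availableCapacity(Int.MaxValue)) // negative: length wrapped
    // A Long counter would only postpone the same wrap-around, which matches
    // the observation that the exception is still reproducible with a long.
  }
}
```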






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28733:
URL: https://github.com/apache/spark/pull/28733#issuecomment-641770390










[GitHub] [spark] SparkQA removed a comment on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28766:
URL: https://github.com/apache/spark/pull/28766#issuecomment-641679788


   **[Test build #123715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123715/testReport)** for PR 28766 at commit [`a11a049`](https://github.com/apache/spark/commit/a11a049ad735ea4375e1b742c2fd9ba0093248c8).






[GitHub] [spark] SparkQA commented on pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


SparkQA commented on pull request #28766:
URL: https://github.com/apache/spark/pull/28766#issuecomment-641770461


   **[Test build #123715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123715/testReport)** for PR 28766 at commit [`a11a049`](https://github.com/apache/spark/commit/a11a049ad735ea4375e1b742c2fd9ba0093248c8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins commented on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28733:
URL: https://github.com/apache/spark/pull/28733#issuecomment-641770390










[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


cloud-fan commented on a change in pull request #28733:
URL: https://github.com/apache/spark/pull/28733#discussion_r437898264



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting `Or` clauses.
+   * Use the configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition the expression to be converted into CNF.
+   * @return If the number of expressions exceeds the threshold when converting `Or`, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as a sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom-up approach to get the CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references into one single predicate,
+          // reducing the size of the pushed-down predicates and the corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for { x <- left; y <- right } yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of the CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with the CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(

Review comment:
   `groupExprsByQualifier`
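
   As a side note for readers: a tiny worked example of the `Or` cross-product step that the `maxCnfNodeCount` guard above bounds, using plain strings instead of Catalyst `Expression`s:

```scala
// Converting (a AND b) OR (c AND d) to CNF crosses the conjunct lists of the
// two sides, so clause counts multiply: 2 x 2 = 4 here, and 2^n for n nested
// Or-of-And pairs. That exponential growth is why the conversion bails out
// with Seq.empty once left.size * right.size exceeds maxCnfNodeCount.
val left  = Seq("a", "b") // CNF of the Or's left child:  a AND b
val right = Seq("c", "d") // CNF of the Or's right child: c AND d
val cnf   = for { x <- left; y <- right } yield s"($x OR $y)"
// cnf == Seq("(a OR c)", "(a OR d)", "(b OR c)", "(b OR d)")
```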








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28775:
URL: https://github.com/apache/spark/pull/28775#issuecomment-641770136










[GitHub] [spark] xuanyuanking commented on pull request #28737: [SPARK-31913][SQL] Fix StackOverflowError in FileScanRDD

2020-06-09 Thread GitBox


xuanyuanking commented on pull request #28737:
URL: https://github.com/apache/spark/pull/28737#issuecomment-641769868


   Let me clarify: the issue is that the recursive calls in FileScanRDD cause a StackOverflowError when there are too many empty files. Could you please quantify the number of empty files in your environment?
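
   For context, a simplified sketch of that failure mode (an assumed shape, not FileScanRDD's actual code): `hasNext` and `nextIterator` call each other, and the JVM does not eliminate mutual tail calls, so every consecutive empty file adds stack frames until the stack overflows.

```scala
object EmptyFileRecursion {
  // Many consecutive empty "files".
  val files: Iterator[Iterator[Int]] =
    Iterator.fill(1000000)(Iterator.empty: Iterator[Int])
  private var current: Iterator[Int] = Iterator.empty

  def hasNext: Boolean = current.hasNext || nextIterator()

  private def nextIterator(): Boolean = {
    if (!files.hasNext) return false
    current = files.next()
    hasNext // mutual recursion: more stack frames for each empty file
  }

  def main(args: Array[String]): Unit =
    println(hasNext) // throws StackOverflowError given enough empty files
}
```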






[GitHub] [spark] AmplabJenkins commented on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28775:
URL: https://github.com/apache/spark/pull/28775#issuecomment-641770136










[GitHub] [spark] SparkQA commented on pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


SparkQA commented on pull request #28733:
URL: https://github.com/apache/spark/pull/28733#issuecomment-641769586


   **[Test build #123729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123729/testReport)** for PR 28733 at commit [`15437b3`](https://github.com/apache/spark/commit/15437b325402b4743a323c6c08f5b72990934547).






[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r437897971



##
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1FilterSuite.scala
##
@@ -19,7 +19,7 @@ package org.apache.spark.sql.execution.datasources.orc
 import scala.collection.JavaConverters._
 
 import org.apache.spark.SparkConf
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Column, DataFrame, Row}

Review comment:
   Thanks. Removed the unnecessary change.








[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r437897740



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map from ORC field name to data type. Each key represents
+   * a column; dots are used as separators for nested columns. If any part of a name
+   * contains dots, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Array[String] = Array.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields.toSeq, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).toSeq.quoted, f.dataType))

Review comment:
   Okay, changed to `Seq`. Actually, it was following a similar method in `ParquetFilters`.
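
   For readers, a small illustration of the key format described above; this is a sketch assuming the documented behavior, with the exact quoting delegated to `CatalogV2Implicits`:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a", StructType(Seq(StructField("b", IntegerType)))),
  StructField("x.y", StringType) // a top-level name that itself contains a dot
))
// getNameToOrcFieldMap(schema, caseSensitive = true) would be expected to yield:
//   "a.b"   -> IntegerType // dots separate nesting levels
//   "`x.y`" -> StringType  // a dotted name is quoted so it cannot be read
//                          // as nested access
```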








[GitHub] [spark] SparkQA removed a comment on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28775:
URL: https://github.com/apache/spark/pull/28775#issuecomment-641706549


   **[Test build #123721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123721/testReport)** for PR 28775 at commit [`3d55ef8`](https://github.com/apache/spark/commit/3d55ef8a63f1d6a698a63882d5421f4eb385240b).






[GitHub] [spark] SparkQA commented on pull request #28775: [SPARK-31486][CORE][FOLLOW-UP] Use ConfigEntry for config "spark.standalone.submit.waitAppCompletion"

2020-06-09 Thread GitBox


SparkQA commented on pull request #28775:
URL: https://github.com/apache/spark/pull/28775#issuecomment-641768855


   **[Test build #123721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123721/testReport)** for PR 28775 at commit [`3d55ef8`](https://github.com/apache/spark/commit/3d55ef8a63f1d6a698a63882d5421f4eb385240b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] viirya commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r437897212



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map from ORC field name to data type. Each key represents
+   * a column; dots are used as separators for nested columns. If any part of a name
+   * contains dots, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Array[String] = Array.empty): Seq[(String, DataType)] = {

Review comment:
   Using `Seq` now.








[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437896443



##
File path: python/pyspark/sql/pandas/serializers.py
##
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):
                 # Note: This can be removed once minimum pyarrow version is >= 0.16.1
                 s = s.astype(s.dtypes.categories.dtype)
             try:
-                array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
+                mask = s.isnull()
+                # pass _ndarray_values to avoid potential failed type checks from pandas array types
Review comment:
   Is there any test case for this?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766412


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123728/
   Test FAILed.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766399


   Merged build finished. Test FAILed.






[GitHub] [spark] dilipbiswal commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


dilipbiswal commented on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766979


   @maropu 
   > Have you checked my last comment? #28750 (comment) The PR itself looks okay.
   
   Sorry, I missed that. I have added the comment now.
   
   @viirya 
   We could make it a long. The only concern is that we may still overflow; it would just take much longer to hit. I can reproduce the StringIndexOutOfBoundsException even after changing the length to a long. I made a minor tweak that changes the append method to fake the input's length as Int.MaxValue and adjusts the test to increase the loop count.






[GitHub] [spark] SparkQA removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641765460


   **[Test build #123728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766107










[GitHub] [spark] AmplabJenkins commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28761:
URL: https://github.com/apache/spark/pull/28761#issuecomment-641766138










[GitHub] [spark] SparkQA commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


SparkQA commented on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766373


   **[Test build #123728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28761:
URL: https://github.com/apache/spark/pull/28761#issuecomment-641766138










[GitHub] [spark] AmplabJenkins commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766399










[GitHub] [spark] AmplabJenkins commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641766107










[GitHub] [spark] AmplabJenkins commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641765619










[GitHub] [spark] AmplabJenkins removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641765619










[GitHub] [spark] SparkQA commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-06-09 Thread GitBox


SparkQA commented on pull request #28761:
URL: https://github.com/apache/spark/pull/28761#issuecomment-641765400


   **[Test build #123727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123727/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662).






[GitHub] [spark] SparkQA commented on pull request #28750: [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException

2020-06-09 Thread GitBox


SparkQA commented on pull request #28750:
URL: https://github.com/apache/spark/pull/28750#issuecomment-641765460


   **[Test build #123728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123728/testReport)** for PR 28750 at commit [`1050df3`](https://github.com/apache/spark/commit/1050df32690bd4a1ad9fd92cf680a63ff41cbf68).






[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437893088



##
File path: python/pyspark/sql/pandas/conversion.py
##
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)

Review comment:
   So without this change, `pa.Schema.from_pandas` cannot handle pandas extension types and pd.NA values?








[GitHub] [spark] SparkQA commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


SparkQA commented on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641764410


   **[Test build #123720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123720/testReport)** for PR 28412 at commit [`d6c7d98`](https://github.com/apache/spark/commit/d6c7d988bd9e39caebb9a33f8c01ee230b6c2a39).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.






[GitHub] [spark] SparkQA removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641701635


   **[Test build #123720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123720/testReport)** for PR 28412 at commit [`d6c7d98`](https://github.com/apache/spark/commit/d6c7d988bd9e39caebb9a33f8c01ee230b6c2a39).






[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 edited a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768


   > Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again?
   
   Because of the disk space limitation, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
   
   > What is the benefit of this PR to users?
   
   I think the cause of the issue is a resource leak (a file handle on Windows that prevents `HistoryServerDiskManager` from releasing space) caused by a race condition, and my PR is trying to fix this. We actually run the Spark history server as a long-running service to provide a diagnostic experience to others. The benefit to us is that we don't need to stop the service and manually restart the HS after some period.






[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 edited a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768


   > Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again?
   
   Because of the disk space limitation, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
   
   > What is the benefit of this PR to users?
   
   I think the cause of the issue is a resource leak (a file handle on Windows that prevents `HistoryServerDiskManager` from releasing space) caused by a race condition, and my PR is trying to fix this. We actually use the Spark history server to provide a diagnostic experience to others. The benefit to us is that we don't need to stop the service and manually restart the HS after some period.






[GitHub] [spark] viirya commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


viirya commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437887602



##
File path: python/pyspark/sql/pandas/conversion.py
##
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
   Let's add a comment here to explain it?








[GitHub] [spark] AmplabJenkins removed a comment on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #27617:
URL: https://github.com/apache/spark/pull/27617#issuecomment-641754416










[GitHub] [spark] AmplabJenkins commented on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #27617:
URL: https://github.com/apache/spark/pull/27617#issuecomment-641754416










[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 edited a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768


   > Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again?
   
   Because of the disk space limitation, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
   
   > What is the benefit of this PR to users?
   
   I think the cause of the issue is a resource leak (a file handle on Windows that prevents `HistoryServerDiskManager` from releasing space) caused by a race condition, and my PR is trying to fix this.






[GitHub] [spark] SparkQA commented on pull request #27617: [SPARK-30865][SQL] Refactor DateTimeUtils

2020-06-09 Thread GitBox


SparkQA commented on pull request #27617:
URL: https://github.com/apache/spark/pull/27617#issuecomment-641753870


   **[Test build #123726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123726/testReport)** for PR 27617 at commit [`311a47e`](https://github.com/apache/spark/commit/311a47e433a66a932d67bf02ff587ec3f383653a).






[GitHub] [spark] zhli1142015 commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641752768


   > Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again?
   
   Because of the disk space limitation, we can only mitigate it by stopping the service and manually cleaning the disk cache, which is a little annoying.
   
   > What is the benefit of this PR to users?
   
   I think the cause of the issue is a resource leak caused by a race condition (a file handle on Windows that prevents `HistoryServerDiskManager` from releasing space), and my PR is trying to fix this.






[GitHub] [spark] gengliangwang commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


gengliangwang commented on a change in pull request #28733:
URL: https://github.com/apache/spark/pull/28733#discussion_r437884631



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting `Or` clauses.
+   * Use the configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition the expression to be converted into CNF.
+   * @return If the number of expressions exceeds the threshold when converting `Or`, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as a sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom-up approach to get the CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references into one single predicate,
+          // reducing the size of the pushed-down predicates and the corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for { x <- left; y <- right } yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of the CNF conversion result stack is supposed to be 1. There might " +

Review comment:
   Well, this should never happen. But yes, let's return Nil.








[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


cloud-fan commented on a change in pull request #28733:
URL: https://github.com/apache/spark/pull/28733#discussion_r437882318



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting `Or` clauses.
+   * Use the configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition the expression to be converted into CNF.
+   * @return If the number of expressions exceeds the threshold when converting `Or`, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as a sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom-up approach to get the CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references into one single predicate,
+          // reducing the size of the pushed-down predicates and the corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for { x <- left; y <- right } yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of the CNF conversion result stack is supposed to be 1. There might " +

Review comment:
   Shall we return Nil from here?








[GitHub] [spark] gengliangwang commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


gengliangwang commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641747325


   Then how about capturing the exception and asking the user to increase the related configuration or try loading the page again?
   What is the benefit of this PR to users?






[GitHub] [spark] agrawaldevesh commented on pull request #27636: [SPARK-30873][CORE][YARN]Handling Node Decommissioning for Yarn cluster manger in Spark

2020-06-09 Thread GitBox


agrawaldevesh commented on pull request #27636:
URL: https://github.com/apache/spark/pull/27636#issuecomment-641747304


   @SaurabhChawla100, can you briefly update the PR description to reflect how this work relates to the recently merged https://github.com/apache/spark/pull/27864? Perhaps you can leverage or enhance the abstractions added in that PR a bit?
   
   I also don't fully understand the relationship between this PR and the original decommissioning PR, https://github.com/apache/spark/pull/26440.
   
   I am trying to get a sense of the end state, with all these multiple decommissioning PRs trying to stretch the framework in different ways.
   
   @holdenk or @prakharjain09, you recently (greatly) enhanced Spark's decommissioning story, and I am curious about your thoughts on this PR and how you see it fitting in with the work that you have done.
   
   From what I can tell:
   * https://github.com/apache/spark/pull/26440 improved decommissioning for compute, by not scheduling work on executors that will be removed soon. It seemed to be a bit k8s-oriented.
   * https://github.com/apache/spark/pull/27864 improves this further by eagerly replicating cached blocks, though not regular shuffle blocks.
   * This PR, https://github.com/apache/spark/pull/27636, has a bit of a YARN focus: it clears the shuffle state to force an eager re-computation and has special handling for ignoring the fetch failures. But it does not seem to build on top of the previous two PRs.
   
   Thank you for working on this.






[GitHub] [spark] gengliangwang commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


gengliangwang commented on a change in pull request #28733:
URL: https://github.com/apache/spark/pull/28733#discussion_r437880281



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting `Or` clauses.
+   * Use the configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition the expression to be converted into CNF.
+   * @return If the number of expressions exceeds the threshold when converting `Or`, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as a sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom-up approach to get the CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references into one single predicate,
+          // reducing the size of the pushed-down predicates and the corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for { x <- left; y <- right } yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of the CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with the CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(

Review comment:
   Hmm, then the name contains two `By`s.








[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-09 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-641746170


   @viirya @maropu Can you please help review this PR?






[GitHub] [spark] cloud-fan commented on a change in pull request #28733: [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion

2020-06-09 Thread GitBox


cloud-fan commented on a change in pull request #28733:
URL: https://github.com/apache/spark/pull/28733#discussion_r437879694



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##
@@ -198,6 +200,90 @@ trait PredicateHelper {
     case e: Unevaluable => false
     case e => e.children.forall(canEvaluateWithinJoin)
   }
+
+  /**
+   * Convert an expression into conjunctive normal form.
+   * Definition and algorithm: https://en.wikipedia.org/wiki/Conjunctive_normal_form
+   * CNF can explode exponentially in the size of the input expression when converting Or clauses.
+   * Use a configuration MAX_CNF_NODE_COUNT to prevent such cases.
+   *
+   * @param condition to be converted into CNF.
+   * @return If the number of expressions exceeds threshold on converting Or, return Seq.empty.
+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as a sequence of disjunctive expressions.
+   */
+  def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new mutable.Stack[Seq[Expression]]
+    val maxCnfNodeCount = SQLConf.get.maxCnfNodeCount
+    // Bottom up approach to get CNF of sub-expressions
+    while (postOrderNodes.nonEmpty) {
+      val cnf = postOrderNodes.pop() match {
+        case _: And =>
+          val right: Seq[Expression] = resultStack.pop()
+          val left: Seq[Expression] = resultStack.pop()
+          left ++ right
+        case _: Or =>
+          // For each side, there is no need to expand predicates of the same references.
+          // So here we can aggregate predicates of the same references as one single predicate,
+          // for reducing the size of pushed down predicates and corresponding codegen.
+          val right = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          val left = aggregateExpressionsOfSameQualifiers(resultStack.pop())
+          // Stop the loop whenever the result exceeds the `maxCnfNodeCount`
+          if (left.size * right.size > maxCnfNodeCount) {
+            Seq.empty
+          } else {
+            for { x <- left; y <- right } yield Or(x, y)
+          }
+        case other => other :: Nil
+      }
+      if (cnf.isEmpty) {
+        return Seq.empty
+      }
+      if (resultStack.length != 1) {
+        logWarning("The length of CNF conversion result stack is supposed to be 1. There might " +
+          "be something wrong with CNF conversion.")
+      }
+      resultStack.push(cnf)
+    }
+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameQualifiers(

Review comment:
   nit: `groupByExprsByQualifier`
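   
   For readers following along, a tiny standalone sketch of the qualifier-grouping idea behind the helper being renamed (hypothetical names, not the PR's code):
   
   ```
   // Group predicates by the qualifier (e.g. table alias) their references belong to,
   // then And the predicates of each group into one combined predicate.
   case class Pred(qualifier: String, sql: String)
   
   def groupPredicatesByQualifier(preds: Seq[Pred]): Seq[String] =
     preds.groupBy(_.qualifier).values.map(_.map(_.sql).mkString("(", " AND ", ")")).toSeq
   
   groupPredicatesByQualifier(Seq(Pred("a", "a.x > 1"), Pred("a", "a.y < 2"), Pred("b", "b.z = 3")))
   // => Seq("(a.x > 1 AND a.y < 2)", "(b.z = 3)") (ordering of groups unspecified)
   ```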





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhli1142015 commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641744773


   > could you describe an end-to-end use case that can reproduce the error page in PR description? Does it only happen when leveldb is evicted or UI server is shutdown?
   
   Sure. In our case, we host many big event files (>200; the average LevelDB size is 60~70 MB) in the history server, so switching pages between different applications triggers `HistoryServerDiskManager` to release disk space, and then the error happens.
   To reproduce on a dev machine, set the configuration below, run the history server with two applications, open the first application's job page, and then open the second one.
   spark.history.retainedApplications 1
   spark.history.store.maxDiskUsage 10k
   spark.history.store.path d://cache
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736976







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923


   **[Test build #123725 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)**
 for PR 28743 at commit 
[`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437872692



##
File path: python/pyspark/sql/pandas/serializers.py
##
@@ -150,15 +151,22 @@ def _create_batch(self, series):
         series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
 
         def create_array(s, t):
-            mask = s.isnull()
+            # Create with __arrow_array__ if the series' backing array implements it
+            series_array = getattr(s, 'array', s._values)
+            if hasattr(series_array, "__arrow_array__"):
+                return series_array.__arrow_array__(type=t)
+
             # Ensure timestamp series are in expected form for Spark internal representation
             if t is not None and pa.types.is_timestamp(t):
                 s = _check_series_convert_timestamps_internal(s, self._timezone)
-            elif type(s.dtype) == pd.CategoricalDtype:
+            elif is_categorical_dtype(s.dtype):

Review comment:
   By the way, this change was made as `CategoricalDtype` is only imported 
into the root pandas namespace after pandas 0.24.0, which was causing 
`AttributeError` when testing with earlier versions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641736450


   **[Test build #123725 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)**
 for PR 28743 at commit 
[`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] siknezevic commented on pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements

2020-06-09 Thread GitBox


siknezevic commented on pull request #27246:
URL: https://github.com/apache/spark/pull/27246#issuecomment-641736377


   > Also, could you add some benchmark classes in https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark ?
   
   Hello @maropu,
   I checked the available benchmarks, and there is already one that can be utilized:
   
   https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
   
   I am using Databricks' toolkit for all my testing (10TB, 100TB datasets). TPCDSQueryBenchmark is based on that toolkit, and I was able to test spilling with it. I executed the benchmark in the following way:
   
   /opt/spark/bin/spark-submit --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --conf 'spark.sql.sortMergeJoinExec.buffer.spill.threshold=6000' --conf 'spark.sql.sortMergeJoinExec.buffer.in.memory.threshold=1000' '/tmp/spark-sql_2.11-2.4.6-SNAPSHOT-tests.jar' --data-location '/user/testusera1/tpcds/datasets-1g/sf1-parquet/useDecimal=true,useDate=true,filterNull=false' --query-filter 'q14a'
   
   It runs fine, and I am able to pass Spark config parameters to it to trigger spilling. I believe we do not need a new benchmark. Do you agree?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


gengliangwang commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641735784


   @zhli1142015 sorry I left comments in the code before I read the discussion 
in the PR.
   
   So, before you update the related code,  could you describe an end-to-end 
use case that can reproduce the error page in PR description? Does it only 
happen when leveldb is evicted or UI server is shutdown?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a change in pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


yaooqinn commented on a change in pull request #28766:
URL: https://github.com/apache/spark/pull/28766#discussion_r437871769



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala
##
@@ -433,4 +433,35 @@ class TimestampFormatterSuite extends 
DatetimeFormatterSuite {
   assert(formatter.format(date(1970, 4, 10)) == "100")
 }
   }
+
+  test("SPARK-31939: Fix Parsing day of year when year field pattern is 
missing") {
+// resolved to queryable LocaleDate or fail directly
+val f0 = TimestampFormatter("-dd-DD", UTC, isParsing = true)

Review comment:
   Sounds good.
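   
   For readers without the PR context, a small standalone java.time illustration (not Spark code) of the cross-check this thread is about:
   
   ```
   import java.time.LocalDate
   import java.time.format.DateTimeFormatter
   
   // When month/day and day-of-year (DDD) are both parsed, java.time resolves the
   // date from year + month + day and then cross-checks DAY_OF_YEAR against it.
   val f = DateTimeFormatter.ofPattern("yyyy-MM-dd-DDD")
   LocalDate.parse("2020-03-01-061", f) // ok: day 61 of leap year 2020 is March 1
   // LocalDate.parse("2020-03-01-062", f) fails with a DateTimeParseException caused by:
   // "Conflict found: Field DayOfYear 61 differs from DayOfYear 62 derived from 2020-03-01"
   ```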





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a change in pull request #28766: [SPARK-31939][SQL] Fix Parsing day of year when year field pattern is missing

2020-06-09 Thread GitBox


yaooqinn commented on a change in pull request #28766:
URL: https://github.com/apache/spark/pull/28766#discussion_r437871990



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
##
@@ -39,6 +39,18 @@ trait DateTimeFormatterHelper {
     }
   }
 
+  private def verifyLocalDate(
+      accessor: TemporalAccessor, field: ChronoField, candidate: LocalDate): Unit = {
+    if (accessor.isSupported(field) && candidate.isSupported(field)) {

Review comment:
   For the time being, yes. I can remove this condition.

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
##
@@ -39,6 +39,18 @@ trait DateTimeFormatterHelper {
     }
   }
 
+  private def verifyLocalDate(
+      accessor: TemporalAccessor, field: ChronoField, candidate: LocalDate): Unit = {
+    if (accessor.isSupported(field) && candidate.isSupported(field)) {
+      val actual = accessor.get(field)
+      val expected = candidate.get(field)
+      if (actual != expected) {
+        throw new DateTimeException(s"Conflict found: Field $field $actual differs from" +
+          s" $field $expected derived from $candidate")

Review comment:
   OK





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641732806







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


gengliangwang commented on a change in pull request #28769:
URL: https://github.com/apache/spark/pull/28769#discussion_r437870387



##
File path: common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java
##
@@ -276,6 +276,41 @@ public void testNegativeIndexValues() throws Exception {
     assertEquals(expected, results);
   }
 
+    @Test
+    public void testCloseLevelDBIterator() throws Exception {
+        // SPARK-31929: test when LevelDB.close() is called, related LevelDBIterators
+        // are closed. And files opened by iterators are also closed.
+        File dbpathForCloseTest = File
+                .createTempFile(

Review comment:
   please change the indents to two spaces in this file





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641732806







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


gengliangwang commented on a change in pull request #28769:
URL: https://github.com/apache/spark/pull/28769#discussion_r437869878



##
File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java
##
@@ -189,7 +198,12 @@ public void delete(Class<?> type, Object naturalKey) throws Exception {
     @Override
     public Iterator<T> iterator() {
       try {
-        return new LevelDBIterator<>(type, LevelDB.this, this);
+        LevelDBIterator<T> iterator = new LevelDBIterator<>(
+            type,

Review comment:
   Nit: put all the parameters on one line?
   ```
   LevelDBIterator<T> iterator = new LevelDBIterator<>(type, LevelDB.this, this);
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dilipbiswal commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list

2020-06-09 Thread GitBox


dilipbiswal commented on pull request #28773:
URL: https://github.com/apache/spark/pull/28773#issuecomment-641732487


   LGTM



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641689848


   **[Test build #123719 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123719/testReport)**
 for PR 28412 at commit 
[`1e514b9`](https://github.com/apache/spark/commit/1e514b910a56b719a08d6f7a7689a2a53dcc06a5).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a change in pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


gengliangwang commented on a change in pull request #28769:
URL: https://github.com/apache/spark/pull/28769#discussion_r437869753



##
File path: common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java
##
@@ -256,6 +275,7 @@ void closeIterator(LevelDBIterator<?> it) throws IOException {
       DB _db = this._db.get();
       if (_db != null) {
         it.close();
+        iteratorTracker.remove(it);

Review comment:
   shall we remove it even when `_db` is null?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28412: [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster

2020-06-09 Thread GitBox


SparkQA commented on pull request #28412:
URL: https://github.com/apache/spark/pull/28412#issuecomment-641732044


   **[Test build #123719 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123719/testReport)**
 for PR 28412 at commit 
[`1e514b9`](https://github.com/apache/spark/commit/1e514b910a56b719a08d6f7a7689a2a53dcc06a5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28773:
URL: https://github.com/apache/spark/pull/28773#issuecomment-641725711







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28773:
URL: https://github.com/apache/spark/pull/28773#issuecomment-641725711







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28773:
URL: https://github.com/apache/spark/pull/28773#issuecomment-641651098


   **[Test build #123711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123711/testReport)**
 for PR 28773 at commit 
[`1013ac8`](https://github.com/apache/spark/commit/1013ac8064c1e380f4be4c297c165fae1a20602e).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28773: [SPARK-26905][SQL] Add `TYPE` in the ANSI non-reserved list

2020-06-09 Thread GitBox


SparkQA commented on pull request #28773:
URL: https://github.com/apache/spark/pull/28773#issuecomment-641724830


   **[Test build #123711 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123711/testReport)**
 for PR 28773 at commit 
[`1013ac8`](https://github.com/apache/spark/commit/1013ac8064c1e380f4be4c297c165fae1a20602e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722284


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28349/
   Test PASSed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278


   Merged build finished. Test PASSed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641722278







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


SparkQA commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-641721923


   **[Test build #123725 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123725/testReport)**
 for PR 28743 at commit 
[`403f579`](https://github.com/apache/spark/commit/403f5796fdb7decf7c174b28efc6aa6bf2367186).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #27507:
URL: https://github.com/apache/spark/pull/27507#issuecomment-641719947


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123716/
   Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641720102







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641720102







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #27507:
URL: https://github.com/apache/spark/pull/27507#issuecomment-641679815


   **[Test build #123716 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123716/testReport)**
 for PR 27507 at commit 
[`ca6c1c5`](https://github.com/apache/spark/commit/ca6c1c5eef73ae1a3d33f17acebcfcc3d77d9d63).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #27507:
URL: https://github.com/apache/spark/pull/27507#issuecomment-641719937


   Merged build finished. Test FAILed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options

2020-06-09 Thread GitBox


SparkQA commented on pull request #28776:
URL: https://github.com/apache/spark/pull/28776#issuecomment-641719758


   **[Test build #123723 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123723/testReport)**
 for PR 28776 at commit 
[`f6cca6b`](https://github.com/apache/spark/commit/f6cca6b5163acef655d0c0e3d6cd4848b00314e0).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

2020-06-09 Thread GitBox


SparkQA commented on pull request #27507:
URL: https://github.com/apache/spark/pull/27507#issuecomment-641719783


   **[Test build #123716 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123716/testReport)**
 for PR 27507 at commit 
[`ca6c1c5`](https://github.com/apache/spark/commit/ca6c1c5eef73ae1a3d33f17acebcfcc3d77d9d63).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


SparkQA commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641719773


   **[Test build #123724 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123724/testReport)**
 for PR 28769 at commit 
[`84e9012`](https://github.com/apache/spark/commit/84e9012b49af708ca1b4e5f22f495d8ef38f3122).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #27507: [SPARK-24884][SQL] Support regexp function regexp_extract_all

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #27507:
URL: https://github.com/apache/spark/pull/27507#issuecomment-641719937







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28776:
URL: https://github.com/apache/spark/pull/28776#issuecomment-641717889







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641186609


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


cloud-fan commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641719161


   ok to test



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437858625



##
File path: python/pyspark/sql/tests/test_arrow.py
##
@@ -30,10 +30,14 @@
     pandas_requirement_message, pyarrow_requirement_message
 from pyspark.testing.utils import QuietTest
 from pyspark.util import _exception_message
+from distutils.version import LooseVersion
 
 if have_pandas:
     import pandas as pd
     from pandas.util.testing import assert_frame_equal
+    pandas_version = LooseVersion(pd.__version__)
+else:
+    pandas_version = LooseVersion("0")

Review comment:
   Nice, will update





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


cloud-fan commented on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641718840


   cc @gengliangwang 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] moskvax commented on a change in pull request #28743: [SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns

2020-06-09 Thread GitBox


moskvax commented on a change in pull request #28743:
URL: https://github.com/apache/spark/pull/28743#discussion_r437858389



##
File path: python/pyspark/sql/pandas/conversion.py
##
@@ -394,10 +394,11 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Create the Spark schema from list of names passed in with Arrow types
         if isinstance(schema, (list, tuple)):
-            arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
+            inferred_types = [pa.infer_type(s, mask=s.isna(), from_pandas=True)
+                              for s in (pdf[c] for c in pdf)]
             struct = StructType()
-            for name, field in zip(schema, arrow_schema):
-                struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
+            for name, t in zip(schema, inferred_types):
+                struct.add(name, from_arrow_type(t), nullable=True)

Review comment:
   `infer_type` only returns a type, not a `field`, which would supposedly 
have nullability information. But it appears that in the implementation of 
`Schema.from_pandas` 
([link](https://github.com/apache/arrow/blob/b058cf0d1c26ad7984c104bb84322cc7dcc66f00/python/pyarrow/types.pxi#L1328)),
 inferring nullability was not actually done and the default `nullable=True` 
would always be returned. So this change is just following the existing 
behaviour of `Schema.from_pandas`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28774:
URL: https://github.com/apache/spark/pull/28774#issuecomment-641718511







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.

2020-06-09 Thread GitBox


AmplabJenkins removed a comment on pull request #28774:
URL: https://github.com/apache/spark/pull/28774#issuecomment-641718511







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #28776: [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options

2020-06-09 Thread GitBox


AmplabJenkins commented on pull request #28776:
URL: https://github.com/apache/spark/pull/28776#issuecomment-641717889







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.

2020-06-09 Thread GitBox


SparkQA removed a comment on pull request #28774:
URL: https://github.com/apache/spark/pull/28774#issuecomment-641673972


   **[Test build #123714 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123714/testReport)**
 for PR 28774 at commit 
[`c2b6b86`](https://github.com/apache/spark/commit/c2b6b86d2c450d35d9451929eab71eaeed9801c1).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang opened a new pull request #28776: [SPARK-31935][SQL] Hadoop file system config should be effective in data source options

2020-06-09 Thread GitBox


gengliangwang opened a new pull request #28776:
URL: https://github.com/apache/spark/pull/28776


   
   
   ### What changes were proposed in this pull request?
   
   Make Hadoop file system config effective in data source options.
   
   From `org.apache.hadoop.fs.FileSystem.java`:
   ```
     public static FileSystem get(URI uri, Configuration conf) throws IOException {
       String scheme = uri.getScheme();
       String authority = uri.getAuthority();
   
       if (scheme == null && authority == null) {     // use default FS
         return get(conf);
       }
   
       if (scheme != null && authority == null) {     // no authority
         URI defaultUri = getDefaultUri(conf);
         if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
             && defaultUri.getAuthority() != null) {  // & default has authority
           return get(defaultUri, conf);              // return default
         }
       }
   
       String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
       if (conf.getBoolean(disableCacheName, false)) {
         return createFileSystem(uri, conf);
       }
   
       return CACHE.get(uri, conf);
     }
   ```
   Before the changes, the file system configurations in data source options are not propagated in `DataSource.scala`.
   After the changes, we can specify authority and URI scheme related configurations for scanning file systems.
   
   This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
   
   ### Why are the changes needed?
   
   Allow users to specify authority and URI scheme related Hadoop configurations for file source reading.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the file system related Hadoop configuration in data source option will 
be effective on reading.
   
   ### How was this patch tested?
   
   Unit test
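   
   To make the user-facing change concrete, a hedged sketch of the read path after this change (the s3a keys and values are placeholders, not from the PR; assumes a `SparkSession` named `spark`):
   
   ```
   // Hadoop FS settings passed as data source options now take effect when the
   // path's FileSystem is resolved, matching the existing V2 behavior.
   val df = spark.read
     .option("fs.s3a.endpoint", "http://localhost:9000")
     .option("fs.s3a.impl.disable.cache", "true")
     .parquet("s3a://some-bucket/some/path")
   ```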



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #28776: [SPARK-31935][SQL] Hadoop file system config should be effective in data source options

2020-06-09 Thread GitBox


gengliangwang commented on pull request #28776:
URL: https://github.com/apache/spark/pull/28776#issuecomment-641717785


   This PR backports https://github.com/apache/spark/pull/28760 to branch-3.0



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #28774: [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function.

2020-06-09 Thread GitBox


SparkQA commented on pull request #28774:
URL: https://github.com/apache/spark/pull/28774#issuecomment-641717749


   **[Test build #123714 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123714/testReport)**
 for PR 28774 at commit 
[`c2b6b86`](https://github.com/apache/spark/commit/c2b6b86d2c450d35d9451929eab71eaeed9801c1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xccui commented on pull request #28768: [SPARK-31941][CORE] Replace SparkException to NoSuchElementException for applicationInfo in AppStatusStore

2020-06-09 Thread GitBox


xccui commented on pull request #28768:
URL: https://github.com/apache/spark/pull/28768#issuecomment-641714353


   Sorry that I didn't realize the potential impact of using `SparkException` 
or `NoSuchElementException`.
   
   +1 to this change.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhli1142015 edited a comment on pull request #28769: [SPARK-31929][WEBUI] Close leveldbiterator when leveldb.close

2020-06-09 Thread GitBox


zhli1142015 edited a comment on pull request #28769:
URL: https://github.com/apache/spark/pull/28769#issuecomment-641708063


   > Of course relying on finalize is wrong, but I don't think the intent was 
to rely on finalize. Not closing these iterators is a bug. I see one case it 
clearly isn't; there may be others but haven't spotted them. It'd be nice to 
fix them all instead of the change in this patch but we may want to fix what we 
can see and also make the change in this patch for now.
   
   Thanks for your comments, I get your point.
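   
   For illustration, a minimal sketch (hypothetical, not Spark's `LevelDB` class) of the pattern under discussion: track open iterators so `close()` can release them deterministically instead of relying on `finalize`:
   
   ```
   import java.util.concurrent.ConcurrentHashMap
   
   class Store extends AutoCloseable {
     // Track every iterator handed out so close() can release stragglers.
     private val openIterators = ConcurrentHashMap.newKeySet[AutoCloseable]()
   
     def iterator(): AutoCloseable = {
       val it = new AutoCloseable {
         override def close(): Unit = openIterators.remove(this) // also free native resources here
       }
       openIterators.add(it)
       it
     }
   
     override def close(): Unit = {
       openIterators.forEach(_.close()) // weakly consistent view tolerates self-removal
       openIterators.clear()
     }
   }
   ```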



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


