[GitHub] [spark] SparkQA commented on pull request #34291: [SPARK-37020][SQL] DS V2 LIMIT push down

2021-10-22 Thread GitBox


SparkQA commented on pull request #34291:
URL: https://github.com/apache/spark/pull/34291#issuecomment-950105537


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49023/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950104918


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49022/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34356: [SPARK-36554][PYTHON] Expose make_date expression in functions.scala

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34356:
URL: https://github.com/apache/spark/pull/34356#issuecomment-950104919


   Can one of the admins verify this patch?





[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950104918


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49022/
   





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950102382


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49022/
   





[GitHub] [spark] LuciferYang commented on a change in pull request #34368: [WIP][SPARK-37072][CORE][TEST] Pass all UTs in `repl` with Java 17

2021-10-22 Thread GitBox


LuciferYang commented on a change in pull request #34368:
URL: https://github.com/apache/spark/pull/34368#discussion_r734931211



##
File path: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala
##
@@ -407,6 +417,24 @@ private[spark] object ClosureCleaner extends Logging {
 }
   }
 
+  /**
+   * This method is used to get the final modifier field when using Java 17.
+   */
+  private def getFinalModifiersFieldForJava17(field: Field): Option[Field] = {
+    if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) &&
+        Modifier.isFinal(field.getModifiers)) {

Review comment:
   ok, will fix it later







[GitHub] [spark] LuciferYang commented on a change in pull request #34368: [WIP][SPARK-37072][CORE][TEST] Pass all UTs in `repl` with Java 17

2021-10-22 Thread GitBox


LuciferYang commented on a change in pull request #34368:
URL: https://github.com/apache/spark/pull/34368#discussion_r734931107



##
File path: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala
##
@@ -407,6 +417,24 @@ private[spark] object ClosureCleaner extends Logging {
 }
   }
 
+  /**
+   * This method is used to get the final modifier field when using Java 17.
+   */
+  private def getFinalModifiersFieldForJava17(field: Field): Option[Field] = {
+    if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) &&
+        Modifier.isFinal(field.getModifiers)) {
+      val methodGetDeclaredFields0 = classOf[Class[_]]
+        .getDeclaredMethod("getDeclaredFields0", classOf[Boolean])
+      methodGetDeclaredFields0.setAccessible(true)
+      val fields = methodGetDeclaredFields0.invoke(classOf[Field], false.asInstanceOf[Object])
+        .asInstanceOf[Array[Field]]
+      val modifiersFieldOption = fields.find(field => "modifiers".equals(field.getName))
+      assert(modifiersFieldOption.isDefined)

Review comment:
   ok, will fix it later

##
File path: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala
##
@@ -394,8 +395,17 @@ private[spark] object ClosureCleaner extends Logging {
 parent = null, outerThis, capturingClass, accessedFields)
 
   val outerField = func.getClass.getDeclaredField("arg$1")
+  // SPARK-37072: When Java 17 is used and `outerField` is read-only,
+  // the content of `outerField` cannot be set by the reflection API directly.
+  // But we can remove the `final` modifier of `outerField` before setting the value.

Review comment:
   I don't know what the result will be, but I can do an experiment later
   
   







[GitHub] [spark] SparkQA commented on pull request #34291: [SPARK-37020][SQL] DS V2 LIMIT push down

2021-10-22 Thread GitBox


SparkQA commented on pull request #34291:
URL: https://github.com/apache/spark/pull/34291#issuecomment-950097032


   **[Test build #144552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144552/testReport)** for PR 34291 at commit [`496878c`](https://github.com/apache/spark/commit/496878c041cf40e0e079f4b81db7b1c90690f615).





[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950091643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144551/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950091643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144551/
   





[GitHub] [spark] huaxingao commented on a change in pull request #34291: [SPARK-37020][SQL] DS V2 LIMIT push down

2021-10-22 Thread GitBox


huaxingao commented on a change in pull request #34291:
URL: https://github.com/apache/spark/pull/34291#discussion_r734929416



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
##
@@ -298,17 +299,22 @@ private[sql] case class JDBCRelation(
   requiredColumns: Array[String],
   finalSchema: StructType,
   filters: Array[Filter],
-  groupByColumns: Option[Array[String]]): RDD[Row] = {
+  groupByColumns: Option[Array[String]],
+  limit: Option[Limit]): RDD[Row] = {
+// If limit is pushed down, only a limited number of rows will be returned.
+// PartitionInfo will be ignored and the query will be done in one task.

Review comment:
   Good idea! I will fix this. Thanks!
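
   For context, a minimal PySpark sketch of the behavior described above (the connection options below are placeholders, not taken from this PR): when the source supports DS V2 limit pushdown, the `LIMIT` is evaluated by the database, so only a bounded number of rows come back and the read can run as a single task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical JDBC connection options, for illustration only.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "orders")
    .load()
)

# With limit pushdown, the LIMIT is sent to the database; partition info can be
# ignored and the bounded result is read back in a single task.
df.limit(10).show()
```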







[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950063876


   **[Test build #144551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950086802


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49022/
   





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950068793


   **[Test build #144551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] HyukjinKwon commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950068687


   The CRAN check in the SparkR build validates the DESCRIPTION file. The check won't validate the values, but it does check at least the format, etc. We might need to make sure that the tests pass anyway, although it's unlikely that the tests are broken by this change.





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950063876


   **[Test build #144551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950063381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144550/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950063381


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144550/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


SparkQA removed a comment on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950038749


   **[Test build #144550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144550/testReport)** for PR 34353 at commit [`4003f0c`](https://github.com/apache/spark/commit/4003f0c01f769e48b98573247439d1d15248d082).





[GitHub] [spark] dchvn edited a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn edited a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950060026


   CC @HyukjinKwon, updated some nits. May I resolve the other improvements in a follow-up PR later?





[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950060026


   CC @HyukjinKwon, updated some nits. May I resolve the other improvements in a follow-up PR?





[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923937



##
File path: python/pyspark/pandas/frame.py
##
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = 
"left", overwrite: bool = True)
 internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
 self._update_internal_frame(internal, requires_same_anchor=False)
 
+def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
   TODO note updated.







[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923848



##
File path: python/pyspark/pandas/tests/test_dataframe.py
##
@@ -6025,6 +6025,64 @@ def test_multi_index_dtypes(self):
 )
 self.assert_eq(psmidx.dtypes, expected)
 
+def test_cov(self):
+# SPARK-36396: Implement DataFrame.cov
+
+# int
+pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", 
"b"])
+psdf = ps.from_pandas(pdf)
+self.assert_eq(pdf.cov(), psdf.cov(), almost=True)
+self.assert_eq(pdf.cov(min_periods=4), psdf.cov(min_periods=4), 
almost=True)
+self.assert_eq(pdf.cov(min_periods=5), psdf.cov(min_periods=5), 
almost=True)

Review comment:
   thanks! updated.







[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923406



##
File path: python/pyspark/pandas/frame.py
##
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = 
"left", overwrite: bool = True)
 internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
 self._update_internal_frame(internal, requires_same_anchor=False)
 
+def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+"""
+Compute pairwise covariance of columns, excluding NA/null values.
+
+Compute the pairwise covariance among the series of a DataFrame.
+The returned data frame is the `covariance matrix
+`__ of the columns
+of the DataFrame.
+
+Both NA and null values are automatically excluded from the
+calculation. (See the note below about bias from missing values.)
+A threshold can be set for the minimum number of
+observations for each value created. Comparisons with observations
+below this threshold will be returned as ``NaN``.
+
+This method is generally used for the analysis of time series data to
+understand the relationship between different measures
+across time.
+
+.. versionadded:: 3.3.0
+
+Parameters
+--
+min_periods : int, optional
+Minimum number of observations required per pair of columns
+to have a valid result.
+
+Returns
+---
+DataFrame
+The covariance matrix of the series of the DataFrame.
+
+See Also
+
+Series.cov : Compute covariance with another Series.
+
+Examples
+
+>>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+...   columns=['dogs', 'cats'])
+>>> df.cov()
+  dogs  cats
+dogs  0.67 -1.00
+cats -1.00  1.67
+
+>>> np.random.seed(42)
+>>> df = ps.DataFrame(np.random.randn(1000, 5),
+...   columns=['a', 'b', 'c', 'd', 'e'])
+>>> df.cov()
+  a b c d e
+a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+**Minimum number of periods**
+
+This method also supports an optional ``min_periods`` keyword
+that specifies the required minimum number of non-NA observations for
+each column pair in order to have a valid result:
+
+>>> np.random.seed(42)
+>>> df = pd.DataFrame(np.random.randn(20, 3),
+...   columns=['a', 'b', 'c'])
+>>> df.loc[df.index[:5], 'a'] = np.nan
+>>> df.loc[df.index[5:10], 'b'] = np.nan
+>>> sdf = ps.from_pandas(df)
+>>> sdf.cov(min_periods=12)
+  a b c
+a  0.316741   NaN -0.150812
+b   NaN  1.248003  0.191417
+c -0.150812  0.191417  0.895202
+"""
+min_periods = 1 if min_periods is None else min_periods
+
+# Only compute covariance for Boolean and Numeric except Decimal
+psdf = self[
+[
+col
+for col in self.columns
+if isinstance(self[col].spark.data_type, BooleanType)
+or (
+isinstance(self[col].spark.data_type, NumericType)
+and not isinstance(self[col].spark.data_type, DecimalType)
+)
+]
+]
+
+num_cols = len(psdf.columns)

Review comment:
   Seems we need a quick check on ```min_periods > len(self)```, not ```num_cols```? Thanks!
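
   As a side note, a small pandas-only illustration (not pyspark code) of why the guard should compare against the row count: when `min_periods` exceeds the number of available rows, every entry of the covariance matrix comes back as NaN.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.randn(3, 2), columns=["a", "b"])

print(pdf.cov())               # 2x2 covariance matrix computed from the 3 rows
print(pdf.cov(min_periods=5))  # min_periods > number of rows, so all entries are NaN
```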







[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923248



##
File path: python/pyspark/pandas/frame.py
##
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = 
"left", overwrite: bool = True)
 internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
 self._update_internal_frame(internal, requires_same_anchor=False)
 
+def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
   Thanks for reviewing!
   I think we could keep the interface consistent with pandas, like ```Series.cov```.
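
   For reference, the plain pandas ```Series.cov``` interface referred to here also takes ```min_periods``` (shown with plain pandas, as an illustration only):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, None])
s2 = pd.Series([2.0, 1.0, 0.0, 4.0])

print(s1.cov(s2))                 # covariance over the 3 overlapping non-NA rows
print(s1.cov(s2, min_periods=4))  # nan: fewer than 4 overlapping observations
```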







[GitHub] [spark] SparkQA commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


SparkQA commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950056971


   **[Test build #144550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144550/testReport)** for PR 34353 at commit [`4003f0c`](https://github.com/apache/spark/commit/4003f0c01f769e48b98573247439d1d15248d082).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] dchvn commented on a change in pull request #34363: [SPARK-37083][PYTHON] Inline type hints for python/pyspark/accumulators.py

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34363:
URL: https://github.com/apache/spark/pull/34363#discussion_r734921537



##
File path: python/pyspark/accumulators.py
##
@@ -176,44 +193,44 @@ class AccumulatorParam(object):
 [7.0, 8.0, 9.0]
 """
 
-def zero(self, value):
+def zero(self, value: T) -> T:
 """
 Provide a "zero value" for the type, compatible in dimensions with the
 provided `value` (e.g., a zero vector)
 """
 raise NotImplementedError
 
-def addInPlace(self, value1, value2):
+def addInPlace(self, value1: T, value2: T) -> T:
 """
 Add two values of the accumulator's data type, returning a new value;
 for efficiency, can also update `value1` in place and return it.
 """
 raise NotImplementedError
 
 
-class AddingAccumulatorParam(AccumulatorParam):
+class AddingAccumulatorParam(AccumulatorParam[U]):
 
 """
 An AccumulatorParam that uses the + operators to add values. Designed for 
simple types
 such as integers, floats, and lists. Requires the zero value for the 
underlying type
 as a parameter.
 """
 
-def __init__(self, zero_value):
+def __init__(self, zero_value: U):
 self.zero_value = zero_value
 
-def zero(self, value):
+def zero(self, value: U) -> U:
 return self.zero_value
 
-def addInPlace(self, value1, value2):
-value1 += value2
+def addInPlace(self, value1: U, value2: U) -> U:
+value1 += value2  # type: ignore[operator]

Review comment:
   Yes, I am trying to remove that ignore. Thanks!
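
   For context, a self-contained sketch of the kind of self-typed protocol that can let ```+=``` type-check without the ignore; this is a hypothetical illustration, and the actual ```pyspark._typing.SupportsIAdd``` may be defined differently.

```python
from typing import Protocol, TypeVar

U = TypeVar("U", bound="SupportsIAdd")


class SupportsIAdd(Protocol):
    # Self-typed __iadd__: the result has the same type as the left operand.
    def __iadd__(self: U, other: U) -> U:
        ...


def add_in_place(value1: U, value2: U) -> U:
    value1 += value2  # resolves through __iadd__ declared on the protocol
    return value1


print(add_in_place([1, 2], [3]))  # runtime demo with lists: [1, 2, 3]
```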







[GitHub] [spark] dchvn commented on pull request #34238: [SPARK-36969][PYTHON] Inline type hints for SparkContext

2021-10-22 Thread GitBox


dchvn commented on pull request #34238:
URL: https://github.com/apache/spark/pull/34238#issuecomment-950052545


   @ueshin Thanks for your help! I updated this PR. Could you take another look?





[GitHub] [spark] dchvn commented on a change in pull request #34363: [SPARK-37083][PYTHON] Inline type hints for python/pyspark/accumulators.py

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34363:
URL: https://github.com/apache/spark/pull/34363#discussion_r734920972



##
File path: python/pyspark/accumulators.py
##
@@ -264,7 +281,12 @@ def authenticate_and_accum_updates():
 
 class AccumulatorServer(SocketServer.TCPServer):
 
-def __init__(self, server_address, RequestHandlerClass, auth_token):
+def __init__(
+self,
+server_address: Tuple[str, int],
+RequestHandlerClass: Type["socketserver.BaseRequestHandler"],

Review comment:
   I think we need the ```class``` type for ```RequestHandlerClass```, so ```Type["socketserver.BaseRequestHandler"]``` is appropriate.
   As far as I know:
   ```Type[socketserver.BaseRequestHandler]``` is for the class itself
   ```socketserver.BaseRequestHandler``` is for an instance
   Did I misunderstand?
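
   A quick standalone illustration of that distinction (plain Python, not pyspark code):

```python
import socketserver
from typing import Tuple, Type


def make_server(
    address: Tuple[str, int],
    handler_cls: Type[socketserver.BaseRequestHandler],  # the class itself
) -> socketserver.TCPServer:
    # TCPServer expects the handler *class* and instantiates it per request.
    return socketserver.TCPServer(address, handler_cls)


def handler_name(handler: socketserver.BaseRequestHandler) -> str:  # an instance
    return type(handler).__name__
```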







[GitHub] [spark] dchvn commented on a change in pull request #34363: [SPARK-37083][PYTHON] Inline type hints for python/pyspark/accumulators.py

2021-10-22 Thread GitBox


dchvn commented on a change in pull request #34363:
URL: https://github.com/apache/spark/pull/34363#discussion_r734920585



##
File path: python/pyspark/accumulators.py
##
@@ -20,20 +20,32 @@
 import struct
 import socketserver as SocketServer
 import threading
+from typing import TypeVar, Generic, Tuple, Callable, Type, Dict, Union, 
TYPE_CHECKING
+
 from pyspark.serializers import read_int, PickleSerializer
 
+if TYPE_CHECKING:
+from pyspark._typing import SupportsIAdd
+import socketserver.BaseRequestHandler  # type: ignore

Review comment:
   I took it from the stub file and do not know how to remove it.







[GitHub] [spark] AmplabJenkins removed a comment on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950051491


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49021/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950051491


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49021/
   





[GitHub] [spark] SparkQA commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


SparkQA commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950048633


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49021/
   





[GitHub] [spark] srowen commented on a change in pull request #34368: [WIP][SPARK-37072][CORE][TEST] Pass all UTs in `repl` with Java 17

2021-10-22 Thread GitBox


srowen commented on a change in pull request #34368:
URL: https://github.com/apache/spark/pull/34368#discussion_r734914860



##
File path: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala
##
@@ -407,6 +417,24 @@ private[spark] object ClosureCleaner extends Logging {
 }
   }
 
+  /**
+   * This method is used to get the final modifier field when using Java 17.
+   */
+  private def getFinalModifiersFieldForJava17(field: Field): Option[Field] = {
+    if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) &&
+        Modifier.isFinal(field.getModifiers)) {
+      val methodGetDeclaredFields0 = classOf[Class[_]]
+        .getDeclaredMethod("getDeclaredFields0", classOf[Boolean])
+      methodGetDeclaredFields0.setAccessible(true)
+      val fields = methodGetDeclaredFields0.invoke(classOf[Field], false.asInstanceOf[Object])
+        .asInstanceOf[Array[Field]]
+      val modifiersFieldOption = fields.find(field => "modifiers".equals(field.getName))
+      assert(modifiersFieldOption.isDefined)

Review comment:
   require, not assert

##
File path: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala
##
@@ -394,8 +395,17 @@ private[spark] object ClosureCleaner extends Logging {
 parent = null, outerThis, capturingClass, accessedFields)
 
   val outerField = func.getClass.getDeclaredField("arg$1")
+  // SPARK-37072: When Java 17 is used and `outerField` is read-only,
+  // the content of `outerField` cannot be set by the reflection API directly.
+  // But we can remove the `final` modifier of `outerField` before setting the value.

Review comment:
   I wonder what happens if we don't clear this field in the closure in this case - it seems kind of risky to do this. That said, who knows what behavior differences might arise if we don't.







[GitHub] [spark] srowen commented on a change in pull request #34351: [SPARK-37071][CORE] Make OpenHashMap serialize without reference tracking

2021-10-22 Thread GitBox


srowen commented on a change in pull request #34351:
URL: https://github.com/apache/spark/pull/34351#discussion_r734914783



##
File path: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
##
@@ -149,17 +149,12 @@ class OpenHashMap[K : ClassTag, @specialized(Long, Int, 
Double) V: ClassTag](
 }
   }
 
-  // The following member variables are declared as protected instead of 
private for the
-  // specialization to work (specialized class extends the non-specialized one 
and needs access
-  // to the "private" variables).
-  // They also should have been val's. We use var's because there is a Scala 
compiler bug that

Review comment:
   Seems reasonable to me, if the workaround is no longer necessary in order to preserve specialization.







[GitHub] [spark] srowen commented on a change in pull request #34351: [SPARK-37071][CORE] Make OpenHashMap serialize without reference tracking

2021-10-22 Thread GitBox


srowen commented on a change in pull request #34351:
URL: https://github.com/apache/spark/pull/34351#discussion_r734914728



##
File path: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
##
@@ -149,17 +149,12 @@ class OpenHashMap[K : ClassTag, @specialized(Long, Int, 
Double) V: ClassTag](
 }
   }
 
-  // The following member variables are declared as protected instead of 
private for the
-  // specialization to work (specialized class extends the non-specialized one 
and needs access
-  // to the "private" variables).
-  // They also should have been val's. We use var's because there is a Scala 
compiler bug that

Review comment:
   Oh, never mind - I read the comments out of order.







[GitHub] [spark] srowen commented on a change in pull request #34351: [SPARK-37071][CORE] Make OpenHashMap serialize without reference tracking

2021-10-22 Thread GitBox


srowen commented on a change in pull request #34351:
URL: https://github.com/apache/spark/pull/34351#discussion_r734914681



##
File path: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
##
@@ -149,17 +149,12 @@ class OpenHashMap[K : ClassTag, @specialized(Long, Int, 
Double) V: ClassTag](
 }
   }
 
-  // The following member variables are declared as protected instead of 
private for the
-  // specialization to work (specialized class extends the non-specialized one 
and needs access
-  // to the "private" variables).
-  // They also should have been val's. We use var's because there is a Scala 
compiler bug that

Review comment:
   Is there any way to check that the specialization still works - it would 
show up as a bunch of synthetic impl classes in the build?







[GitHub] [spark] srowen commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


srowen commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950044528


   I also can't see the repo, but whatever. It seems fine to make this change - not sure any tests will exercise it anyway.
   This is a fine change for 3.3, but note that only Java 11 is supported in 3.2.0 anyway.





[GitHub] [spark] SparkQA commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


SparkQA commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950043823


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49021/
   





[GitHub] [spark] HyukjinKwon edited a comment on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon edited a comment on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950042051


   https://github.com/jupyter/drumsticks: I cannot access it 😢. Once it's enabled, you could rebase in this PR. That should kick off the job.





[GitHub] [spark] HyukjinKwon closed pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


HyukjinKwon closed pull request #34353:
URL: https://github.com/apache/spark/pull/34353


   





[GitHub] [spark] HyukjinKwon commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950042051


   https://github.com/jupyter/drumsticks: I cannot access it 😂. Once it's enabled, you could rebase in this PR. That should kick off the job.





[GitHub] [spark] HyukjinKwon commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950041897


   Merged to master.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34298: [SPARK-34960][SQL] Aggregate push down for ORC

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34298:
URL: https://github.com/apache/spark/pull/34298#issuecomment-950040725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144547/
   





[GitHub] [spark] AmplabJenkins commented on pull request #34298: [SPARK-34960][SQL] Aggregate push down for ORC

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34298:
URL: https://github.com/apache/spark/pull/34298#issuecomment-950040725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144547/
   





[GitHub] [spark] SparkQA removed a comment on pull request #34298: [SPARK-34960][SQL] Aggregate push down for ORC

2021-10-22 Thread GitBox


SparkQA removed a comment on pull request #34298:
URL: https://github.com/apache/spark/pull/34298#issuecomment-949960083


   **[Test build #144547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144547/testReport)** for PR 34298 at commit [`3341440`](https://github.com/apache/spark/commit/334144026416ce81f6e9cfce76b4d5e92a71fa93).





[GitHub] [spark] SparkQA commented on pull request #34298: [SPARK-34960][SQL] Aggregate push down for ORC

2021-10-22 Thread GitBox


SparkQA commented on pull request #34298:
URL: https://github.com/apache/spark/pull/34298#issuecomment-950040509


   **[Test build #144547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144547/testReport)** for PR 34298 at commit [`3341440`](https://github.com/apache/spark/commit/334144026416ce81f6e9cfce76b4d5e92a71fa93).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] ueshin commented on a change in pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


ueshin commented on a change in pull request #34296:
URL: https://github.com/apache/spark/pull/34296#discussion_r734910279



##
File path: dev/lint-python
##
@@ -124,10 +125,66 @@ function pycodestyle_test {
 fi
 }
 
-function mypy_test {
+
+function mypy_annotation_test {
 local MYPY_REPORT=
 local MYPY_STATUS=
 
+echo "starting mypy annotations test..."
+MYPY_REPORT=$( ($MYPY_BUILD \
+  --config-file python/mypy.ini \
+  --cache-dir /tmp/.mypy_cache/ \
+  python/pyspark) 2>&1)
+MYPY_STATUS=$?
+
+if [ "$MYPY_STATUS" -ne 0 ]; then
+echo "annotations failed mypy checks:"
+echo "$MYPY_REPORT"
+echo "$MYPY_STATUS"
+exit "$MYPY_STATUS"
+else
+echo "annotations passed mypy checks."
+echo
+fi
+}
+
+
+function mypy_data_test {
+local PYTEST_REPORT=
+local PYTEST_STATUS=
+
+echo "starting mypy data test..."
+
+if [ "$(pip freeze | grep -c pytest-mypy-plugins )" -eq 0 ]; then
+  echo "pytest-mypy-plugins missing. Skipping for now."
+  return
+fi
+
+PYTEST_REPORT=$( (MYPYPATH=python $PYTEST_BUILD \
+  -c python/pyproject.toml \
+  --rootdir python \
+  --mypy-only-local-stub \
+  --mypy-ini-file python/mypy.ini \
+  python/pyspark ) 2>&1)
+
+PYTEST_STATUS=$?
+
+if [ "$PYTEST_STATUS" -ne 0 ]; then
+echo "annotations failed data checks:"
+echo "$PYTEST_REPORT"
+echo "$PYTEST_STATUS"
+exit "$PYTEST_STATUS"
+else
+  echo "annotations passed data checks."
+  echo
+fi
+}
+
+
+function mypy_test {
+local PYTEST_REPORT=
+local PYTEST_STATUS=

Review comment:
   We don't need these?







[GitHub] [spark] wangyum commented on pull request #34367: [SPARK-37099][SQL] Impl a rank-based filter to optimize top-k computation

2021-10-22 Thread GitBox


wangyum commented on pull request #34367:
URL: https://github.com/apache/spark/pull/34367#issuecomment-950039381


   cc @opensky142857





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-948391980


   Can one of the admins verify this patch?





[GitHub] [spark] SparkQA commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


SparkQA commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950038749


   **[Test build #144550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144550/testReport)** for PR 34353 at commit [`4003f0c`](https://github.com/apache/spark/commit/4003f0c01f769e48b98573247439d1d15248d082).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950038360


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144548/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950038362


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144549/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950038360


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144548/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950038362


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144549/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Bidek56 commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


Bidek56 commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950038236


   I did enable GA, but after the failure I'm not sure how to rerun the action, 
besides using an API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Bidek56 commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


Bidek56 commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950037941


   > The https://github.com/jupyter/drumsticks repo seems private. Okay, if 
something is affected, then it's fine - I asked because these Java versions in 
the description are not used by CRAN and are rather just metadata, to my knowledge.
   
   It's a public [repo](https://github.com/jupyter/docker-stacks/issues/1498)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Bidek56 edited a comment on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


Bidek56 edited a comment on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950037038


   > I agree with this change and LGTM, but I would like to understand how this 
change interacts with something external.
   
   This [issue](https://github.com/jupyter/docker-stacks/issues/1498) is 
affected by the Java 11 limit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950037653


   Mind enabling GitHub Actions in your forked repository? Apache Spark 
leverages contributors' resources in their forked repositories for their PRs 
(https://github.com/apache/spark/pull/34371/checks?check_run_id=3977747161).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


SparkQA removed a comment on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950004282


   **[Test build #144549 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144549/testReport)**
 for PR 34297 at commit 
[`34559c9`](https://github.com/apache/spark/commit/34559c901b93221f2866030248fcd1f3dfa82e0d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


SparkQA removed a comment on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-95966


   **[Test build #144548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144548/testReport)**
 for PR 34296 at commit 
[`cf8e27c`](https://github.com/apache/spark/commit/cf8e27c53efd6623ed7a1dfa6fa7068c5cc98678).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950037574


   The https://github.com/jupyter/drumsticks repo seems private. Okay, if something 
is affected, then it's fine - I asked because these Java versions in the description 
are not used by CRAN and are rather just metadata, to my knowledge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34359: [SPARK-36986][SQL] Improving external schema management flexibility on DataSet and StructType

2021-10-22 Thread GitBox


HyukjinKwon commented on a change in pull request #34359:
URL: https://github.com/apache/spark/pull/34359#discussion_r734908797



##
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##
@@ -511,6 +511,20 @@ class SparkSession private(
 createDataset(data.asScala.toSeq)
   }
 
+  /**
+   * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This 
method allows
+   * the caller to create the InternalRow set externally, as well as define the 
schema externally.
+   *
+   * @since 3.3.0
+   */
+  def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {

Review comment:
   My concern is that `InternalRow` is not supposed to be exposed to end 
users.
   
   Maybe we should expose `Dataset.ofRows` as a developer API, and do it like:
   
   ```scala
   Dataset.ofRows(
 spark,
 org.apache.spark.sql.execution.LogicalRDD(
   df.queryExecution.executedPlan.output, df.queryExecution.toRdd)(spark))
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Bidek56 commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


Bidek56 commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950037038


   > I agree with this change and LGTM, but I would like to understand how this 
change interacts with something external.
   
   This [issue](https://github.com/jupyter/drumsticks/issues/1498) is affected 
by the Java 11 limit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #34353: [SPARK-37084][SQL] Set spark.sql.files.openCostInBytes to bytesConf

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34353:
URL: https://github.com/apache/spark/pull/34353#issuecomment-950035199


   ok to test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #34292: [SPARK-37017][SQL] Reduce the scope of synchronized to prevent potential deadlock

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34292:
URL: https://github.com/apache/spark/pull/34292#issuecomment-950035138


   Thanks @baibaichen and @chenzhx for addressing my comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon edited a comment on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon edited a comment on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950034756


   I agree with this change and LGTM, but I would like to understand how this 
change interacts with something external.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #34371:
URL: https://github.com/apache/spark/pull/34371#issuecomment-950034756


   I agree with this change and LGTM, but I would like to understand how this 
change interacts with what.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a change in pull request #34371: [SPARK-37091][R] Upgrading SystemRequirements to include Java <= 17

2021-10-22 Thread GitBox


HyukjinKwon commented on a change in pull request #34371:
URL: https://github.com/apache/spark/pull/34371#discussion_r734906656



##
File path: R/pkg/DESCRIPTION
##
@@ -13,7 +13,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = "aut",
 License: Apache License (== 2.0)
 URL: https://www.apache.org https://spark.apache.org
 BugReports: https://spark.apache.org/contributing.html
-SystemRequirements: Java (>= 8, < 12)
+SystemRequirements: Java (>= 8, <= 17)

Review comment:
   Hey, I would like to double-check this. Why does this matter? What 
issue did you face because of this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #25490: [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

2021-10-22 Thread GitBox


HyukjinKwon commented on pull request #25490:
URL: https://github.com/apache/spark/pull/25490#issuecomment-950034587


   @Bidek56 why does the DESCRIPTION matter? It's just metadata. Also, JDK 17 
isn't fully supported yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


SparkQA commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950034494


   **[Test build #144548 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144548/testReport)**
 for PR 34296 at commit 
[`cf8e27c`](https://github.com/apache/spark/commit/cf8e27c53efd6623ed7a1dfa6fa7068c5cc98678).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


SparkQA commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950033141


   **[Test build #144549 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144549/testReport)**
 for PR 34297 at commit 
[`34559c9`](https://github.com/apache/spark/commit/34559c901b93221f2866030248fcd1f3dfa82e0d).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950029775


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49020/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950029774


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950029775


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49020/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950029774


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


SparkQA commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950026540


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


SparkQA commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950026171


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49020/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #32666: [SPARK-30696][SQL] fromUTCtime and toUTCtime produced wrong result on Daylight Saving Time changes days

2021-10-22 Thread GitBox


github-actions[bot] commented on pull request #32666:
URL: https://github.com/apache/spark/pull/32666#issuecomment-950022442


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


SparkQA commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950016605


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49020/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


SparkQA commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-950014277


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49019/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34297: [WIP][SPARK-37022][PYTHON] Use black as a formatter for PySpark

2021-10-22 Thread GitBox


SparkQA commented on pull request #34297:
URL: https://github.com/apache/spark/pull/34297#issuecomment-950004282


   **[Test build #144549 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144549/testReport)**
 for PR 34297 at commit 
[`34559c9`](https://github.com/apache/spark/commit/34559c901b93221f2866030248fcd1f3dfa82e0d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


SparkQA commented on pull request #34296:
URL: https://github.com/apache/spark/pull/34296#issuecomment-95966


   **[Test build #144548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144548/testReport)**
 for PR 34296 at commit 
[`cf8e27c`](https://github.com/apache/spark/commit/cf8e27c53efd6623ed7a1dfa6fa7068c5cc98678).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34296: [SPARK-36989][TESTS][PYTHON] Add type hints data tests

2021-10-22 Thread GitBox


zero323 commented on a change in pull request #34296:
URL: https://github.com/apache/spark/pull/34296#discussion_r734874865



##
File path: dev/lint-python
##
@@ -124,39 +125,76 @@ function pycodestyle_test {
 fi
 }
 
-function mypy_test {
-local MYPY_REPORT=
-local MYPY_STATUS=
 
-if ! hash "$MYPY_BUILD" 2> /dev/null; then
-echo "The $MYPY_BUILD command was not found. Skipping for now."
-return
+function mypy_annotation_test {

Review comment:
   Yes, that's correct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kazuyukitanimura commented on pull request #33930: [SPARK-36665][SQL] Add more Not operator simplifications

2021-10-22 Thread GitBox


kazuyukitanimura commented on pull request #33930:
URL: https://github.com/apache/spark/pull/33930#issuecomment-949987313


   Hi @sunchao, it would be great if you could take one more look when you get a 
chance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 edited a comment on pull request #34354: [WIP][SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2021-10-22 Thread GitBox


zero323 edited a comment on pull request #34354:
URL: https://github.com/apache/spark/pull/34354#issuecomment-949961121


   > > I think we might have to redefine `ColumnOrName` to fully support these
   > 
   > What's your idea like?
   
   Long story short, I've been looking into different scenarios for using 
aliases and types. Adding inline hints definitely introduced use cases that 
we didn't have before, most notably `casts` (which further separate into cases 
where we have generics, bound from the function signature, and none of the above). 
And there are variances, which pop up here and there. 
   
   I suspect that some of the cases where invariant generics hit us might be 
addressed with bounded type vars:
   
   ```python
   from typing import overload, List, TypeVar, Union
   from pyspark.sql import Column
   from pyspark.sql.functions import col
   
   ColumnOrName = Union[str, Column]
   ColumnOrName_ = TypeVar("ColumnOrName_", bound=ColumnOrName)
   
   def array(__cols: List[ColumnOrName_]): ...
   
   column_names = ["a", "b", "c"]
   array(column_names)
   
   columns = [col(x) for x in column_names]
   array(columns)
   
   ```
   
   but these are not universal and there might be some caveats that I don't see 
at the moment.
   
   I hope there will be an opportunity to discuss this stuff in a more 
interactive manner.
   
   
   (_Note_: `ColumnOrName` is still needed for `casts` and other annotations in 
contexts where `ColumnOrName_` would be unbound, like functions without 
`ColumnOrName_` in arguments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a change in pull request #34365: [SPARK-37098][SQL] Alter table properties should invalidate cache

2021-10-22 Thread GitBox


sunchao commented on a change in pull request #34365:
URL: https://github.com/apache/spark/pull/34365#discussion_r734851214



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##
@@ -276,6 +276,7 @@ case class AlterTableSetPropertiesCommand(
   properties = table.properties ++ properties,
   comment = 
properties.get(TableCatalog.PROP_COMMENT).orElse(table.comment))
 catalog.alterTable(newTable)
+catalog.invalidateCachedTable(tableName)

Review comment:
   would it be better to do this inside `SessionCatalog.alterTable`?
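
   For illustration, a rough sketch of what that alternative could look like 
(hypothetical only, not the actual `SessionCatalog` source; it simply reuses the 
`invalidateCachedTable` call shown in the diff above):
   
   ```scala
   // Hypothetical sketch: centralize the invalidation in SessionCatalog.alterTable
   // so that every ALTER TABLE code path benefits, not just this command.
   def alterTable(tableDefinition: CatalogTable): Unit = {
     // ... existing validation and metastore update elided ...
     externalCatalog.alterTable(tableDefinition)
     // The same call the PR currently adds in AlterTableSetPropertiesCommand:
     invalidateCachedTable(tableDefinition.identifier)
   }
   ```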




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] c21 commented on a change in pull request #33828: [SPARK-36579][CORE][SQL] Make spark source stagingDir can be customized

2021-10-22 Thread GitBox


c21 commented on a change in pull request #33828:
URL: https://github.com/apache/spark/pull/33828#discussion_r734850205



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -3412,6 +3412,16 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val EXEC_STAGING_DIR = buildConf("spark.sql.exec.stagingDir")
+  .doc("The staging directory of Spark job. Spark uses it to deal with 
files with " +
+"absolute output path, or writing data into partitioned directory when 
" +
+"dynamic partition overwrite mode. " +
+"Default value means staging dir is under table path.")

Review comment:
   nit: `staging dir` -> `staging directory`

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -3412,6 +3412,16 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val EXEC_STAGING_DIR = buildConf("spark.sql.exec.stagingDir")
+  .doc("The staging directory of Spark job. Spark uses it to deal with 
files with " +
+"absolute output path, or writing data into partitioned directory when 
" +

Review comment:
   nit: `when dynamic partition overwrite mode.` -> `when dynamic partition 
overwrite mode is on.` ?

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -3412,6 +3412,16 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val EXEC_STAGING_DIR = buildConf("spark.sql.exec.stagingDir")
+  .doc("The staging directory of Spark job. Spark uses it to deal with 
files with " +
+"absolute output path, or writing data into partitioned directory when 
" +
+"dynamic partition overwrite mode. " +
+"Default value means staging dir is under table path.")
+  .version("3.3.0")
+  .internal()
+  .stringConf
+  .createWithDefault(".spark-staging")

Review comment:
   shall we add a `checkValue()` to check the config value is not empty 
string?
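
   For example, a sketch of that check (assuming the internal 
`TypedConfigBuilder.checkValue(validator, errorMsg)` API; illustrative only, not 
the final diff):
   
   ```scala
   // Sketch: the quoted config with the suggested non-empty validation added.
   val EXEC_STAGING_DIR = buildConf("spark.sql.exec.stagingDir")
     .doc("The staging directory of a Spark job.")
     .version("3.3.0")
     .internal()
     .stringConf
     .checkValue(_.nonEmpty, "The staging directory must not be an empty string.")
     .createWithDefault(".spark-staging")
   ```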




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on a change in pull request #34354: [WIP][SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2021-10-22 Thread GitBox


zero323 commented on a change in pull request #34354:
URL: https://github.com/apache/spark/pull/34354#discussion_r734849802



##
File path: python/pyspark/sql/functions.py
##
@@ -1652,7 +1652,19 @@ def expr(str: str) -> Column:
 return Column(sc._jvm.functions.expr(str))
 
 
+@overload
 def struct(*cols: "ColumnOrName") -> Column:
+...
+
+
+@overload
+def struct(__cols: Union[List["ColumnOrName"], Tuple["ColumnOrName", ...]]) -> 
Column:

Review comment:
   > I know it backfires in some contexts, but maybe not here.
   
   But we'd need explicit checks for strings, like
   
   ```python
   if len(cols) == 1 and isinstance(cols[0], Sequence) and not isinstance(cols[0], str):
       cols = cols[0]
       ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zero323 commented on pull request #34354: [WIP][SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat

2021-10-22 Thread GitBox


zero323 commented on pull request #34354:
URL: https://github.com/apache/spark/pull/34354#issuecomment-949961121


   > > I think we might have to redefine `ColumnOrName` to fully support these
   > 
   > What's your idea like?
   
   Long story short, I've been looking into different scenarios for using 
aliases and types. Adding inline hints definitely introduced use cases that 
we didn't have before, most notably `casts` (which further separate into cases 
where we have generics, bound from the function signature, and none of the above). 
And there are variances, which pop up here and there. 
   
   I suspect that some of the cases where invariant generics hit us might be 
addressed with bounded type vars:
   
   ```python
   from typing import overload, List, TypeVar, Union
   from pyspark.sql import Column
   from pyspark.sql.functions import col
   
   ColumnOrName = Union[str, Column]
   ColumnOrName_ = TypeVar("ColumnOrName_", bound=ColumnOrName)
   
   def array(__cols: List[ColumnOrName_]): ...
   
   column_names = ["a", "b", "c"]
   array(column_names)
   
   columns = [col(x) for x in column_names]
   array(columns)
   
   ```
   
   but these are not universal and there might be some caveats that I don't see 
at the moment.
   
   I hope there will be an opportunity to discuss this stuff in a more 
interactive manner.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #34298: [SPARK-34960][SQL] Aggregate push down for ORC

2021-10-22 Thread GitBox


SparkQA commented on pull request #34298:
URL: https://github.com/apache/spark/pull/34298#issuecomment-949960083


   **[Test build #144547 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144547/testReport)**
 for PR 34298 at commit 
[`3341440`](https://github.com/apache/spark/commit/334144026416ce81f6e9cfce76b4d5e92a71fa93).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -128,15 +139,21 @@ class FileScanRDD(
   // Sets InputFileBlockHolder for the file block's information
   InputFileBlockHolder.set(currentFile.filePath, currentFile.start, 
currentFile.length)
 
+  resetCurrentIterator()
   if (ignoreMissingFiles || ignoreCorruptFiles) {
 currentIterator = new NextIterator[Object] {
   // The readFunction may read some bytes before consuming the 
iterator, e.g.,
-  // vectorized Parquet reader. Here we use lazy val to delay the 
creation of
-  // iterator so that we will throw exception in `getNext`.
-  private lazy val internalIter = readCurrentFile()
+  // vectorized Parquet reader. Here we use a lazily initialized 
variable to delay the
+  // creation of iterator so that we will throw exception in 
`getNext`.
+  private var internalIter: Iterator[InternalRow] = null

Review comment:
   If the downstream operator never pulls any rows from the iterator, then 
the first time we access `internalIter` will be when `close()` is called. If 
`internalIter` is a `lazy val`, this will trigger a call to 
`readCurrentFile()`, which is unnecessary and may throw. Changing 
`internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
   
   Several tests fail without this change, including `AvroV1Suite`.
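
   To make the difference concrete, here is a minimal, self-contained Scala 
illustration (plain Scala, not the actual FileScanRDD code; the names only echo 
the ones above):
   
   ```scala
   // A lazy val is forced by its first access, even if that access only happens
   // in close(); a plain var lets close() skip the work entirely.
   object LazyValVsVar {
     def readCurrentFile(): Iterator[Int] = {
       println("readCurrentFile() called") // expensive, and may throw in the real reader
       Iterator(1, 2, 3)
     }
   
     class WithLazyVal {
       private lazy val internalIter: Iterator[Int] = readCurrentFile()
       // Merely touching internalIter here forces readCurrentFile() to run.
       def close(): Unit = println(s"close(): initialized = ${internalIter != null}")
     }
   
     class WithVar {
       private var internalIter: Iterator[Int] = null
       // With a var, close() on a never-used iterator triggers no read at all.
       def close(): Unit = println(s"close(): initialized = ${internalIter != null}")
     }
   
     def main(args: Array[String]): Unit = {
       new WithLazyVal().close() // prints "readCurrentFile() called" first
       new WithVar().close()     // prints only the close() line
     }
   }
   ```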




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734842173



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -128,15 +139,21 @@ class FileScanRDD(
   // Sets InputFileBlockHolder for the file block's information
   InputFileBlockHolder.set(currentFile.filePath, currentFile.start, 
currentFile.length)
 
+  resetCurrentIterator()
   if (ignoreMissingFiles || ignoreCorruptFiles) {
 currentIterator = new NextIterator[Object] {
   // The readFunction may read some bytes before consuming the 
iterator, e.g.,
-  // vectorized Parquet reader. Here we use lazy val to delay the 
creation of
-  // iterator so that we will throw exception in `getNext`.
-  private lazy val internalIter = readCurrentFile()
+  // vectorized Parquet reader. Here we use a lazily initialized 
variable to delay the
+  // creation of iterator so that we will throw exception in 
`getNext`.
+  private var internalIter: Iterator[InternalRow] = null

Review comment:
   Got it. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -128,15 +139,21 @@ class FileScanRDD(
   // Sets InputFileBlockHolder for the file block's information
   InputFileBlockHolder.set(currentFile.filePath, currentFile.start, 
currentFile.length)
 
+  resetCurrentIterator()
   if (ignoreMissingFiles || ignoreCorruptFiles) {
 currentIterator = new NextIterator[Object] {
   // The readFunction may read some bytes before consuming the 
iterator, e.g.,
-  // vectorized Parquet reader. Here we use lazy val to delay the 
creation of
-  // iterator so that we will throw exception in `getNext`.
-  private lazy val internalIter = readCurrentFile()
+  // vectorized Parquet reader. Here we use a lazily initialized 
variable to delay the
+  // creation of iterator so that we will throw exception in 
`getNext`.
+  private var internalIter: Iterator[InternalRow] = null

Review comment:
   If the downstream operator never pulls any rows from the iterator, then 
the first time we access `internalIter` will be when `close()` is called. If 
`internalIter` is a `lazy val`, this will trigger a call to 
`readCurrentFile()`, which is unnecessary and may throw. Changing 
`internalIter` from a `lazy val` to a `var` lets us avoid this unnecessary call.
   
   Several unit tests fail without this change, including `AvroV1Suite`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734840380



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -128,15 +139,21 @@ class FileScanRDD(
   // Sets InputFileBlockHolder for the file block's information
   InputFileBlockHolder.set(currentFile.filePath, currentFile.start, 
currentFile.length)
 
+  resetCurrentIterator()
   if (ignoreMissingFiles || ignoreCorruptFiles) {
 currentIterator = new NextIterator[Object] {
   // The readFunction may read some bytes before consuming the 
iterator, e.g.,
-  // vectorized Parquet reader. Here we use lazy val to delay the 
creation of
-  // iterator so that we will throw exception in `getNext`.
-  private lazy val internalIter = readCurrentFile()
+  // vectorized Parquet reader. Here we use a lazily initialized 
variable to delay the
+  // creation of iterator so that we will throw exception in 
`getNext`.
+  private var internalIter: Iterator[InternalRow] = null

Review comment:
   If the downstream operator never pulls any rows from the iterator, then 
the first time we access internalIter will be when close() is called. If 
internalIter is a lazy val, this will trigger a call to readCurrentFile(), 
which is unnecessary and may throw. Changing internalIter from a lazy val to a 
var lets us avoid this unnecessary call.
   
   Several unit tests fail without this change, including AvroV1Suite.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34370: [SPARK-37047][SQL][FOLLOWUP] lpad/rpad should fail if parameter str and pad are different types

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34370:
URL: https://github.com/apache/spark/pull/34370#issuecomment-949952367


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144539/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #34370: [SPARK-37047][SQL][FOLLOWUP] lpad/rpad should fail if parameter str and pad are different types

2021-10-22 Thread GitBox


AmplabJenkins commented on pull request #34370:
URL: https://github.com/apache/spark/pull/34370#issuecomment-949952367


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144539/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ankurdave commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


ankurdave commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734822511



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -85,6 +85,17 @@ class FileScanRDD(
   private[this] var currentFile: PartitionedFile = null
   private[this] var currentIterator: Iterator[Object] = null
 
+  private def resetCurrentIterator(): Unit = {
+currentIterator match {
+  case iter: NextIterator[_] =>
+iter.closeIfNeeded()
+  case iter: Closeable =>
+iter.close()
+  case _ => // do nothing

Review comment:
   There are currently two cases aside from null:
   - OrcFileFormat produces an ordinary non-Closeable Iterator due to 
unwrapOrcStructs().
   - The user can create a FileScanRDD with an arbitrary readFunction that does 
not return a Closeable Iterator.
   
   It would be ideal if we could disallow these cases and require the iterator 
to be Closeable, but it seems that would require changing public APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


AmplabJenkins removed a comment on pull request #34369:
URL: https://github.com/apache/spark/pull/34369#issuecomment-949951007


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144544/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a change in pull request #34369: [SPARK-37089][SQL] Do not register ParquetFileFormat completion listener lazily

2021-10-22 Thread GitBox


sunchao commented on a change in pull request #34369:
URL: https://github.com/apache/spark/pull/34369#discussion_r734831796



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
##
@@ -128,15 +139,21 @@ class FileScanRDD(
   // Sets InputFileBlockHolder for the file block's information
   InputFileBlockHolder.set(currentFile.filePath, currentFile.start, 
currentFile.length)
 
+  resetCurrentIterator()
   if (ignoreMissingFiles || ignoreCorruptFiles) {
 currentIterator = new NextIterator[Object] {
   // The readFunction may read some bytes before consuming the 
iterator, e.g.,
-  // vectorized Parquet reader. Here we use lazy val to delay the 
creation of
-  // iterator so that we will throw exception in `getNext`.
-  private lazy val internalIter = readCurrentFile()
+  // vectorized Parquet reader. Here we use a lazily initialized 
variable to delay the
+  // creation of iterator so that we will throw exception in 
`getNext`.
+  private var internalIter: Iterator[InternalRow] = null

Review comment:
   hm why is this change necessary?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


