[GitHub] [spark] AnywalkerGiser commented on a diff in pull request #36537: [SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


AnywalkerGiser commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873359793


##
python/pyspark/tests/test_rdd.py:
##
@@ -669,6 +670,12 @@ def test_sample(self):
 wr_s21 = rdd.sample(True, 0.4, 21).collect()
 self.assertNotEqual(set(wr_s11), set(wr_s21))
 
+def test_datetime(self):

Review Comment:
   It has been added and updated; please review it again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk opened a new pull request, #36558: [SPARK-39187][SQL][3.3] Remove `SparkIllegalStateException`

2022-05-15 Thread GitBox


MaxGekk opened a new pull request, #36558:
URL: https://github.com/apache/spark/pull/36558

   ### What changes were proposed in this pull request?
   Remove `SparkIllegalStateException` and replace it with 
`IllegalStateException` where it was used.
   
   This is a backport of https://github.com/apache/spark/pull/36550.
   
   ### Why are the changes needed?
   To improve code maintenance and be consistent with other places where 
`IllegalStateException` is used for illegal states (for instance, see 
https://github.com/apache/spark/pull/36524). After the PR 
https://github.com/apache/spark/pull/36500, the exception is replaced by 
`SparkException` with the `INTERNAL_ERROR` error class.
   
   ### Does this PR introduce _any_ user-facing change?
   No. Users shouldn't encounter the exception in regular cases.
   
   ### How was this patch tested?
   By running the affected test suites:
   ```
   $ build/sbt "sql/test:testOnly *QueryExecutionErrorsSuite*"
   $ build/sbt "test:testOnly *ArrowUtilsSuite"
   ```
   
   Authored-by: Max Gekk 
   Signed-off-by: Max Gekk 
   (cherry picked from commit 1a90512f605c490255f7b38215c207e64621475b)
   Signed-off-by: Max Gekk 





[GitHub] [spark] AnywalkerGiser closed pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


AnywalkerGiser closed pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] 
Fixed a problem with pyspark serializing pre-1970 datetime in windows
URL: https://github.com/apache/spark/pull/36537





[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873346931


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -211,6 +219,15 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with 
PredicateHelper {
 if projectList.forall(_.deterministic) && 
p.references.subsetOf(right.outputSet) &&
   allDuplicateAgnostic(aggExprs) =>
   a.copy(child = p.copy(child = right))
+
+case p @ Project(_, ExtractEquiJoinKeys(LeftOuter, _, rightKeys, _, _, 
left, right, _))
+if right.distinctKeys.exists(_.subsetOf(ExpressionSet(rightKeys))) &&
+  p.references.subsetOf(left.outputSet) =>
+  p.copy(child = left)

Review Comment:
   For a left outer join with only left-side columns being selected, the join 
can only change the result if we can find more than one matching row on the 
right side. If the right-side join keys are unique, we clearly can't find 
more than one match. So this optimization LGTM.
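
   The argument above can be sketched in plain Python (illustrative only, 
not Spark code; names are made up for the sketch): when only left-side 
columns are projected and the right-side join keys are unique, the left 
outer join emits exactly one output row per left row, so dropping the join 
preserves the result.

   ```python
   # Illustrative sketch, not Spark code: a left outer join followed by a
   # projection of only the left columns.
   def left_outer_join_project_left(left, right_keys):
       """left: list of (key, payload) rows; right_keys: list of join keys."""
       out = []
       for key, payload in left:
           matches = [k for k in right_keys if k == key]
           # One output row per match, or one row (with NULLs on the right,
           # which the projection discards anyway) when there is no match.
           out.extend([(key, payload)] * max(len(matches), 1))
       return out

   left = [(1, "a"), (2, "b"), (3, "c")]
   unique_right_keys = [1, 2]   # unique: at most one match per key
   dup_right_keys = [1, 1, 2]   # duplicates can multiply left rows

   # With unique right keys the join is a no-op for the projected columns.
   assert left_outer_join_project_left(left, unique_right_keys) == left
   # With duplicate right keys the join changes the result.
   assert left_outer_join_project_left(left, dup_right_keys) != left
   ```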






[GitHub] [spark] cloud-fan commented on a diff in pull request #36530: [SPARK-39172][SQL] Remove outer join if all output come from streamed side and buffered side keys exist unique key

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36530:
URL: https://github.com/apache/spark/pull/36530#discussion_r873344595


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala:
##
@@ -139,6 +139,14 @@ object ReorderJoin extends Rule[LogicalPlan] with 
PredicateHelper {
  *   SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1
  * }}}
  *
+ * 3. Remove outer join if all output comes from streamed side and the join 
keys from buffered side
+ * exist unique key.

Review Comment:
   It looks a bit weird to talk about the stream side and buffer side in the 
logical plan phase. Can we explain this optimization in a different way?






[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36295:
URL: https://github.com/apache/spark/pull/36295#discussion_r873341127


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownOffset.java:
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A mix-in interface for {@link ScanBuilder}. Data sources can implement this 
interface to
+ * push down OFFSET. Please note that the combination of OFFSET with other 
operations
+ * such as AGGREGATE, GROUP BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, etc. is 
NOT pushed down.

Review Comment:
   BTW we need to update `ScanBuilder`'s classdoc for the new pushdown support.






[GitHub] [spark] cloud-fan commented on a diff in pull request #36295: [SPARK-38978][SQL] Support push down OFFSET to JDBC data source V2

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36295:
URL: https://github.com/apache/spark/pull/36295#discussion_r873340929


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownOffset.java:
##
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.read;
+
+import org.apache.spark.annotation.Evolving;
+
+/**
+ * A mix-in interface for {@link ScanBuilder}. Data sources can implement this 
interface to
+ * push down OFFSET. Please note that the combination of OFFSET with other 
operations
+ * such as AGGREGATE, GROUP BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, etc. is 
NOT pushed down.

Review Comment:
   I understand that this is copied from other pushdown interfaces, but I find 
it really hard to follow. We can push down OFFSET together with many other 
operators if they follow the operator order defined in `ScanBuilder`'s class doc.






[GitHub] [spark] MaxGekk commented on pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes in the compilation errors of deserializer

2022-05-15 Thread GitBox


MaxGekk commented on PR #36479:
URL: https://github.com/apache/spark/pull/36479#issuecomment-1127239102

   @panbingkun Since this PR modified error classes, could you backport it to 
branch-3.3, please?





[GitHub] [spark] MaxGekk closed pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes in the compilation errors of deserializer

2022-05-15 Thread GitBox


MaxGekk closed pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes 
in the compilation errors of deserializer
URL: https://github.com/apache/spark/pull/36479





[GitHub] [spark] cloud-fan closed pull request #36412: [SPARK-39073][SQL] Keep rowCount after hive table partition pruning if table only have hive statistics

2022-05-15 Thread GitBox


cloud-fan closed pull request #36412: [SPARK-39073][SQL] Keep rowCount after 
hive table partition pruning if table only have hive statistics
URL: https://github.com/apache/spark/pull/36412





[GitHub] [spark] cloud-fan commented on pull request #36412: [SPARK-39073][SQL] Keep rowCount after hive table partition pruning if table only have hive statistics

2022-05-15 Thread GitBox


cloud-fan commented on PR #36412:
URL: https://github.com/apache/spark/pull/36412#issuecomment-1127235625

   thanks, merging to master!





[GitHub] [spark] cloud-fan commented on a diff in pull request #36412: [SPARK-39073][SQL] Keep rowCount after hive table partition pruning if table only have hive statistics

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36412:
URL: https://github.com/apache/spark/pull/36412#discussion_r87309


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala:
##
@@ -80,10 +80,15 @@ private[sql] class PruneHiveTablePartitions(session: 
SparkSession)
   val colStats = filteredStats.map(_.attributeStats.map { case (attr, 
colStat) =>
 (attr.name, colStat.toCatalogColumnStat(attr.name, attr.dataType))
   })
+  val rowCount = if 
(prunedPartitions.forall(_.stats.flatMap(_.rowCount).exists(_ > 0))) {

Review Comment:
   you are right, I misread the code.
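
   The guard in the snippet above keeps an aggregated row count only when 
every pruned partition reports a positive row count. A plain-Python sketch 
of that logic (hypothetical names, not Spark's actual API):

   ```python
   # Hypothetical sketch of the rowCount guard discussed above: sum the
   # per-partition row counts only if every partition has a positive count;
   # otherwise drop the statistic entirely (return None).
   def pruned_row_count(partition_row_counts):
       """partition_row_counts: list of Optional[int], one per pruned partition."""
       if all(rc is not None and rc > 0 for rc in partition_row_counts):
           return sum(partition_row_counts)
       # At least one partition lacks usable stats; a partial sum would be
       # misleading, so keep no rowCount at all.
       return None

   assert pruned_row_count([10, 20, 30]) == 60
   assert pruned_row_count([10, None, 30]) is None
   assert pruned_row_count([10, 0, 30]) is None
   ```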






[GitHub] [spark] MaxGekk closed pull request #36550: [SPARK-39187][SQL] Remove `SparkIllegalStateException`

2022-05-15 Thread GitBox


MaxGekk closed pull request #36550: [SPARK-39187][SQL] Remove 
`SparkIllegalStateException`
URL: https://github.com/apache/spark/pull/36550





[GitHub] [spark] cloud-fan closed pull request #36121: [SPARK-38836][SQL] Improve the performance of ExpressionSet

2022-05-15 Thread GitBox


cloud-fan closed pull request #36121: [SPARK-38836][SQL] Improve the 
performance of ExpressionSet
URL: https://github.com/apache/spark/pull/36121





[GitHub] [spark] MaxGekk commented on pull request #36550: [SPARK-39187][SQL] Remove `SparkIllegalStateException`

2022-05-15 Thread GitBox


MaxGekk commented on PR #36550:
URL: https://github.com/apache/spark/pull/36550#issuecomment-1127234215

   Merging to master. Thank you, @HyukjinKwon and @cloud-fan for review.





[GitHub] [spark] cloud-fan commented on pull request #36121: [SPARK-38836][SQL] Improve the performance of ExpressionSet

2022-05-15 Thread GitBox


cloud-fan commented on PR #36121:
URL: https://github.com/apache/spark/pull/36121#issuecomment-1127234077

   thanks, merging to master!





[GitHub] [spark] AnywalkerGiser commented on pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


AnywalkerGiser commented on PR #36537:
URL: https://github.com/apache/spark/pull/36537#issuecomment-1127233836

   @HyukjinKwon It hasn't been tested on master; I found the problem in 3.0.1, 
and I can test it on master later.





[GitHub] [spark] cloud-fan commented on a diff in pull request #36541: [SPARK-39180][SQL] Simplify the planning of limit and offset

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36541:
URL: https://github.com/apache/spark/pull/36541#discussion_r873317698


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala:
##
@@ -82,52 +82,45 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   object SpecialLimits extends Strategy {
 override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
   case ReturnAnswer(rootPlan) => rootPlan match {
-case Limit(IntegerLiteral(limit), Sort(order, true, child))

Review Comment:
   As I mentioned in the PR description, we don't need to plan 
`TakeOrderedAndProjectExec` under `ReturnAnswer`, as we don't have special 
logic for it. It will still be planned in the normal code path, which is `case 
other => planLater(other) :: Nil`, and we do have a planner rule to match 
`Limit(IntegerLiteral(limit), Sort(order, true, child))`.








[GitHub] [spark] cloud-fan commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-15 Thread GitBox


cloud-fan commented on code in PR #36531:
URL: https://github.com/apache/spark/pull/36531#discussion_r873314783


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:
##
@@ -2117,7 +2265,9 @@ case class Cast(
 child: Expression,
 dataType: DataType,
 timeZoneId: Option[String] = None,
-override val ansiEnabled: Boolean = SQLConf.get.ansiEnabled)
+override val ansiEnabled: Boolean = SQLConf.get.ansiEnabled,
+fallbackConfKey: String = SQLConf.ANSI_ENABLED.key,
+fallbackConfValue: String = "false")

Review Comment:
   Can we make it an abstract class so that implementations can override it? 
I'm really worried about changing the class constructor, as many Spark plugins 
use `Cast.apply/unapply`.






[GitHub] [spark] gengliangwang commented on a diff in pull request #36557: [SPARK-39190][SQL] Provide query context for decimal precision overflow error when WSCG is off

2022-05-15 Thread GitBox


gengliangwang commented on code in PR #36557:
URL: https://github.com/apache/spark/pull/36557#discussion_r873307369


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/decimalExpressions.scala:
##
@@ -128,7 +128,7 @@ case class PromotePrecision(child: Expression) extends 
UnaryExpression {
 case class CheckOverflow(

Review Comment:
   Note: we need to change `CheckOverflowInSum` as well. However, its error 
context is actually empty even when WSCG is available. 
   I need more time for that. I am submitting this one to catch the Spark 3.3 
RC2, which is happening soon. 
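
   For context, `CheckOverflow` enforces a decimal's declared precision and 
scale at runtime. A rough plain-Python analogue (not Spark's implementation; 
the function name and `nullOnOverflow` behavior are illustrative):

   ```python
   # Rough analogue of a decimal precision-overflow check: a value of type
   # Decimal(precision, scale) overflows when, after rounding to `scale`
   # fractional digits, it needs more than `precision` total digits.
   from decimal import Decimal, ROUND_HALF_UP

   def check_overflow(value: Decimal, precision: int, scale: int):
       """Round to `scale` digits; return None on overflow (mirroring
       nullOnOverflow=true -- ANSI mode would raise an error instead)."""
       rounded = value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
       if len(rounded.as_tuple().digits) > precision:
           return None
       return rounded

   assert check_overflow(Decimal("12.345"), precision=5, scale=2) == Decimal("12.35")
   assert check_overflow(Decimal("12345.6"), precision=5, scale=2) is None
   ```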






[GitHub] [spark] gengliangwang opened a new pull request, #36557: [SPARK-39190][SQL] Provide query context for decimal precision overflow error when WSCG is off

2022-05-15 Thread GitBox


gengliangwang opened a new pull request, #36557:
URL: https://github.com/apache/spark/pull/36557

   
   
   ### What changes were proposed in this pull request?
   
   Similar to https://github.com/apache/spark/pull/36525, this PR provides 
query context for decimal precision overflow error when WSCG is off
   
   ### Why are the changes needed?
   
   Enhance the runtime error query context for decimal overflow checks. After 
the changes, it works when whole-stage codegen is not available.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT





[GitHub] [spark] AnywalkerGiser commented on a diff in pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


AnywalkerGiser commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873305033


##
python/pyspark/sql/types.py:
##
@@ -191,14 +191,25 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = 0.0
+try:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
+except:

Review Comment:
   Sure, I'll change the test again.






[GitHub] [spark] AngersZhuuuu commented on pull request #36056: [SPARK-36571][SQL] Add an SQLOverwriteHadoopMapReduceCommitProtocol to support all SQL overwrite write data to staging dir

2022-05-15 Thread GitBox


AngersZhuuuu commented on PR #36056:
URL: https://github.com/apache/spark/pull/36056#issuecomment-1127179471

   Gentle ping @cloud-fan, could you take a look?





[GitHub] [spark] AngersZhuuuu commented on pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration

2022-05-15 Thread GitBox


AngersZhuuuu commented on PR #35799:
URL: https://github.com/apache/spark/pull/35799#issuecomment-1127178691

   Any more suggestions?





[GitHub] [spark] HyukjinKwon commented on pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


HyukjinKwon commented on PR #36537:
URL: https://github.com/apache/spark/pull/36537#issuecomment-1127177497

   @AnywalkerGiser mind creating a PR against `master` branch?





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


HyukjinKwon commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873298180


##
python/pyspark/sql/types.py:
##
@@ -191,14 +191,25 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = 0.0
+try:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
+except:

Review Comment:
   Can we do this with an if-else based on the OS and a negative-value check?
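
   A hedged sketch of what that suggestion could look like (this is not the 
actual PySpark patch; the function name `to_seconds` and the exact fallback 
arithmetic are assumptions). On Windows, `time.mktime()` raises for pre-1970 
local datetimes, so the pre-epoch case is computed arithmetically instead of 
caught as an exception; DST is ignored here for simplicity:

   ```python
   # Hedged sketch, not the actual PySpark fix: branch on the OS and on
   # pre-epoch values instead of relying on exception handling.
   import calendar
   import datetime
   import platform
   import time

   def to_seconds(dt: datetime.datetime) -> float:
       if dt.tzinfo:
           # Timezone-aware: convert the UTC time tuple directly.
           return calendar.timegm(dt.utctimetuple())
       if platform.system() == "Windows" and dt < datetime.datetime(1970, 1, 1):
           # Windows mktime() fails for pre-epoch values: compute the offset
           # from the epoch explicitly (timedelta handles negatives fine),
           # then shift from local time to UTC. Ignores DST for simplicity.
           epoch = datetime.datetime(1970, 1, 1)
           return (dt - epoch).total_seconds() + time.timezone
       return time.mktime(dt.timetuple())

   # A pre-epoch naive datetime yields a negative offset from the epoch.
   assert to_seconds(datetime.datetime(1969, 12, 31)) < 0
   ```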






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


HyukjinKwon commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873297988


##
python/pyspark/tests/test_rdd.py:
##
@@ -669,6 +670,12 @@ def test_sample(self):
 wr_s21 = rdd.sample(True, 0.4, 21).collect()
 self.assertNotEqual(set(wr_s11), set(wr_s21))
 
+def test_datetime(self):

Review Comment:
   Should probably add a comment like:
   
   ```
   SPARK-39176: ... 
   ```
   
   See also https://spark.apache.org/contributing.html






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


HyukjinKwon commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873297660


##
python/pyspark/sql/types.py:
##
@@ -191,14 +191,25 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = 0.0
+try:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
+except:

Review Comment:
   I think we'd better not rely on exception handling for the regular data-parsing path.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


HyukjinKwon commented on code in PR #36537:
URL: https://github.com/apache/spark/pull/36537#discussion_r873297554


##
python/pyspark/sql/types.py:
##
@@ -191,14 +191,25 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = 0.0
+try:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
+except:
+# On Windows, the current value is converted to a timestamp when the current value is less than 1970
+seconds = (dt - datetime.datetime.fromtimestamp(int(time.localtime(0).tm_sec) / 1000)).total_seconds()

Review Comment:
   IIRC the pre-1970 handling issue is not an OS-specific problem. It would be great if you could link some reported issues related to it.
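
   For reference, a timezone-aware epoch subtraction handles values on both sides of 1970 with no OS-specific branches (a sketch assuming naive inputs are UTC, which is not what `toInternal` assumes for naive local times):

   ```python
   import datetime

   EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)

   def seconds_since_epoch(dt):
       # Aware subtraction yields negative values for pre-1970 datetimes
       # without any platform-specific code path.
       if dt.tzinfo is None:
           dt = dt.replace(tzinfo=datetime.timezone.utc)  # assumption: naive == UTC
       return (dt - EPOCH).total_seconds()
   ```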






[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #36550: [SPARK-39187][SQL] Remove `SparkIllegalStateException`

2022-05-15 Thread GitBox


AngersZhuuuu commented on code in PR #36550:
URL: https://github.com/apache/spark/pull/36550#discussion_r873294811


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -582,8 +582,8 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
   |in operator ${operator.simpleString(SQLConf.get.maxToStringFields)}
 """.stripMargin)
 
-  case _: UnresolvedHint =>
-throw QueryExecutionErrors.logicalHintOperatorNotRemovedDuringAnalysisError
+  case _: UnresolvedHint => throw new IllegalStateException(
+"Logical hint operator should be removed during analysis.")

Review Comment:
   How about
   ```
case _: UnresolvedHint =>
   throw new IllegalStateException("Logical hint operator should be removed during analysis.")
   ```
   






[GitHub] [spark] beliefer opened a new pull request, #36556: [SPARK-39162][SQL][3.3] Jdbc dialect should decide which function could be pushed down

2022-05-15 Thread GitBox


beliefer opened a new pull request, #36556:
URL: https://github.com/apache/spark/pull/36556

   ### What changes were proposed in this pull request?
   This PR backports https://github.com/apache/spark/pull/36521 to 3.3.
   
   
   ### Why are the changes needed?
   Make function push-down more flexible.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   Exists tests.
   





[GitHub] [spark] AnywalkerGiser commented on pull request #36537: [SPARK-39176][PYSPARK][WINDOWS] Fixed a problem with pyspark serializing pre-1970 datetime in windows

2022-05-15 Thread GitBox


AnywalkerGiser commented on PR #36537:
URL: https://github.com/apache/spark/pull/36537#issuecomment-1127149821

   Is there a maintainer who can approve this?





[GitHub] [spark] beliefer commented on pull request #36521: [SPARK-39162][SQL] Jdbc dialect should decide which function could be pushed down

2022-05-15 Thread GitBox


beliefer commented on PR #36521:
URL: https://github.com/apache/spark/pull/36521#issuecomment-1127146479

   @cloud-fan @huaxingao Thank you a lot! I will create a backport to 3.3.





[GitHub] [spark] beliefer closed pull request #36520: [SPARK-38633][SQL] Support push down AnsiCast to JDBC data source V2

2022-05-15 Thread GitBox


beliefer closed pull request #36520: [SPARK-38633][SQL] Support push down 
AnsiCast to JDBC data source V2
URL: https://github.com/apache/spark/pull/36520





[GitHub] [spark] beliefer commented on a diff in pull request #36531: [SPARK-39171][SQL] Unify the Cast expression

2022-05-15 Thread GitBox


beliefer commented on code in PR #36531:
URL: https://github.com/apache/spark/pull/36531#discussion_r873277888


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala:
##
@@ -275,6 +376,53 @@ object Cast {
   case _ => null
 }
   }
+
+  // Show suggestion on how to complete the disallowed explicit casting with built-in type
+  // conversion functions.
+  private def suggestionOnConversionFunctions (
+  from: DataType,
+  to: DataType,
+  functionNames: String): String = {
+// scalastyle:off line.size.limit
+s"""cannot cast ${from.catalogString} to ${to.catalogString}.
+   |To convert values from ${from.catalogString} to ${to.catalogString}, you can use $functionNames instead.
+   |""".stripMargin
+// scalastyle:on line.size.limit
+  }
+
+  def typeCheckFailureMessage(
+  from: DataType,
+  to: DataType,
+  fallbackConfKey: Option[String],
+  fallbackConfValue: Option[String]): String =
+(from, to) match {
+  case (_: NumericType, TimestampType) =>
+suggestionOnConversionFunctions(from, to,
+  "functions TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS")
+
+  case (TimestampType, _: NumericType) =>
+suggestionOnConversionFunctions(from, to, "functions UNIX_SECONDS/UNIX_MILLIS/UNIX_MICROS")
+
+  case (_: NumericType, DateType) =>
+suggestionOnConversionFunctions(from, to, "function DATE_FROM_UNIX_DATE")
+
+  case (DateType, _: NumericType) =>
+suggestionOnConversionFunctions(from, to, "function UNIX_DATE")
+
+  // scalastyle:off line.size.limit
+  case _ if fallbackConfKey.isDefined && fallbackConfValue.isDefined && Cast.canCast(from, to) =>
+s"""
+   | cannot cast ${from.catalogString} to ${to.catalogString} with ANSI mode on.
+   | If you have to cast ${from.catalogString} to ${to.catalogString}, you can set ${fallbackConfKey.get} as ${fallbackConfValue.get}.
+   |""".stripMargin
+  // scalastyle:on line.size.limit
+
+  case _ => s"cannot cast ${from.catalogString} to ${to.catalogString}"
+}
+
+  def ansiCast(child: Expression, dataType: DataType, timeZoneId: Option[String] = None): Cast =
+Cast(child, dataType, timeZoneId, true,
+  SQLConf.STORE_ASSIGNMENT_POLICY.key, SQLConf.StoreAssignmentPolicy.LEGACY.toString)
 }
 
 abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression with NullIntolerant {

Review Comment:
   Yes. `TryCast` extends this parent class too.






[GitHub] [spark] LuciferYang commented on pull request #36515: [SPARK-39156][SQL] Clean up the usage of `ParquetLogRedirector` in `ParquetFileFormat`.

2022-05-15 Thread GitBox


LuciferYang commented on PR #36515:
URL: https://github.com/apache/spark/pull/36515#issuecomment-1127140077

   thanks @huaxingao @sunchao 





[GitHub] [spark] zhengruifeng commented on pull request #36555: [SPARK-39189][PYTHON] Support limit_area parameter in pandas API on Spark

2022-05-15 Thread GitBox


zhengruifeng commented on PR #36555:
URL: https://github.com/apache/spark/pull/36555#issuecomment-1127136933

   @HyukjinKwon Sure! Will update it soon.





[GitHub] [spark] beobest2 commented on pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the the pandas API support list

2022-05-15 Thread GitBox


beobest2 commented on PR #36509:
URL: https://github.com/apache/spark/pull/36509#issuecomment-1127127677

   @bjornjorgensen  Seems like a good idea! I can simply add a column to display parameters that only exist in pandas. However, we need to discuss whether it meets the intent of the document: is there any chance of confusing pandas users? cc. @HyukjinKwon @Yikun @xinrong-databricks





[GitHub] [spark] HyukjinKwon commented on pull request #36555: [SPARK-39189][PYTHON] Support limit_area parameter in pandas API on Spark

2022-05-15 Thread GitBox


HyukjinKwon commented on PR #36555:
URL: https://github.com/apache/spark/pull/36555#issuecomment-1127098019

   @zhengruifeng mind showing the example of this argument usage in the PR 
description?





[GitHub] [spark] HyukjinKwon closed pull request #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve the numerical stability of pandas-on-Spark's skewness

2022-05-15 Thread GitBox


HyukjinKwon closed pull request #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve 
the numerical stability of pandas-on-Spark's skewness
URL: https://github.com/apache/spark/pull/36554





[GitHub] [spark] HyukjinKwon commented on pull request #36554: [SPARK-39186][PYTHON][FOLLOWUP] Improve the numerical stability of pandas-on-Spark's skewness

2022-05-15 Thread GitBox


HyukjinKwon commented on PR #36554:
URL: https://github.com/apache/spark/pull/36554#issuecomment-1127097656

   Merged to master.





[GitHub] [spark] github-actions[bot] closed pull request #35357: [SPARK-21195][CORE] MetricSystem should pick up dynamically registered metrics in sources

2022-05-15 Thread GitBox


github-actions[bot] closed pull request #35357: [SPARK-21195][CORE] 
MetricSystem should pick up dynamically registered metrics in sources
URL: https://github.com/apache/spark/pull/35357





[GitHub] [spark] bjornjorgensen commented on pull request #36509: [SPARK-38961][PYTHON][DOCS] Enhance to automatically generate the the pandas API support list

2022-05-15 Thread GitBox


bjornjorgensen commented on PR #36509:
URL: https://github.com/apache/spark/pull/36509#issuecomment-1127032945

   Yes, very good.
   I was thinking that the pandas API on Spark has some more options than pandas has.
   For example, to_json() has `ignoreNullFields=True` and `num_files=1`.
   
   Can we add another column for these extra options?





[GitHub] [spark] tiagovrtr commented on pull request #33675: [SPARK-27997][K8S] Add support for kubernetes OAuth Token refresh

2022-05-15 Thread GitBox


tiagovrtr commented on PR #33675:
URL: https://github.com/apache/spark/pull/33675#issuecomment-1126996196

   This patch seems to only bring in the latest changes from master. Is there anything else to do here?





[GitHub] [spark] mridulm commented on a diff in pull request #36512: [SPARK-39152][CORE] Deregistering disk persisted local RDD blocks in case of IO related errors

2022-05-15 Thread GitBox


mridulm commented on code in PR #36512:
URL: https://github.com/apache/spark/pull/36512#discussion_r873204359


##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -933,10 +933,29 @@ private[spark] class BlockManager(
   })
   Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
 } else if (level.useDisk && diskStore.contains(blockId)) {
-  try {
-val diskData = diskStore.getBytes(blockId)
-val iterToReturn: Iterator[Any] = {
-  if (level.deserialized) {
+  var retryCount = 0
+  val retryLimit = 3

Review Comment:
   My main concern is that, with bad disks etc., reads can incur an inordinate delay (various layers further down retry and try to recover), so a read that should typically take a few ms can stretch into minutes or more; hence I want to understand how to estimate/configure this.
   
   One option is to make it a private config and make it user configurable, with 3 (or 2?) as the default. Thoughts?
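
   The bounded-retry idea can be sketched as follows (Python pseudocode for the Scala change; `retry_limit` stands in for the proposed private config, default 3):

   ```python
   def read_block_with_retry(read_fn, retry_limit=3):
       # Retry IO-related read failures a bounded number of times, so reads
       # on slow or bad disks fail fast instead of retrying indefinitely.
       last_err = None
       for _ in range(retry_limit):
           try:
               return read_fn()
           except OSError as e:
               last_err = e  # remember the failure, then retry
       raise last_err
   ```

   On exhaustion the last IO error is re-raised, at which point the block could be deregistered as the PR proposes.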






[GitHub] [spark] MaxGekk commented on a diff in pull request #36479: [SPARK-38688][SQL][TESTS] Use error classes in the compilation errors of deserializer

2022-05-15 Thread GitBox


MaxGekk commented on code in PR #36479:
URL: https://github.com/apache/spark/pull/36479#discussion_r873201532


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -147,14 +147,17 @@ object QueryCompilationErrors extends QueryErrorsBase {
   dataType: DataType, desiredType: String): Throwable = {
 val quantifier = if (desiredType.equals("array")) "an" else "a"
 new AnalysisException(
-  s"need $quantifier $desiredType field but got " + dataType.catalogString)
+  errorClass = "UNSUPPORTED_DESERIALIZER",
+  messageParameters =
+Array("DATA_TYPE_MISMATCH", quantifier, desiredType, 
toSQLType(dataType)))

Review Comment:
   Please double-quote `desiredType` and upper-case it for consistency.






[GitHub] [spark] huaxingao commented on pull request #36515: [SPARK-39156][SQL] Clean up the usage of `ParquetLogRedirector` in `ParquetFileFormat`.

2022-05-15 Thread GitBox


huaxingao commented on PR #36515:
URL: https://github.com/apache/spark/pull/36515#issuecomment-1126965449

   Thanks! Merged to master.





[GitHub] [spark] huaxingao closed pull request #36515: [SPARK-39156][SQL] Clean up the usage of `ParquetLogRedirector` in `ParquetFileFormat`.

2022-05-15 Thread GitBox


huaxingao closed pull request #36515: [SPARK-39156][SQL] Clean up the usage of 
`ParquetLogRedirector` in `ParquetFileFormat`.
URL: https://github.com/apache/spark/pull/36515





[GitHub] [spark] zhengruifeng opened a new pull request, #36555: [SPARK-39189][PYTHON] interpolate supports limit_area

2022-05-15 Thread GitBox


zhengruifeng opened a new pull request, #36555:
URL: https://github.com/apache/spark/pull/36555

   ### What changes were proposed in this pull request?
   `interpolate` supports the param `limit_area`.
   
   ### Why are the changes needed?
   To increase API coverage.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, one param is added.
   
   
   ### How was this patch tested?
   Updated UT.
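
   For context, the new param mirrors pandas' own `limit_area` behavior; a quick illustration with plain pandas (not pandas-on-Spark):

   ```python
   import numpy as np
   import pandas as pd

   s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
   # limit_area="inside" restricts filling to NaNs surrounded by valid
   # values, so the leading and trailing NaNs stay untouched.
   filled = s.interpolate(limit_area="inside")
   ```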
   

