[GitHub] [spark] Hisoka-X commented on a diff in pull request #42979: [SPARK-45035][SQL] Fix ignoreCorruptFiles with multiline CSV/JSON will report error
Hisoka-X commented on code in PR #42979: URL: https://github.com/apache/spark/pull/42979#discussion_r1331553477

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:

```diff
@@ -190,12 +191,19 @@ object MultiLineCSVDataSource extends CSVDataSource {
       parsedOptions: CSVOptions): StructType = {
     val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
     csv.flatMap { lines =>
-      val path = new Path(lines.getPath())
-      UnivocityParser.tokenizeStream(
-        CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, path),
-        shouldDropHeader = false,
-        new CsvParser(parsedOptions.asParserSettings),
-        encoding = parsedOptions.charset)
+      try {
+        val path = new Path(lines.getPath())
+        UnivocityParser.tokenizeStream(
+          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, path),
+          shouldDropHeader = false,
+          new CsvParser(parsedOptions.asParserSettings),
+          encoding = parsedOptions.charset)
+      } catch {
+        case e @ (_: RuntimeException | _: IOException) if parsedOptions.ignoreCorruptFiles =>
+          logWarning(
+            s"Skipped the rest of the content in the corrupted file: ${lines.getPath()}", e)
```

Review Comment:

How about changing it to this?

```scala
case e: TextParsingException
    if parsedOptions.ignoreCorruptFiles && e.getCause.getCause.isInstanceOf[EOFException] =>
```

@HyukjinKwon
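For context, here is a minimal sketch of how the narrowed handler would sit in the schema-inference path above. It assumes `TextParsingException` is univocity's `com.univocity.parsers.common.TextParsingException`, that the EOF is always wrapped two causes deep (which the guard itself assumes), and that returning an empty iterator is the right way to skip the file; `tokenize` is a hypothetical stand-in, not code from the PR:

```scala
import java.io.EOFException
import com.univocity.parsers.common.TextParsingException

// Sketch: only swallow the specific parser failure caused by hitting EOF
// mid-record, instead of any RuntimeException or IOException.
csv.flatMap { lines =>
  try {
    tokenize(lines) // stands in for the UnivocityParser.tokenizeStream call above
  } catch {
    case e: TextParsingException
        if parsedOptions.ignoreCorruptFiles &&
          e.getCause.getCause.isInstanceOf[EOFException] =>
      logWarning(s"Skipped the rest of the content in the corrupted file: ${lines.getPath()}", e)
      Iterator.empty // skip the corrupt file (assumed return shape)
  }
}
```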
[GitHub] [spark] HyukjinKwon commented on pull request #42916: [MINOR][DOCS] Fix a typo in HashAggregateExec.scala
HyukjinKwon commented on PR #42916: URL: https://github.com/apache/spark/pull/42916#issuecomment-1727123248

We have our own logic to detect forked repositories' GitHub Actions runs. You would need to go to Settings in your forked repo and enable it. For now, it seems I can't find the GitHub Actions runs in your repo:

![Screenshot 2023-09-20 at 4 25 57 PM](https://github.com/apache/spark/assets/6477701/b552aae8-9445-4225-ba74-cc164332782c)
[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
zhengruifeng commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727400614

`pyspark.ml.connect` only supports a small subset of `pyspark.ml`
[GitHub] [spark] zhengruifeng closed pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn`
zhengruifeng closed pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn` URL: https://github.com/apache/spark/pull/43003
[GitHub] [spark] HyukjinKwon commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column
HyukjinKwon commented on PR #41016: URL: https://github.com/apache/spark/pull/41016#issuecomment-1727125449

@BramBoog it has conflicts against the latest master branch. You would need to resolve the conflicts via `git fetch upstream` and `git rebase upstream/master`.
[GitHub] [spark] zhengruifeng commented on pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn`
zhengruifeng commented on PR #43003: URL: https://github.com/apache/spark/pull/43003#issuecomment-1727125799

thanks, merged to master
[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cloud-fan commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331581159

## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:

```diff
@@ -77,6 +79,11 @@ object SQLExecution {
     }
     val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
     executionIdToQueryExecution.put(executionId, queryExecution)
+    val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+    if (originalInterruptOnCancel == null) {
+      val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+      sc.setInterruptOnCancel(interruptOnCancel)
```

Review Comment:

This has some side effects: if a task fails many times, we will abort its stage and interrupt the other running tasks if this is set to true. However, if the task does file scans, then interrupting the task means we won't trigger the task completion callback that frees file streams. I think this will lead to a memory leak, as runtime execution errors are common, especially with ANSI mode.

How does the Thrift Server leverage this feature?
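The completion callback cloud-fan refers to is the per-task listener that file-based scans register to release resources. A minimal sketch of that pattern, with illustrative names rather than the exact Spark internals:

```scala
import java.io.InputStream
import org.apache.spark.TaskContext

// Sketch of the resource-cleanup pattern used by file scans: the stream is
// closed from a task completion listener. If the task thread is interrupted
// and the listener never fires, the stream leaks, which is the side effect
// described above. (Illustrative only, not the exact Spark internals.)
def openWithCleanup(open: () => InputStream): InputStream = {
  val in = open()
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => in.close())
  }
  in
}
```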
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331134534

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:

```diff
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
+  def this(valueAndList: Seq[Expression]) = {
```

Review Comment:

All right, reverted in https://github.com/apache/spark/pull/42864/commits/8bf64a7f934a537ae00421c54a3b115a25c21168.
[GitHub] [spark] dzhigimont commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on PR #40420: URL: https://github.com/apache/spark/pull/40420#issuecomment-1727680820

> @dzhigimont Can we just make the CI pass for now? I can help in the follow-ups after merging this one.
>
> Seems like the mypy checks is failing for now:
>
> ```
> starting mypy annotations test...
> annotations failed mypy checks:
> python/pyspark/pandas/namespace.py:162: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation  [misc]
> python/pyspark/pandas/indexes/base.py:2075: error: Unused "type: ignore" comment
> python/pyspark/pandas/indexes/base.py:2145: error: Unused "type: ignore" comment
> Found 3 errors in 2 files (checked 703 source files)
> ```
>
> To resolve them:
>
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2075
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2145
> * [ ] Add "# type: ignore" comment to python/pyspark/pandas/namespace.py:162

I can't understand it: when I added the "# type: ignore" comment to python/pyspark/pandas/namespace.py:162, mypy raised an unused "type: ignore" error; when I deleted it, mypy raised `python/pyspark/pandas/namespace.py:161: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation [misc]`. What do you suggest? How can I resolve the problem?
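This mypy error usually means a module-level name gets bound to two different class objects, and the fix mypy hints at is an explicit `Type[...]` annotation rather than an ignore comment. A minimal reproduction and fix with hypothetical names, not the actual `namespace.py` code:

```python
import sys
from typing import Type, Union

class OldRange: ...
class NewRange: ...

# Without the annotation below, mypy rejects rebinding a name to a second
# class object:
#   error: Cannot assign multiple types to name "_range" without an
#   explicit "Type[...]" annotation  [misc]
_range: Type[Union[OldRange, NewRange]] = OldRange
if sys.version_info >= (3, 9):  # stand-in for whatever condition namespace.py uses
    _range = NewRange
```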
[GitHub] [spark] LuciferYang opened a new pull request, #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
LuciferYang opened a new pull request, #43015: URL: https://github.com/apache/spark/pull/43015

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
zhengruifeng commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331147608

## sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

```diff
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)
```

Review Comment:

No, Spark Connect doesn't hit such an issue.
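For context, the star-to-literal rewrite under discussion looks roughly like this; a sketch continuing the logic beyond where the diff is truncated, not the exact code from the PR:

```scala
import org.apache.spark.sql.catalyst.analysis.Star
import org.apache.spark.sql.catalyst.expressions.Literal

// Sketch of the rewrite the truncated diff performs: count(*) has no
// well-defined input column, so it is canonicalized to count(1) before
// the aggregate function is built.
val withoutStar = e.expr match {
  case _: Star => Literal(1) // turn count(*) into count(1)
  case expr => expr
}
```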
[GitHub] [spark] dzhigimont commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1331412813

## python/pyspark/pandas/indexes/datetimes.py:

```diff
@@ -214,28 +215,8 @@ def microsecond(self) -> Index:
         )
         return Index(self.to_series().dt.microsecond)

-    @property
-    def week(self) -> Index:
-        """
-        The week ordinal of the year.
-
-        .. deprecated:: 3.5.0
-        """
-        warnings.warn(
-            "`week` is deprecated in 3.5.0 and will be removed in 4.0.0.",
-            FutureWarning,
-        )
-        return Index(self.to_series().dt.week)
-
-    @property
-    def weekofyear(self) -> Index:
-        warnings.warn(
-            "`weekofyear` is deprecated in 3.5.0 and will be removed in 4.0.0.",
-            FutureWarning,
-        )
-        return Index(self.to_series().dt.weekofyear)
-
-    weekofyear.__doc__ = week.__doc__
+    def isocalendar(self) -> DataFrame:
```

Review Comment:

Added
[GitHub] [spark] panbingkun commented on pull request #42993: [SPARK-45231][INFRA] Remove unrecognized and meaningless command about amm from the GA testing workflow.
panbingkun commented on PR #42993: URL: https://github.com/apache/spark/pull/42993#issuecomment-1727147543

cc @vicennial @dongjoon-hyun
[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement FetchErrorDetails RPC
heyihong commented on code in PR #42377: URL: https://github.com/apache/spark/pull/42377#discussion_r1323060159

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/ClientStreamingQuerySuite.scala:

```diff
@@ -175,6 +175,36 @@ class ClientStreamingQuerySuite extends QueryTest with SQLHelper with Logging {
     }
   }

+  test("throw exception in streaming") {
+    val session = spark
+    import session.implicits._
+
+    val checkForTwo = udf((value: Int) => {
+      if (value == 2) {
+        throw new RuntimeException("Number 2 encountered!")
+      }
+      value
+    })
+
+    val query = spark.readStream
+      .format("rate")
+      .option("rowsPerSecond", "1")
+      .load()
+      .select(checkForTwo($"value").as("checkedValue"))
+      .writeStream
+      .outputMode("append")
+      .format("console")
+      .start()
+
+    val exception = intercept[SparkException] {
+      query.awaitTermination()
+    }
+
+    assert(
+      exception.getCause.getCause.getCause.getMessage
```

Review Comment:

The exception chain, from outermost to root cause:

1. org.apache.spark.SparkException: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 0 in stage 1384542.0 failed 4 times, most recent failure: Lost task 0.3 in stage ...
2. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1384542.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1384542.0 ...
3. org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (`$Lambda$19844/1935914328`: (int) => int).
4. org.apache.spark.SparkException: java.lang.RuntimeException: Number 2 encountered!
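The four layers above are nested `getCause` links. A small helper makes assertions against such chains less brittle than hard-coding `getCause.getCause.getCause`; a sketch, not part of the PR:

```scala
import scala.annotation.tailrec

// Walks a Throwable's cause chain to its deepest cause. Useful when a test
// only cares about the root error ("Number 2 encountered!") rather than the
// exact nesting depth, which can change across server versions.
@tailrec
def rootCause(t: Throwable): Throwable =
  if (t.getCause == null || (t.getCause eq t)) t else rootCause(t.getCause)
```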
[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
beliefer commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1331177757

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:

```diff
@@ -279,7 +283,9 @@ case class StaticInvoke(
     inputTypes: Seq[AbstractDataType] = Nil,
     propagateNull: Boolean = true,
     returnNullable: Boolean = true,
-    isDeterministic: Boolean = true) extends InvokeLike {
+    isDeterministic: Boolean = true,
+    scalarFunctionName: Option[String] = None,
```

Review Comment:

@cloud-fan @sunchao Although adding `ScalarFunction` as a new parameter is convenient, it makes `StaticInvoke` look very strange. We already call `scalarFunc.getClass`, `scalarFunc.resultType()`, `scalarFunc.isResultNullable` and `scalarFunc.isDeterministic` before passing the new `ScalarFunction`.
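The redundancy beliefer describes: by the time `StaticInvoke` is built, every property has already been read off the `ScalarFunction`. A rough sketch of such a call site, modeled loosely on how DS V2 builds magic-method invocations; names and argument shape are illustrative, not the exact Spark code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction

// Illustrative only: the four ScalarFunction properties are extracted up
// front, so additionally threading the ScalarFunction (or its name) through
// StaticInvoke duplicates information the caller already has.
def magicInvoke(scalarFunc: ScalarFunction[_], args: Seq[Expression]): Expression =
  StaticInvoke(
    scalarFunc.getClass,
    scalarFunc.resultType(),
    ScalarFunction.MAGIC_METHOD_NAME,
    args,
    returnNullable = scalarFunc.isResultNullable,
    isDeterministic = scalarFunc.isDeterministic)
```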
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331301998

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:

```diff
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))
```

Review Comment:

fixed in https://github.com/apache/spark/pull/42864/commits/580f97b0e458a87509ccf6882075cf7330062d54
[GitHub] [spark] dzhigimont commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on PR #40420: URL: https://github.com/apache/spark/pull/40420#issuecomment-1727427143

> @dzhigimont Can we just make the CI pass for now? I can help in the follow-ups after merging this one.
>
> Seems like the mypy checks is failing for now:
>
> ```
> starting mypy annotations test...
> annotations failed mypy checks:
> python/pyspark/pandas/namespace.py:162: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation  [misc]
> python/pyspark/pandas/indexes/base.py:2075: error: Unused "type: ignore" comment
> python/pyspark/pandas/indexes/base.py:2145: error: Unused "type: ignore" comment
> Found 3 errors in 2 files (checked 703 source files)
> ```
>
> To resolve them:
>
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2075
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2145
> * [ ] Add "# type: ignore" comment to python/pyspark/pandas/namespace.py:162

Fixed
[GitHub] [spark] WeichenXu123 commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml
WeichenXu123 commented on PR #42382: URL: https://github.com/apache/spark/pull/42382#issuecomment-1727681685

@zhengruifeng Can we make the interface `saveMetadata` support both a `sparkContext` and a `sparkSession` argument? And in the Spark repo, we always pass `sparkSession` as the argument.
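A hedged sketch of the overloading suggested here, with hypothetical signatures; the real `saveMetadata` helpers in spark-ml take more parameters:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical shape: keep a SparkContext overload for source compatibility,
// make the SparkSession overload the real implementation, and always pass a
// SparkSession inside the Spark repo itself.
def saveMetadata(path: String, spark: SparkSession): Unit = {
  // write metadata through the session, e.g. spark.createDataFrame(...).write
}

def saveMetadata(path: String, sc: SparkContext): Unit = {
  // getOrCreate() reuses the session backed by the existing SparkContext
  saveMetadata(path, SparkSession.builder().getOrCreate())
}
```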
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331145071

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:

```diff
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))
```

Review Comment:

Shall we use the DSL to build expressions?
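The DSL referred to here is presumably the Catalyst expression DSL, which would collapse the explicit `And`/`LessThan`/`EqualTo` tree above into operators. A sketch, assuming the test suite can import the DSL:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal

// Same predicate as the explicit And/LessThan/EqualTo tree above, built with
// Catalyst's expression DSL. && associates left, matching And(And(..), ..).
val predicate =
  (Literal(1) < Literal(5)) && (Literal(6) < Literal(7)) && (Literal(0) === Literal(-1))
```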
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331297747

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/ClientStreamingQuerySuite.scala:

```diff
@@ -175,6 +175,37 @@ class ClientStreamingQuerySuite extends QueryTest with SQLHelper with Logging {
     }
   }

+  test("throw exception in streaming") {
+    val session = spark
+    import session.implicits._
+
+    val checkForTwo = udf((value: Int) => {
+      if (value == 2) {
+        throw new RuntimeException("Number 2 encountered!")
+      }
+      value
+    })
+
+    val query = spark.readStream
+      .format("rate")
+      .option("rowsPerSecond", "1")
+      .load()
+      .select(checkForTwo($"value").as("checkedValue"))
+      .writeStream
+      .outputMode("append")
+      .format("console")
+      .start()
+
+    val exception = intercept[SparkException] {
+      query.awaitTermination()
+    }
+
+    assert(
+      exception.getCause.getCause.getCause.getMessage
```

Review Comment:

https://github.com/apache/spark/pull/42377#discussion_r1323060159. I will add the checks.
[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
yaooqinn commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331623060

## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:

```diff
@@ -77,6 +79,11 @@ object SQLExecution {
     }
     val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
     executionIdToQueryExecution.put(executionId, queryExecution)
+    val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+    if (originalInterruptOnCancel == null) {
+      val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+      sc.setInterruptOnCancel(interruptOnCancel)
```

Review Comment:

I don't quite follow: does this PR just leverage a feature that SparkContext already has? Is https://issues.apache.org/jira/browse/HDFS-1208 related?
[GitHub] [spark] amaliujia opened a new pull request, #43010: [WIP]
amaliujia opened a new pull request, #43010: URL: https://github.com/apache/spark/pull/43010

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331143291

## sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:

```diff
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
     override def builder(e: Seq[Expression]): Expression = {
       assert(e.length == 1, "Defined UDF only has one column")
       val expr = e.head
-      assert(expr.resolved, "column should be resolved to use the same type " +
```

Review Comment:

I'm a bit confused about this change. IIUC we always call `builder` with resolved expressions.
[GitHub] [spark] HyukjinKwon opened a new pull request, #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon opened a new pull request, #43013: URL: https://github.com/apache/spark/pull/43013

### What changes were proposed in this pull request?

This PR proposes to add a couple of notes about which modules are supported by Spark Connect.

### Why are the changes needed?

In order for users to explicitly know which ones are supported in PySpark with Spark Connect.

### Does this PR introduce _any_ user-facing change?

Yes, this exposes some notes for the PySpark API with Spark Connect.

### How was this patch tested?

Manually built the site, and checked.

![Screenshot 2023-09-20 at 6 05 54 PM](https://github.com/apache/spark/assets/6477701/64355af1-f8b2-46bf-8b2a-3ea519995272)
![Screenshot 2023-09-20 at 6 05 28 PM](https://github.com/apache/spark/assets/6477701/c6f2af70-ec6a-4644-a848-30eedd4f56cf)
![Screenshot 2023-09-20 at 6 05 32 PM](https://github.com/apache/spark/assets/6477701/b701f5c0-477d-4f7c-9c02-96c307bf14e2)

### Was this patch authored or co-authored using generative AI tooling?

No.
[GitHub] [spark] LuciferYang opened a new pull request, #43008: [SPARK-44113][BUILD] Drop Scala 2.12 Support
LuciferYang opened a new pull request, #43008: URL: https://github.com/apache/spark/pull/43008

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727296619

Build: https://github.com/HyukjinKwon/spark/actions/runs/6245820107/job/16955248336
[GitHub] [spark] HyukjinKwon commented on pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.
HyukjinKwon commented on PR #43002: URL: https://github.com/apache/spark/pull/43002#issuecomment-1727167048

Merged to master.
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331311772

## sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:

```diff
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
     override def builder(e: Seq[Expression]): Expression = {
       assert(e.length == 1, "Defined UDF only has one column")
       val expr = e.head
-      assert(expr.resolved, "column should be resolved to use the same type " +
```

Review Comment:

In https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDFSuite.scala#L41 the `+` is unresolved now.
[GitHub] [spark] dongjoon-hyun commented on pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage
dongjoon-hyun commented on PR #43007: URL: https://github.com/apache/spark/pull/43007#issuecomment-1727060848

Thank you for revising the PR title. Since the `core` module test passed and I verified manually, let me merge this.
[GitHub] [spark] HyukjinKwon closed pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.
HyukjinKwon closed pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0. URL: https://github.com/apache/spark/pull/43002
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331200011

## sql/gen-sql-functions-docs.py:

```diff
@@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs",
```

Review Comment:

Manually checked with

```
ag --scala 'group = \"' sql
```

All groups but `table_funcs` should be here.
[GitHub] [spark] LuciferYang commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
LuciferYang commented on PR #43015: URL: https://github.com/apache/spark/pull/43015#issuecomment-1727645034

cc @dongjoon-hyun FYI. I think this one needs to be backported to branch-3.4 and branch-3.5.
[GitHub] [spark] hvanhovell commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
hvanhovell commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331644650

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -2882,8 +2882,7 @@ object SQLConf {
         "level settings.")
       .version("3.0.0")
       .booleanConf
-      // show full stacktrace in tests but hide in production by default.
-      .createWithDefault(Utils.isTesting)
```

Review Comment:

Can you rebase and show me it is not an issue anymore?
[GitHub] [spark] LuciferYang commented on a diff in pull request #43008: [WIP][SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
LuciferYang commented on code in PR #43008: URL: https://github.com/apache/spark/pull/43008#discussion_r1331207849

## dev/change-scala-version.sh:

```diff
@@ -19,7 +19,7 @@

 set -e

-VALID_VERSIONS=( 2.12 2.13 )
+VALID_VERSIONS=( 2.13 )
```

Review Comment:

No further modifications were made to this file; it just makes Scala 2.12 an invalid option. We can refactor this script in the future when we need to support multiple Scala versions again.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
zhengruifeng commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331149367

## sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

```diff
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)
```

Review Comment:

https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/python/pyspark/sql/connect/functions.py#L1014-L1015

https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala#L391
[GitHub] [spark] zhengruifeng opened a new pull request, #43011: [SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng opened a new pull request, #43011: URL: https://github.com/apache/spark/pull/43011

### What changes were proposed in this pull request?

Add missing function groups to the SQL references:

- xml_funcs
- lambda_funcs
- collection_funcs
- url_funcs
- hash_funcs

Note that this PR doesn't fix `table_funcs`:

1. `gen-sql-functions-docs.py` doesn't work properly with `TableFunctionRegistry`; I took a cursory look but failed to fix it;
2. table functions except `range` (e.g. `explode`) are already contained in `Generator Functions`, so I am not sure we need to show them twice.

### Why are the changes needed?

When referring to the SQL references, I find that many functions are missing from https://spark.apache.org/docs/latest/sql-ref-functions.html.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manual check.

### Was this patch authored or co-authored using generative AI tooling?

No.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331079422

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:

```diff
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
+  def this(valueAndList: Seq[Expression]) = {
```

Review Comment:

I'm not sure if this is useful. `in(a, b, c)` looks pretty weird to me. Can we revert it and treat `def in` as a special case? The SQL parser also treats `IN` as a special case and has dedicated syntax for it.
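For reference, the special-casing suggested here already exists at the API layer: the variadic handling lives on `Column`, so the `In` expression can keep its natural `(value, list)` constructor. A sketch roughly paraphrasing how `Column.isin` is implemented today, not a new design:

```scala
// Sketch: the variadic surface lives on Column, while the In expression
// keeps its (value, list) shape, so no Seq-based constructor is needed.
def isin(list: Any*): Column = withExpr { In(expr, list.map(lit(_).expr)) }
```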
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346

## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala:

```diff
@@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils {
       new SparkArrayIndexOutOfBoundsException(message)),
     errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)),
     errorConstructor((message, cause) => new SparkRuntimeException(message, cause)),
-    errorConstructor((message, cause) => new SparkUpgradeException(message, cause)))
-
-  private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = {
-    val classes =
-      mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]])
+    errorConstructor((message, cause) => new SparkUpgradeException(message, cause)),
+    errorConstructor((message, cause) => new SparkException(message, cause.orNull)))
+
+  /**
+   * errorsToThrowable reconstructs the exception based on a list of protobuf messages
+   * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace
+   * (if set).
+   */
+  private def errorsToThrowable(
+      errorIdx: Int,
+      errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = {
+
+    val error = errors(errorIdx)
+
+    val classHierarchy = error.getErrorTypeHierarchyList.asScala
+
+    val constructor =
+      classHierarchy
+        .flatMap(errorFactory.get)
+        .headOption
+        .getOrElse((message: String, cause: Option[Throwable]) =>
+          new SparkException(
```

Review Comment:

The message pattern is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler.
[GitHub] [spark] cloud-fan closed pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case
cloud-fan closed pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case URL: https://github.com/apache/spark/pull/42971
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
zhengruifeng commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1331636427

## python/pyspark/sql/connect/plan.py:

```diff
@@ -1049,21 +1049,23 @@ def __init__(self, query: str, args: Optional[Union[Dict[str, Any], List]] = Non
         self._query = query
         self._args = args

+    def __to_expr(self, session: "SparkConnectClient", v: Any) -> proto.Expression:
```

Review Comment:

I think we'd better rename it `_to_expr`. [The Python standard](https://peps.python.org/pep-0008/#method-names-and-instance-variables) says:

> Note: there is some controversy about the use of `__names`
>
> In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):
>
> `_single_leading_underscore`: weak "internal use" indicator. E.g. `from M import *` does not import objects whose names start with an underscore.
>
> `single_trailing_underscore_`: used by convention to avoid conflicts with Python keyword, e.g. `tkinter.Toplevel(master, class_='ClassName')`
>
> `__double_leading_underscore`: when naming a class attribute, invokes name mangling (inside class FooBar, `__boo` becomes `_FooBar__boo`; see below).
>
> `__double_leading_and_trailing_underscore__`: "magic" objects or attributes that live in user-controlled namespaces. E.g. `__init__`, `__import__` or `__file__`. Never invent such names; only use them as documented.

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```diff
@@ -1237,13 +1237,23 @@ def test_sql(self):
         self.assertEqual(1, len(pdf.index))

     def test_sql_with_named_args(self):
-        df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
-        df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
+        from pyspark.sql.functions import create_map, lit
+        from pyspark.sql.connect.functions import lit as clit
+        from pyspark.sql.connect.functions import create_map as ccreate_map
```

Review Comment:

https://github.com/apache/spark/blob/8c27de68756d4b0e5940211340a0b323d808aead/python/pyspark/sql/tests/connect/test_connect_basic.py#L78-L79

I think you can use the already imported `SF` and `CF`, to be consistent with other tests.
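A minimal illustration of the PEP 8 distinction quoted above, using a toy class unrelated to the actual `plan.py` code:

```python
class Plan:
    def _to_expr(self, v):   # single leading underscore: "internal use" by convention only
        return v

    def __to_expr(self, v):  # double leading underscore: name-mangled to _Plan__to_expr
        return v


p = Plan()
p._to_expr(1)        # works; the underscore is purely a convention
p._Plan__to_expr(1)  # the mangled name the interpreter actually stores
# p.__to_expr(1)     # AttributeError: mangling hides the unmangled name
```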
[GitHub] [spark] bjornjorgensen commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
bjornjorgensen commented on code in PR #43005: URL: https://github.com/apache/spark/pull/43005#discussion_r1331229836

## .github/workflows/build_coverage.yml:

```diff
@@ -17,7 +17,7 @@
 # under the License.
 #

-name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)"
+name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)"
```

Review Comment:

No: don't change Hadoop from 3 to 17, change JDK 8 to 17.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331213799

## sql/gen-sql-functions-docs.py:

```diff
@@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs", "struct_funcs",
```

Review Comment:

Checked against https://github.com/apache/spark/blob/37ab190dc5bfa59b4e06af9551c35ab179a05733/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L43-L48. Two differences:

1. `table_funcs`: not supported by `gen-sql-functions-docs.py`;
2. `binary_funcs`: I can not find any function using this group.
[GitHub] [spark] dzhigimont commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1331413958

## python/pyspark/pandas/datetimes.py:

```diff
@@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]:  # type: ignore[no-untyped-def
     def nanosecond(self) -> "ps.Series":
         raise NotImplementedError()

-    # TODO(SPARK-42617): Support isocalendar.week and replace it.
-    # See also https://github.com/pandas-dev/pandas/pull/33595.
-    @property
-    def week(self) -> "ps.Series":
+    def isocalendar(self) -> "ps.DataFrame":
         """
-        The week ordinal of the year.
+        Calculate year, week, and day according to the ISO 8601 standard.

-        .. deprecated:: 3.4.0
-        """
-        warnings.warn(
-            "weekofyear and week have been deprecated.",
-            FutureWarning,
-        )
-        return self._data.spark.transform(lambda c: F.weekofyear(c).cast(LongType()))
+        .. versionadded:: 4.0.0

-    @property
-    def weekofyear(self) -> "ps.Series":
-        return self.week
+        Returns
+        -------
+        DataFrame
+            With columns year, week and day.

-    weekofyear.__doc__ = week.__doc__
+            .. note:: Returns have int64 type instead of UInt32 as is in pandas due to UInt32
+                is not supported by spark
+
+        Examples
+        --------
```

Review Comment:

If you give me 1-2 days, I will try to resolve the problem.
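For reference, this is the pandas behavior being ported, including the dtype caveat called out in the docstring. The example below uses plain pandas; the pandas-on-Spark version would return int64 columns instead of pandas' UInt32, since Spark has no unsigned 32-bit integer type:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-01-01", "2023-09-20"]))
cal = s.dt.isocalendar()
print(cal)
#    year  week  day
# 0  2022    52    7
# 1  2023    38    3
print(cal.dtypes)  # UInt32 in pandas; int64 in the pandas-on-Spark port
```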
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in tests. So disable this by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala: ## @@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils { new SparkArrayIndexOutOfBoundsException(message)), errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)), errorConstructor((message, cause) => new SparkRuntimeException(message, cause)), -errorConstructor((message, cause) => new SparkUpgradeException(message, cause))) - - private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = { -val classes = - mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]]) +errorConstructor((message, cause) => new SparkUpgradeException(message, cause)), +errorConstructor((message, cause) => new SparkException(message, cause.orNull))) + + /** + * errorsToThrowable reconstructs the exception based on a list of protobuf messages + * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace + * (if set). + */ + private def errorsToThrowable( + errorIdx: Int, + errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = { + +val error = errors(errorIdx) + +val classHierarchy = error.getErrorTypeHierarchyList.asScala + +val constructor = + classHierarchy +.flatMap(errorFactory.get) +.headOption +.getOrElse((message: String, cause: Option[Throwable]) => + new SparkException( Review Comment: Will change. The message pattern is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
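The lookup pattern discussed in the diff above (walk the server-provided error type hierarchy from most to least specific, take the first registered constructor, and fall back to a generic exception) is easy to see in isolation. A minimal sketch in Python, even though the actual client code is Scala; all names and map contents here are illustrative:
```python
from typing import Callable, Dict, List, Optional

Constructor = Callable[[str, Optional[BaseException]], BaseException]

# Hypothetical registry: server-side class name -> client-side constructor.
error_factory: Dict[str, Constructor] = {
    "java.lang.ArithmeticException": lambda msg, cause: ArithmeticError(msg),
    "org.apache.spark.SparkException": lambda msg, cause: RuntimeError(msg),
}

def reconstruct(class_hierarchy: List[str], message: str) -> BaseException:
    for cls in class_hierarchy:      # most specific class first
        ctor = error_factory.get(cls)
        if ctor is not None:         # first registered constructor wins
            return ctor(message, None)
    return RuntimeError(message)     # generic fallback (SparkException in Scala)

# An unknown subclass still maps to its nearest known parent:
err = reconstruct(
    ["com.example.MySparkError", "org.apache.spark.SparkException"], "boom")
print(type(err).__name__, err)  # RuntimeError boom
```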
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331110271 ## sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala: ## @@ -708,7 +708,7 @@ private[sql] object RelationalGroupedDataset { case expr: NamedExpression => expr case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] => UnresolvedAlias(a, Some(Column.generateAlias)) -case u: UnresolvedFunction => UnresolvedAlias(expr, None) +case e if !e.resolved => UnresolvedAlias(expr, None) Review Comment: I changed the `AggregateExpression` path to `case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias))` but still kept the `case _ if !expr.resolved => UnresolvedAlias(expr, None)` because in the `plus_one(sum_udf(col("v1")))` case an unresolved `PythonUDF` is passed here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()`
MaxGekk commented on PR #42996: URL: https://github.com/apache/spark/pull/42996#issuecomment-1727192557 Merging to master. Thank you, @HyukjinKwon and @cloud-fan for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331113924 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: I tried `self.substring` (see https://github.com/apache/spark/pull/42864/commits/d611fd5a8da8f0f24d54c3933b3cf7e4dbdabe2c) but there is no `self.substring` so if column name doesn't matter then let's use `self.substr`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hdaikoku commented on pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry
hdaikoku commented on PR #42426: URL: https://github.com/apache/spark/pull/42426#issuecomment-1727835491
> I think `SparkUncaughtExceptionHandler` should caught this OOM exception and abort the executor.

I'm not sure if I'm following this. For this particular case, OOM was actually caught at [`AbstractChannelHandlerContext.invokeExceptionCaught`](https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L304). So it was not an uncaught exception.
Also, if I understand correctly from the previous discussion in https://github.com/apache/spark/pull/37779, the problem of `submit()` is that it swallows uncaught exceptions that have been thrown from _inside_ a spawned thread, while, here, the OOM was thrown within `submit()` itself even before a thread was spawned.
> ```
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:719)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
> ```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
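The failure mode described above can be demonstrated in miniature. A small Python analogy (the real case is Java's `ThreadPoolExecutor` throwing `OutOfMemoryError` from `Thread.start()`): `submit()` can raise on the caller's thread before any worker runs, so no thread inside the pool, and no uncaught-exception handler, ever sees it. The shutdown below is just a stand-in for resource exhaustion at submit time:
```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
executor.shutdown()  # stand-in for "cannot spawn a thread" at submit time

try:
    executor.submit(print, "initiating retry")
except RuntimeError as e:
    # Without a catch like this around the retry submission, the caller would
    # never learn that the retry was never scheduled, and could hang forever.
    print("Failed to initiate retry:", e)
```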
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331113924 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: I tried `self.substring` (see https://github.com/apache/spark/pull/42864/commits/d611fd5a8da8f0f24d54c3933b3cf7e4dbdabe2c) but there is no `self.substring` so reverted it in https://github.com/apache/spark/pull/42864/commits/6bd03d11b4d345ef0d571976ab80a99002a7ca75. If column name doesn't matter then let's use `self.substr`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions
zhengruifeng opened a new pull request, #43012: URL: https://github.com/apache/spark/pull/43012
### What changes were proposed in this pull request?
Refine DocString of `regr_*` functions
### Why are the changes needed?
fix the wildcard import
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
No
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The change is rebased already. I reenabled the pyspark jvm flag for tests but disabled this flag in a specific test case instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cxzl25 commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cxzl25 commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331640480
## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel)
Review Comment: On whether interrupting tasks can cause problems: we've encountered these two issues in our production environment before.
HDFS-10468. HDFS read ends up ignoring an interrupt https://issues.apache.org/jira/browse/HDFS-10468
HDFS-10508. DFSInputStream should set thread's interrupt status after catching InterruptException from sleep https://issues.apache.org/jira/browse/HDFS-10508
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer
LuciferYang commented on PR #42908: URL: https://github.com/apache/spark/pull/42908#issuecomment-1727552095 Thanks @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331251817 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: It is disabled by default in python connect tests already: https://github.com/apache/spark/blob/c89221b02bb3000f707a31322e6d40b561e527bd/python/pyspark/testing/connectutils.py#L173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wankunde opened a new pull request, #43009: [SPARK-45230][SQL] Plan sorter for Aggregate after SMJ
wankunde opened a new pull request, #43009: URL: https://github.com/apache/spark/pull/43009
### What changes were proposed in this pull request?
This PR could be a followup of https://github.com/apache/spark/pull/42488 and https://github.com/apache/spark/pull/42557.
If there is an aggregate operator after the SMJ and the grouping expressions of the aggregate operator contain all the join keys of the streamed side, we can add a sorter on the streamed side of the SMJ, so that the aggregate can be converted to a SortAggregate, which will be faster than a HashAggregate.
For example, with table t1(a, b, c) and t2(x, y, z):
```
SELECT a, b, sum(c) FROM t1 JOIN t2 ON t1.b = t2.y GROUP BY a, b
```
The optimized plan:
```
Scan(t1)        Scan(t2)
    |               |
    |               |
Exchange 1      Exchange 2
     \             /
      \           /
       \         /
    SMJ (t1.b = t2.y)
           |
           |
       Aggregate
```
Before this PR, Spark's EnsureRequirements will add Sort(t1.b) to the left side of the SMJ.
```
Scan(t1)        Scan(t2)
    |               |
    |               |
Exchange 1      Exchange 2
    |               |
Sort(t1.b)      Sort(t2.y)
     \             /
    SMJ (t1.b = t2.y)
           |
           |
     HashAggregate
```
If we add a Sort(t1.b, t1.a) to the left side of the SMJ, the following aggregate can be converted to a SortAggregate, which will be faster.
```
Scan(t1)            Scan(t2)
    |                   |
    |                   |
Exchange 1          Exchange 2
    |                   |
Sort(t1.b, t1.a)    Sort(t2.y)
        \              /
    SMJ (t1.b = t2.y)
           |
           |
     SortAggregate
```
### Why are the changes needed?
Optimize HashAggregate
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
### Was this patch authored or co-authored using generative AI tooling?
No
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
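As a hedged repro sketch of the query shape in the description (table contents, sizes, and the local session setup are all made up for illustration), the plan can be inspected with a plain PySpark session; whether the aggregate shows up as a HashAggregate or a SortAggregate depends on the Spark version and on this PR:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a sort-merge join

spark.range(1000).selectExpr("id % 10 AS a", "id % 100 AS b", "id AS c") \
    .createOrReplaceTempView("t1")
spark.range(1000).selectExpr("id AS x", "id % 100 AS y", "id AS z") \
    .createOrReplaceTempView("t2")

spark.sql("""
    SELECT a, b, sum(c)
    FROM t1 JOIN t2 ON t1.b = t2.y
    GROUP BY a, b
""").explain()
# Without the change, the streamed side is sorted on t1.b only and the plan
# shows HashAggregate; sorting on (t1.b, t1.a) would satisfy a SortAggregate.
```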
[GitHub] [spark] yaooqinn opened a new pull request, #43016: [SPARK-45077][UI][FOLLOWUP] Update comment to link the forked repo yaooqinn/dagre-d3
yaooqinn opened a new pull request, #43016: URL: https://github.com/apache/spark/pull/43016
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331200011
## sql/gen-sql-functions-docs.py: ## @@ -34,6 +34,8 @@ "math_funcs", "conditional_funcs", "generator_funcs", "predicate_funcs", "string_funcs", "misc_funcs", "bitwise_funcs", "conversion_funcs", "csv_funcs", +"xml_funcs", "lambda_funcs", "collection_funcs", +"url_funcs", "hash_funcs",
Review Comment: Manually checked with
```
ag --scala 'group = \"' sql
```
All groups except `table_funcs` should be here.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331077927 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: Ping @peter-toth -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #42994: [SPARK-43433][PS] Match `GroupBy.nth` behavior to the latest Pandas
itholic commented on code in PR #42994: URL: https://github.com/apache/spark/pull/42994#discussion_r1331072003
## python/pyspark/pandas/groupby.py: ## @@ -1155,14 +1152,32 @@ def nth(self, n: int) -> FrameLike: else: sdf = sdf.select(*groupkey_names).distinct() -internal = internal.copy( +agg_columns = [] +if not self._agg_columns_selected:
Review Comment: The `groupkeys` become data columns instead of the index when they are not included in `agg_columns_selected`. And as of pandas 2.0.0, `nth` should keep the index unchanged. Therefore, we should include the `groupkeys` in `agg_columns` to make them data columns, and create a new `InternalFrame` manually (see the small pandas illustration below).
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
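A small pandas illustration (assuming pandas >= 2.0) of the behavior being matched: `nth()` keeps the original index and leaves the group keys as a regular data column instead of moving them into the index.
```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [None, 2, 3, 4, 5]})

# Last row of each group; the original row labels (3 and 4) survive,
# and "A" stays a data column rather than becoming the index.
print(df.groupby("A").nth(-1))
#    A    B
# 3  1  4.0
# 4  2  5.0
```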
[GitHub] [spark] MaxGekk opened a new pull request, #43014: [WIP][CONNECT][PYTHON] Support map and array parameters by `sql()`
MaxGekk opened a new pull request, #43014: URL: https://github.com/apache/spark/pull/43014
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in tests. So disable this by default. It should not be an issue after we migrate to use FetchErrorDetails RPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()`
MaxGekk closed pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()` URL: https://github.com/apache/spark/pull/42996 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331213799
## sql/gen-sql-functions-docs.py: ## @@ -34,6 +34,8 @@ "math_funcs", "conditional_funcs", "generator_funcs", "predicate_funcs", "string_funcs", "misc_funcs", "bitwise_funcs", "conversion_funcs", "csv_funcs", +"xml_funcs", "lambda_funcs", "collection_funcs", +"url_funcs", "hash_funcs", "struct_funcs",
Review Comment: Checked against https://github.com/apache/spark/blob/37ab190dc5bfa59b4e06af9551c35ab179a05733/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L43-L48. Two differences: 1. `table_funcs`: not supported in `gen-sql-functions-docs.py`; 2. `binary_funcs`: we don't have any function using this group.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cloud-fan commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331684745
## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel)
Review Comment:
> does this PR just leverage a feature that SC already have?

Yes. The problem is that now we enable this feature by default, while this feature can lead to memory leaks. That's why I asked how the thriftserver leverages this feature. After this PR, all queries will use this feature, not only the thriftserver.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
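For readers unfamiliar with the SparkContext feature this thread is about: registering a job group with `interruptOnCancel=True` makes `cancelJobGroup()` also interrupt the running task threads. A minimal PySpark sketch; the group id and the workload are illustrative:
```python
import threading
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

# Jobs submitted from this thread now belong to "demo-group"; cancellation
# will send Thread.interrupt() to their task threads.
sc.setJobGroup("demo-group", "long running job", interruptOnCancel=True)

def cancel_later():
    time.sleep(2)
    sc.cancelJobGroup("demo-group")

threading.Thread(target=cancel_later).start()
try:
    sc.parallelize(range(4), 4).foreach(lambda _: time.sleep(60))
except Exception as e:
    print("job cancelled:", type(e).__name__)
finally:
    spark.stop()
```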
[GitHub] [spark] LuciferYang commented on pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on PR #43005: URL: https://github.com/apache/spark/pull/43005#issuecomment-1727201913 wait https://github.com/apache/spark/pull/43008 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage
dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage URL: https://github.com/apache/spark/pull/43007 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
LuciferYang commented on PR #43008: URL: https://github.com/apache/spark/pull/43008#issuecomment-1727210378 cc @dongjoon-hyun FYI. Do we need to further split this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case
cloud-fan commented on PR #42971: URL: https://github.com/apache/spark/pull/42971#issuecomment-1727363616 The failed streaming test is unrelated, I'm merging it to master/3.5, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo is about to exceed the header limit even in tests. So disable this by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
zhengruifeng commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727396313 `/mllib` in Scala, and `pyspark.ml` and `pyspark.mllib` in Python, don't work on Connect. Only the new module `pyspark.ml.connect` works on Connect. `pyspark.ml` contains many classification/clustering/etc. algorithms; they are not supported in `pyspark.ml.connect`. cc @WeichenXu123 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on code in PR #43005: URL: https://github.com/apache/spark/pull/43005#discussion_r1331245458 ## .github/workflows/build_coverage.yml: ## @@ -17,7 +17,7 @@ # under the License. # -name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)" +name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)" Review Comment: good catch ~ thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
cloud-fan commented on PR #42997: URL: https://github.com/apache/spark/pull/42997#issuecomment-1727800532 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The change is rebased already. I reenabled the pyspark jvm stacktrace flag for tests but disabled this flag in a specific test case instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong opened a new pull request, #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize
heyihong opened a new pull request, #43017: URL: https://github.com/apache/spark/pull/43017
### What changes were proposed in this pull request?
- Reduce default spark.connect.jvmStacktrace.maxSize
### Why are the changes needed?
- `spark.sql.pyspark.jvmStacktrace.enabled` is partially broken (i.e. hitting the 8K header limit) and has to be disabled in some tests
### Does this PR introduce _any_ user-facing change?
- No
### How was this patch tested?
- Existing tests
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala: ## @@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils { new SparkArrayIndexOutOfBoundsException(message)), errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)), errorConstructor((message, cause) => new SparkRuntimeException(message, cause)), -errorConstructor((message, cause) => new SparkUpgradeException(message, cause))) - - private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = { -val classes = - mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]]) +errorConstructor((message, cause) => new SparkUpgradeException(message, cause)), +errorConstructor((message, cause) => new SparkException(message, cause.orNull))) + + /** + * errorsToThrowable reconstructs the exception based on a list of protobuf messages + * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace + * (if set). + */ + private def errorsToThrowable( + errorIdx: Int, + errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = { + +val error = errors(errorIdx) + +val classHierarchy = error.getErrorTypeHierarchyList.asScala + +val constructor = + classHierarchy +.flatMap(errorFactory.get) +.headOption +.getOrElse((message: String, cause: Option[Throwable]) => + new SparkException( Review Comment: This is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in the tests I added. So disable this by default. It should not be an issue after we migrate to use FetchErrorDetails RPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1727813318 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hdaikoku commented on a diff in pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry
hdaikoku commented on code in PR #42426: URL: https://github.com/apache/spark/pull/42426#discussion_r1331742974 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java: ## @@ -274,7 +287,13 @@ private void handleBlockTransferFailure(String blockId, Throwable exception) { synchronized (RetryingBlockTransferor.this) { if (this == currentListener && outstandingBlocksIds.contains(blockId)) { if (shouldRetry(exception)) { -initiateRetry(exception); +try { + initiateRetry(exception); +} catch (Throwable t) { + logger.error("Exception while trying to initiate retry", t); Review Comment: Thank you for the suggestion, refactored accordingly: [32498ff](https://github.com/apache/spark/pull/42426/commits/32498ff9a906a21746b619f1ddd316777d188417) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on PR #42949: URL: https://github.com/apache/spark/pull/42949#issuecomment-1727917945 Updated fork's master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331785918
## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment: I don't think we can do something like that here. We don't have a server-side error response to process; this is a purely client-side error, an exception thrown while grpc consumes the iterator of requests. It's not possible to extract any information outside of the grpc call since grpc suppresses the error entirely. I think the only option for producing proper exception classes is to preload all the artifacts into memory and construct the corresponding requests before streaming them into grpc.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
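The problem under discussion is easy to reproduce without Spark. A standalone sketch of the pattern in the diff: when a generator of requests raises, grpc surfaces only a generic "Exception iterating requests!" to the caller, so the client logs the real cause before it is lost. The names below are illustrative, not the actual client code:
```python
import logging

logger = logging.getLogger("sketch")

def parse_artifacts(path):
    raise FileNotFoundError(path)  # e.g. the user passed a bad local path
    yield                          # never reached; makes this a generator

def create_requests(path):
    try:
        yield from parse_artifacts(path)
    except Exception as e:
        # Log the real error here; a grpc stub consuming this iterator would
        # otherwise report only "Exception iterating requests!".
        logger.error(f"Failed to submit addArtifacts request: {e}")
        raise

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    try:
        list(create_requests("/tmp/missing_artifact.py"))
    except FileNotFoundError:
        pass  # the original exception type is preserved for direct consumers
```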
[GitHub] [spark] cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions URL: https://github.com/apache/spark/pull/42864 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331894962 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: Could you be more specific about what `Exception` can be thrown here? From your PR description, it looks like a gRPC error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
yaooqinn commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331832544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel) Review Comment: STS calls cancelJobGroup after a query fails -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun commented on PR #43015: URL: https://github.com/apache/spark/pull/43015#issuecomment-1727988603 Merged to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #43018: [SPARK-45241][INFRA] Use Zulu JDK in java-other-versions pipeline and Java 21
dongjoon-hyun opened a new pull request, #43018: URL: https://github.com/apache/spark/pull/43018 …
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB` URL: https://github.com/apache/spark/pull/43015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1728048096 Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901152 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: In my example I'm intercepting a `FileNotFoundError` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment:
> Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a gRPC error.

Doesn't matter; any exception thrown here would be suppressed by grpc and only a generic "Fail to iterate response" exception will be visible.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901403 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment:
> What's the behavior after your change when the SPARK_CONNECT logger is enabled?

described in pr
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331899412 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment: I just think we shouldn't leave users with this error message when they haven't enabled the Spark Connect logger:
```
status = StatusCode.UNKNOWN
details = "Exception iterating requests!"
debug_error_string = "None"
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331897585

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
What's the behavior after your change when the SPARK_CONNECT logger is enabled?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
> Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a gRPC error.

It doesn't matter: any exception thrown here is suppressed by gRPC, and only the generic "Exception iterating requests" error will be visible.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
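This suppression is easy to reproduce without a server. In the sketch below, `fake_grpc_consume` is a stand-in for gRPC's client-streaming machinery (all names hypothetical); it shows why a log line inside the generator is the only place the real cause stays visible:

```python
import logging

logger = logging.getLogger("SPARK_CONNECT")  # assumed logger name for the sketch
logging.basicConfig(level=logging.ERROR)

def fake_grpc_consume(request_iterator):
    # Stand-in for gRPC's client-streaming machinery: it drains the request
    # iterator and replaces any failure with a generic, opaque error.
    try:
        for _ in request_iterator:
            pass
    except Exception:
        raise RuntimeError('details = "Exception iterating requests!"')

def create_requests(paths):
    try:
        for p in paths:
            if not p.endswith(".py"):
                raise ValueError(f"cannot parse artifact: {p}")
            yield p
    except Exception as e:
        # The log line is the only place the real cause survives.
        logger.error(f"Failed to submit addArtifacts request: {e}")
        raise

try:
    fake_grpc_consume(create_requests(["a.py", "b.jar"]))
except RuntimeError as e:
    print(e)  # the generic message is all the caller ever sees
```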
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
allisonwang-db commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1331920397

## python/pyspark/sql/tests/connect/test_connect_basic.py:
## @@ -1237,13 +1237,23 @@ def test_sql(self):
         self.assertEqual(1, len(pdf.index))

     def test_sql_with_named_args(self):
-        df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
-        df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
+        from pyspark.sql.functions import create_map, lit
+        from pyspark.sql.connect.functions import lit as clit
+        from pyspark.sql.connect.functions import create_map as ccreate_map

Review Comment:
@zhengruifeng just wondering, should `SF` and `CF` be capitalized here?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
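The question refers to the PySpark test convention of importing the two function modules under short aliases rather than renaming individual functions. A hypothetical reshaping of the imports above under that convention (`SF`/`CF` are the aliases being discussed, not code from the PR):

```python
# Module-level aliases instead of per-function renames (lit -> clit, etc.).
from pyspark.sql import functions as SF            # classic, py4j-backed functions
from pyspark.sql.connect import functions as CF    # their Spark Connect counterparts

# With both a classic and a connect session available, the named sql()
# arguments in the test would then read symmetrically, e.g.:
#   args  = {"m": SF.create_map(SF.lit("a"), SF.lit(1))}
#   cargs = {"m": CF.create_map(CF.lit("a"), CF.lit(1))}
```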
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
allisonwang-db commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331925178

## sql/gen-sql-functions-docs.py:
## @@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs", "struct_funcs",

Review Comment:
QQ: For `generator_funcs`, do we have documentation for them when used in the FROM clause of a query? Functions like `explode` are typically considered table-valued generator functions.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
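For context on the question: a generator function such as `explode` can appear in the SELECT list or be applied to a relation via LATERAL VIEW, and recent Spark versions also accept it directly in the FROM clause as a table-valued generator. A small illustration (assumes a local Spark session; the output shape is the point, not the data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generator function in the SELECT list:
spark.sql("SELECT explode(array(10, 20)) AS value").show()

# LATERAL VIEW form against an existing relation:
spark.sql("""
    SELECT id, x
    FROM VALUES (1, array(1, 2)) AS t(id, arr)
    LATERAL VIEW explode(arr) AS x
""").show()

# On recent Spark versions the generator can also sit directly in FROM
# as a table-valued function:
spark.sql("SELECT * FROM explode(array(10, 20))").show()
```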
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331929198

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
> I just don't think we should leave users with this error message when they don't enable the Spark Connect logger:

We can wrap the exception on the outside in some SparkException saying that AddArtifacts failed. WDYT?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
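A minimal sketch of that wrapping idea, applied at the call site where the gRPC error surfaces (the exception class name and message are placeholders, not the PR's actual choice):

```python
import grpc  # assumes grpcio is installed

class AddArtifactsError(RuntimeError):
    """Placeholder for whichever Spark exception type the PR settles on."""

def submit(stub_call, requests):
    try:
        return stub_call(requests)
    except grpc.RpcError as e:
        # Replace the opaque "Exception iterating requests!" failure with an
        # actionable message, chaining the original error as the cause.
        raise AddArtifactsError(
            "addArtifacts failed; enable the SPARK_CONNECT logger for details"
        ) from e
```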
[GitHub] [spark] allisonwang-db opened a new pull request, #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed
allisonwang-db opened a new pull request, #43019: URL: https://github.com/apache/spark/pull/43019

### What changes were proposed in this pull request?
This PR refines the docstrings of `DataFrame.withColumnRenamed` and `DataFrame.withColumnsRenamed`.

### Why are the changes needed?
To improve the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest

### Was this patch authored or co-authored using generative AI tooling?
No

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
allisonwang-db commented on code in PR #42997: URL: https://github.com/apache/spark/pull/42997#discussion_r1331943027

## python/pyspark/sql/connect/functions.py:
## @@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
     if seed is not None:
         return _invoke_function("rand", lit(seed))
     else:
-        return _invoke_function("rand")
+        return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
These are Spark Connect functions. Do we also need to update `python/pyspark/sql/functions.py`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
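The pattern under review pins a client-side seed whenever the caller omits one, so that every evaluation of the same plan draws from the same stream. A standalone sketch of just that defaulting logic (names hypothetical):

```python
import random
import sys

def resolve_seed(seed=None):
    # Fix the seed once, at expression-construction time, instead of letting
    # each (re-)execution draw its own, which is what made the unseeded
    # rand()/randn() APIs non-deterministic across retries and recomputations.
    return seed if seed is not None else random.randint(0, sys.maxsize)

s1 = resolve_seed()    # chosen once and reused for every evaluation of the plan
s2 = resolve_seed(42)  # explicit seeds pass through unchanged
print(s1, s2)
```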