[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031155887


##
project/SparkBuild.scala:
##
@@ -607,9 +607,25 @@ object SparkParallelTestGrouping {
 
 object Core {
   import scala.sys.process.Process
+  import BuildCommons.protoVersion
   def buildenv = Process(Seq("uname")).!!.trim.replaceFirst("[^A-Za-z0-9].*", "").toLowerCase
   def bashpath = Process(Seq("where", "bash")).!!.split("[\r\n]+").head.replace('\\', '/')
   lazy val settings = Seq(
+    // Setting version for the protobuf compiler. This has to be propagated to every sub-project
+    // even if the project is not using it.
+    PB.protocVersion := BuildCommons.protoVersion,
+    // For some reason the resolution from the imported Maven build does not work for some
+    // of these dependencies that we need to shade later on.
+    libraryDependencies ++= {
+      Seq(
+        "io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(),
+        "com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf"
+      )
+    },
+    (Compile / PB.targets) := Seq(
+      PB.gens.java -> (Compile / sourceManaged).value,

Review Comment:
   Also need `assembly` and `rename` for protobuf?
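
   For context, a hedged sketch of the kind of sbt-assembly shading this comment may be alluding to (mirroring the style of other modules' settings in `SparkBuild.scala`; the shaded package name is illustrative, not from this PR):

   ```scala
   // Relocate the bundled protobuf classes in the assembly jar so they
   // cannot clash with a user-provided protobuf version.
   (assembly / assemblyShadeRules) := Seq(
     ShadeRule.rename(
       "com.google.protobuf.**" -> "org.sparkproject.spark_core.protobuf.@1"
     ).inAll
   )
   ```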






[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-11-23 Thread GitBox


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1031150953


##
python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py:
##
@@ -165,100 +148,191 @@ def merge_pandas(lft, _):
         )
 
     def test_apply_in_pandas_not_returning_pandas_dataframe(self):
-        left = self.data1
-        right = self.data2
+        self._test_merge_error(
+            fn=lambda lft, rgt: lft.size + rgt.size,
+            error_class=PythonException,
+            error_message_regex="Return type of the user-defined function "
+            "should be pandas.DataFrame, but is <class 'int'>",
+        )
+
+    def test_apply_in_pandas_returning_column_names(self):
+        self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", "k"]))
 
+    def test_apply_in_pandas_returning_no_column_names(self):
         def merge_pandas(lft, rgt):
-            return lft.size + rgt.size
+            res = pd.merge(lft, rgt, on=["id", "k"])
+            res.columns = range(res.columns.size)
+            return res
 
-        with QuietTest(self.sc):
-            with self.assertRaisesRegex(
-                PythonException,
-                "Return type of the user-defined function should be pandas.DataFrame, "
-                "but is <class 'int'>",
-            ):
-                (
-                    left.groupby("id")
-                    .cogroup(right.groupby("id"))
-                    .applyInPandas(merge_pandas, "id long, k int, v int, v2 int")
-                    .collect()
-                )
+        self._test_merge(fn=merge_pandas)
 
-    def test_apply_in_pandas_returning_wrong_number_of_columns(self):
-        left = self.data1
-        right = self.data2
+    def test_apply_in_pandas_returning_column_names_sometimes(self):
+        def merge_pandas(lft, rgt):
+            res = pd.merge(lft, rgt, on=["id", "k"])
+            if 0 in lft["id"] and lft["id"][0] % 2 == 0:
+                return res
+            res.columns = range(res.columns.size)
+            return res
+
+        self._test_merge(fn=merge_pandas)
 
+    def test_apply_in_pandas_returning_wrong_column_names(self):
         def merge_pandas(lft, rgt):
             if 0 in lft["id"] and lft["id"][0] % 2 == 0:
                 lft["add"] = 0
             if 0 in rgt["id"] and rgt["id"][0] % 3 == 0:
                 rgt["more"] = 1
             return pd.merge(lft, rgt, on=["id", "k"])
 
-        with QuietTest(self.sc):
-            with self.assertRaisesRegex(
-                PythonException,
-                "Number of columns of the returned pandas.DataFrame "
-                "doesn't match specified schema. Expected: 4 Actual: 6",
-            ):
-                (
-                    # merge_pandas returns two columns for even keys while we set schema to four
-                    left.groupby("id")
-                    .cogroup(right.groupby("id"))
-                    .applyInPandas(merge_pandas, "id long, k int, v int, v2 int")
-                    .collect()
-                )
+        self._test_merge_error(
+            fn=merge_pandas,
+            error_class=PythonException,
+            error_message_regex="Column names of the returned pandas.DataFrame "
+            "do not match specified schema.  Unexpected: add, more  Schema: id, k, v, v2\n",
+        )
 
-    def test_apply_in_pandas_returning_empty_dataframe(self):
-        left = self.data1
-        right = self.data2
+        # with very large schema, missing and unexpected is limited to 5
+        # and the schema is abbreviated in the error message
+        schema = "id long, k long, mean double, " + ", ".join(
+            f"column_with_long_column_name_{no} integer" for no in range(35)
+        )
+        self._test_merge_error(
+            fn=lambda lft, rgt: pd.DataFrame(
+                [
+                    (
+                        lft.id,
+                        lft.k,
+                        lft.v.mean(),
+                    )
+                    + tuple(lft.v.mean() for _ in range(7))
+                ],
+                columns=["id", "k", "mean"] + [f"extra_column_{no} integer" for no in range(7)],
+            ),
+            output_schema=schema,
+            error_class=PythonException,
+            error_message_regex="Column names of the returned pandas\\.DataFrame "
+            "do not match specified schema\\.  "
+            "Missing \\(first 5 of 35\\): column_with_long_column_name_0,"
+            " column_with_long_column_name_1, column_with_long_column_name_10,"
+            " column_with_long_column_name_11, column_with_long_column_name_12  "
+            "Unexpected \\(first 5 of 7\\): extra_column_0 integer, extra_column_1 integer,"
+            " extra_column_2 integer, extra_column_3 integer, extra_column_4 integer  "
+            "Schema: id, k, mean, column_with_long_column_name_0, column_with_long_column_name_1,"
+            "

[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-11-23 Thread GitBox


EnricoMi commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1031150574


##
python/pyspark/worker.py:
##
@@ -146,7 +146,74 @@ def verify_result_type(result):
     )
 
 
-def wrap_cogrouped_map_pandas_udf(f, return_type, argspec):
+def verify_pandas_result(result, return_type, assign_cols_by_name):
+    import pandas as pd
+
+    if not isinstance(result, pd.DataFrame):
+        raise TypeError(
+            "Return type of the user-defined function should be "
+            "pandas.DataFrame, but is {}".format(type(result))
+        )
+
+    # check the schema of the result only if it is not empty or has columns
+    if not result.empty or len(result.columns) != 0:
+        # if any column name of the result is a string
+        # the column names of the result have to match the return type
+        #   see create_array in pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer
+        field_names = set([field.name for field in return_type.fields])
+        column_names = set(result.columns)
+        if (
+            assign_cols_by_name
+            and any(isinstance(name, str) for name in result.columns)
+            and column_names != field_names
+        ):
+            limit = 5

Review Comment:
   @HyukjinKwon I have removed the schema and columns limit.






[GitHub] [spark] HyukjinKwon commented on pull request #38787: [SPARK-41251][PS][INFRA] Upgrade pandas from 1.5.1 to 1.5.2

2022-11-23 Thread GitBox


HyukjinKwon commented on PR #38787:
URL: https://github.com/apache/spark/pull/38787#issuecomment-1326072793

   cc @Yikun @itholic @xinrong-meng FYI





[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031148805


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <configuration>
+          <shadedArtifactAttached>false</shadedArtifactAttached>
+          <shadeTestJar>true</shadeTestJar>
+          <artifactSet>
+            <includes>
+              <include>com.google.protobuf:*</include>
+            </includes>
+          </artifactSet>
+          <relocations>
+            <relocation>
+              <pattern>com.google.protobuf</pattern>
+              <shadedPattern>${spark.shade.packageName}.spark-core.protobuf</shadedPattern>
+              <includes>
+                <include>com.google.protobuf.**</include>
+              </includes>
+            </relocation>
+          </relocations>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.github.os72</groupId>
+        <artifactId>protoc-jar-maven-plugin</artifactId>
+        <version>3.11.4</version>
+        <executions>
+          <execution>
+            <phase>generate-test-sources</phase>
Review Comment:
   `generate-test-sources`?






[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031147004


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>

Review Comment:
   SPARK-40593 and SPARK-41215 did some work to support compiling Spark 3.4 on CentOS 6 and 7. I think the core module needs similar work, but it may not belong in this PR.






[GitHub] [spark] zhengruifeng commented on pull request #38774: [SPARK-41240][CONNECT][BUILD][INFRA] Upgrade `Protobuf` to 3.19.5

2022-11-23 Thread GitBox


zhengruifeng commented on PR #38774:
URL: https://github.com/apache/spark/pull/38774#issuecomment-1326069519

   we may also need to change this place: https://github.com/apache/spark/blob/master/connector/connect/src/main/buf.gen.yaml#L30
   
   cc @grundprinzip 





[GitHub] [spark] panbingkun opened a new pull request, #38787: [SPARK-41251][PS][INFRA] Upgrade pandas from 1.5.1 to 1.5.2

2022-11-23 Thread GitBox


panbingkun opened a new pull request, #38787:
URL: https://github.com/apache/spark/pull/38787

   ### What changes were proposed in this pull request?
   This PR proposes upgrading pandas from 1.5.1 to 1.5.2 for the pandas API on Spark.
   The new version of pandas (1.5.2) was released on Nov 22, 2022.
   Release Notes: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.5.2.html
   
   ### Why are the changes needed?
   We should follow the behavior of the latest pandas and support it.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.





[GitHub] [spark] zhengruifeng commented on pull request #38770: [SPARK-41238][CONNECT][PYTHON] Support more built-in datatypes

2022-11-23 Thread GitBox


zhengruifeng commented on PR #38770:
URL: https://github.com/apache/spark/pull/38770#issuecomment-1326058481

   > Can you test both nullable=true and nullable=false case?
   
   the tests added in `test_schema` cover those cases





[GitHub] [spark] wilfred-s commented on pull request #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs

2022-11-23 Thread GitBox


wilfred-s commented on PR #38780:
URL: https://github.com/apache/spark/pull/38780#issuecomment-1326057909

   already did that directly after it was logged





[GitHub] [spark] LuciferYang commented on a diff in pull request #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38782:
URL: https://github.com/apache/spark/pull/38782#discussion_r1031134476


##
sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:
##
@@ -637,6 +637,33 @@ class QueryExecutionErrorsSuite
       sqlState = "0A000")
   }
 
+  test("FAILED_RENAME_PATH: rename when destination path already exists") {
+    var srcPath: Path = null
+    var dstPath: Path = null
+
+    val e = intercept[SparkFileAlreadyExistsException](
+      withTempPath { p =>
+        val basePath = new Path(p.getAbsolutePath)
+        val fm = new FileSystemBasedCheckpointFileManager(basePath, new Configuration())
+        srcPath = new Path(s"$basePath/src")
+        fm.createAtomic(srcPath, overwriteIfPossible = true).close()
+        dstPath = new Path(s"$basePath/dst")
+        fm.createAtomic(dstPath, overwriteIfPossible = true).close()
+
+        fm.renameTempFile(srcPath, dstPath, false)

Review Comment:
   Could we add this test case through a user-facing API instead?






[GitHub] [spark] LuciferYang commented on pull request #38598: [SPARK-41097][CORE][SQL][SS][PROTOBUF] Remove redundant collection conversion base on Scala 2.13 code

2022-11-23 Thread GitBox


LuciferYang commented on PR #38598:
URL: https://github.com/apache/spark/pull/38598#issuecomment-1326049324

   ready to merge





[GitHub] [spark] itholic commented on pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`

2022-11-23 Thread GitBox


itholic commented on PR #38664:
URL: https://github.com/apache/spark/pull/38664#issuecomment-1326044896

   Thanks for the review, @MaxGekk 
   Just addressed the comments!





[GitHub] [spark] gengliangwang commented on pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector

2022-11-23 Thread GitBox


gengliangwang commented on PR #38783:
URL: https://github.com/apache/spark/pull/38783#issuecomment-1326027832

   > shall we also shade protobuf in Spark Core in this PR? Then the proto version is unified in all the places.
   
   In this PR there is no actual usage. I would prefer to shade it in https://github.com/apache/spark/pull/38779.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38768:
URL: https://github.com/apache/spark/pull/38768#discussion_r1031102422


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -51,24 +52,11 @@
 
 
 class GroupingFrame(object):

Review Comment:
   not related to this PR, but shall we rename it `GroupedData` to match pyspark?



##
python/pyspark/sql/connect/dataframe.py:
##
@@ -164,8 +156,20 @@ def selectExpr(self, *expr: Union[str, List[str]]) -> "DataFrame":
 
         return DataFrame.withPlan(plan.Project(self._plan, *sql_expr), session=self._session)
 
-    def agg(self, exprs: Optional[GroupingFrame.MeasuresType]) -> "DataFrame":
-        return self.groupBy().agg(exprs)
+    def agg(self, *exprs: Union[Expression, Dict[str, str]]) -> "DataFrame":
+        if not exprs:
+            raise ValueError("Argument 'exprs' must not be empty")
+
+        if len(exprs) == 1 and isinstance(exprs[0], dict):
+            measures = []
+            for e, fun in exprs[0].items():
+                measures.append(ScalarFunctionExpression(fun, Column(e)))

Review Comment:
   ```suggestion
            measures = [ScalarFunctionExpression(f, Column(e)) for e, f in exprs[0].items()]
   ```






[GitHub] [spark] cloud-fan commented on pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector

2022-11-23 Thread GitBox


cloud-fan commented on PR #38783:
URL: https://github.com/apache/spark/pull/38783#issuecomment-1326008658

   shall we also shade protobuf in Spark Core in this PR?





[GitHub] [spark] cloud-fan commented on a diff in pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38783:
URL: https://github.com/apache/spark/pull/38783#discussion_r1031094917


##
pom.xml:
##
@@ -118,7 +118,10 @@
 2.19.0
 
 3.3.4
-2.5.0
+
+
2.5.0
+

Review Comment:
   ```suggestion
   
   ```






[GitHub] [spark] MaxGekk closed pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092

2022-11-23 Thread GitBox


MaxGekk closed pull request #38710: [SPARK-41179][SQL] Assign a name to the 
error class _LEGACY_ERROR_TEMP_1092
URL: https://github.com/apache/spark/pull/38710





[GitHub] [spark] MaxGekk commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092

2022-11-23 Thread GitBox


MaxGekk commented on PR #38710:
URL: https://github.com/apache/spark/pull/38710#issuecomment-1326004383

   +1, LGTM. Merging to master.
   Thank you, @panbingkun.





[GitHub] [spark] amaliujia commented on pull request #38786: [SPARK-41250][CONNECT][PYTHON] DataFrame.to_pandas should not return optional pandas dataframe

2022-11-23 Thread GitBox


amaliujia commented on PR #38786:
URL: https://github.com/apache/spark/pull/38786#issuecomment-1326002119

   @zhengruifeng @HyukjinKwon 
   cc @xinrong-meng @grundprinzip 





[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

2022-11-23 Thread GitBox


MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031089197


##
core/src/main/resources/error/error-classes.json:
##
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   > "anyvalue(c2)" .. "any_value(c2)"
   
   hmm, could you open a PR to override `prettyName` in `AnyValue`, and pass `prettyName` to `FirstLast.validateIgnoreNullExpr()`? Also, cc @vitaliili-db
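
   A self-contained toy illustrating the suggestion (not Spark's real class hierarchy; in Spark the default `prettyName` is derived from the node name):

   ```scala
   // Sketch: prettyName defaults to the lower-cased class name, so the
   // AnyValue expression renders as "anyvalue" unless overridden.
   trait Expr {
     def prettyName: String = getClass.getSimpleName.toLowerCase
   }

   case class AnyValue(child: String) extends Expr {
     // The suggested override, so error messages show "any_value(...)".
     override def prettyName: String = "any_value"
   }

   object PrettyNameDemo extends App {
     println(AnyValue("c2").prettyName) // prints: any_value
   }
   ```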






[GitHub] [spark] amaliujia opened a new pull request, #38786: [SPARK-41250][CONNECT][PYTHON] DataFrame.to_pandas should not return optional pandas dataframe

2022-11-23 Thread GitBox


amaliujia opened a new pull request, #38786:
URL: https://github.com/apache/spark/pull/38786

   
   
   ### What changes were proposed in this pull request?
   
   The server guarantees to send at least one Arrow batch with the schema, even when the result is empty. Thus `DataFrame.to_pandas` can always return a pandas DataFrame.
   
   This PR decouples the client-side execution paths for `Command` and `Relation` to remove `Optional` from the return type of `DataFrame.to_pandas`.
   
   ### Why are the changes needed?
   
   API coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   UT





[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

2022-11-23 Thread GitBox


itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666


##
core/src/main/resources/error/error-classes.json:
##
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and the first <expression> is also affected, so it looks a bit weird IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   1. Maybe we should separate the parameters into two parts, something like <expression> and <expressionAnyValue>?
   
   ```scala
   new AnalysisException(
     errorClass = "MISSING_AGGREGATION",
     messageParameters = Map(
       "expression" -> toSQLExpr(expression),
       "expressionAnyValue" -> toSQLExpr(new AnyValue(expression))))
   ```
   
   2. Or maybe we can simply recommend using the `any_value()` function, as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?









[GitHub] [spark] itholic commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

2022-11-23 Thread GitBox


itholic commented on PR #38769:
URL: https://github.com/apache/spark/pull/38769#issuecomment-1325994942

   Fixed the Python test first.





[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

2022-11-23 Thread GitBox


itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031085119


##
core/src/main/resources/error/error-classes.json:
##
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Or maybe we can simply recommend using the `any_value()` function, as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?









[GitHub] [spark] MaxGekk closed pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042

2022-11-23 Thread GitBox


MaxGekk closed pull request #38707: [SPARK-41176][SQL] Assign a name to the 
error class _LEGACY_ERROR_TEMP_1042
URL: https://github.com/apache/spark/pull/38707





[GitHub] [spark] MaxGekk commented on pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042

2022-11-23 Thread GitBox


MaxGekk commented on PR #38707:
URL: https://github.com/apache/spark/pull/38707#issuecomment-1325992375

   +1, LGTM. Merging to master.
   Thank you, @panbingkun and @srielau for review.








[GitHub] [spark] MaxGekk commented on pull request #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH

2022-11-23 Thread GitBox


MaxGekk commented on PR #38782:
URL: https://github.com/apache/spark/pull/38782#issuecomment-1325989625

   cc @panbingkun @LuciferYang @itholic 





[GitHub] [spark] MaxGekk commented on pull request #38766: [MINOR][SQL] Fix error message for `UNEXPECTED_INPUT_TYPE`

2022-11-23 Thread GitBox


MaxGekk commented on PR #38766:
URL: https://github.com/apache/spark/pull/38766#issuecomment-1325988492

   +1, LGTM. Merged to master.
   Thank you, @itholic.





[GitHub] [spark] MaxGekk closed pull request #38766: [MINOR][SQL] Fix error message for `UNEXPECTED_INPUT_TYPE`

2022-11-23 Thread GitBox


MaxGekk closed pull request #38766: [MINOR][SQL] Fix error message for 
`UNEXPECTED_INPUT_TYPE`
URL: https://github.com/apache/spark/pull/38766





[GitHub] [spark] MaxGekk commented on a diff in pull request #38784: [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646

2022-11-23 Thread GitBox


MaxGekk commented on code in PR #38784:
URL: https://github.com/apache/spark/pull/38784#discussion_r1031078969


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3629,6 +3629,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val JSON_ENABLE_PARTIAL_RESULTS =
+    buildConf("spark.sql.json.enablePartialResults")
+      .internal()
+      .doc("When set to true, enables partial results for structs, maps, and arrays in JSON " +
+        "when one or more fields do not match the schema")
+      .version("3.4.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   Could you re-generate the results of `JsonBenchmark`? If there are no significant diffs, I am OK with enabling it. Also, it would be nice to add one more benchmark for this particular case of partial results.






[GitHub] [spark] MaxGekk commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`

2022-11-23 Thread GitBox


MaxGekk commented on PR #38772:
URL: https://github.com/apache/spark/pull/38772#issuecomment-1325981880

   > Maybe do we want to consolidate them in one rule ?
   
   I agree, let's consolidate. I would prefer `DATATYPE` since it is shorter ;-)





[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

2022-11-23 Thread GitBox


itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031072613


##
core/src/main/resources/error/error-classes.json:
##
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Cool. Let me update soon.






[GitHub] [spark] itholic commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`

2022-11-23 Thread GitBox


itholic commented on PR #38772:
URL: https://github.com/apache/spark/pull/38772#issuecomment-1325973768

   Btw, we use both `DATA_TYPE` and `DATATYPE` for error class names.
   Maybe do we want to consolidate them in one rule?





[GitHub] [spark] HyukjinKwon closed pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-23 Thread GitBox


HyukjinKwon closed pull request #38659: [SPARK-41114][CONNECT] Support local 
data for LocalRelation
URL: https://github.com/apache/spark/pull/38659





[GitHub] [spark] HyukjinKwon commented on pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation

2022-11-23 Thread GitBox


HyukjinKwon commented on PR #38659:
URL: https://github.com/apache/spark/pull/38659#issuecomment-1325972726

   Merged to master.





[GitHub] [spark] itholic commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`

2022-11-23 Thread GitBox


itholic commented on PR #38772:
URL: https://github.com/apache/spark/pull/38772#issuecomment-1325972686

   > Can't you re-use the existing error class `UNSUPPORTED_DATATYPE`?
   
   Oh... I missed that one. Sounds good!
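
   A hedged sketch of what the re-use could look like (the exception type and parameter name here are assumptions based on the current `UNSUPPORTED_DATATYPE` message template, not the actual patch):

   ```scala
   // Raise the existing error class instead of minting a new one.
   throw new AnalysisException(
     errorClass = "UNSUPPORTED_DATATYPE",
     messageParameters = Map("typeName" -> toSQLType(dataType)))
   ```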





[GitHub] [spark] HeartSaVioR commented on pull request #38785: [SPARK-41249][SS] Add acceptance test for self-union on streaming query

2022-11-23 Thread GitBox


HeartSaVioR commented on PR #38785:
URL: https://github.com/apache/spark/pull/38785#issuecomment-1325972411

   cc. @zsxwing @viirya Please take a look. Thanks!





[GitHub] [spark] HeartSaVioR opened a new pull request, #38785: [SPARK-41249][SS] Add acceptance test for self-union on streaming query

2022-11-23 Thread GitBox


HeartSaVioR opened a new pull request, #38785:
URL: https://github.com/apache/spark/pull/38785

   ### What changes were proposed in this pull request?
   
   This PR proposes to add a new test suite specifically for self-union on streaming queries. The test cases are acceptance tests for four combinations: DSv1 vs DSv2, and the DataStreamReader API vs the table API (a rough sketch of the scenario is included at the end of this message).
   
   ### Why are the changes needed?
   
   This PR brings more test coverage on streaming workloads.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New test suite.
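
   A rough, hedged sketch of the scenario such a test exercises (illustrative Scala, not the PR's actual suite):

   ```scala
   import org.apache.spark.sql.SparkSession

   object SelfUnionSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[2]").getOrCreate()

       // A streaming DataFrame unioned with itself: each input row should
       // appear twice in the output if self-union is planned correctly.
       val stream = spark.readStream.format("rate").load()
       val selfUnion = stream.union(stream)

       val query = selfUnion.writeStream.format("console").start()
       query.awaitTermination(5000) // run briefly for demonstration
       query.stop()
     }
   }
   ```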
   





[GitHub] [spark] ulysses-you commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2022-11-23 Thread GitBox


ulysses-you commented on code in PR #38739:
URL: https://github.com/apache/spark/pull/38739#discussion_r1031054191


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala:
##
@@ -276,9 +276,9 @@ class DecimalPrecisionSuite extends AnalysisTest with BeforeAndAfter {
       val a = AttributeReference("a", DecimalType(3, -10))()
       val b = AttributeReference("b", DecimalType(1, -1))()
       val c = AttributeReference("c", DecimalType(35, 1))()
-      checkType(Multiply(a, b), DecimalType(5, -11))
-      checkType(Multiply(a, c), DecimalType(38, -9))
-      checkType(Multiply(b, c), DecimalType(37, 0))
+      checkType(Multiply(a, b), DecimalType(16, 0))

Review Comment:
   Not sure what you found. I think the reason why Divide and IntegralDivide fail is simple: the SQL standard does not allow negative scale, but we use its definition formula to calculate the result precision and scale, so the result precision can be negative, which is unexpected. Therefore I think the other binary arithmetic operators should also not follow the formula when the scale is negative.
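
   To make the formula concrete (my own arithmetic, not from the PR): for `Multiply`, the SQL-standard result type is `precision = p1 + p2 + 1` and `scale = s1 + s2`, so the negative-scale operands in this test used to produce a negative-scale result:

   ```scala
   // DecimalType(3, -10) * DecimalType(1, -1) under the standard formula.
   def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
     (p1 + p2 + 1, s1 + s2)

   println(multiplyResultType(3, -10, 1, -1)) // (5,-11): the old expected type
   // After this PR the operands are widened first, yielding the valid
   // non-negative-scale DecimalType(16, 0) asserted above.
   ```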






[GitHub] [spark] sadikovi commented on a diff in pull request #38784: [SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646

2022-11-23 Thread GitBox


sadikovi commented on code in PR #38784:
URL: https://github.com/apache/spark/pull/38784#discussion_r1031065321


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3629,6 +3629,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val JSON_ENABLE_PARTIAL_RESULTS =
+    buildConf("spark.sql.json.enablePartialResults")
+      .internal()
+      .doc("When set to true, enables partial results for structs, maps, and arrays in JSON " +
+        "when one or more fields do not match the schema")
+      .version("3.4.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   I am still debating whether to keep this as `true` or `false`. On one hand, when enabled, it fixes the correctness issue of partially parsed JSON records; on the other hand, the change could cause performance issues (the impact is yet to be confirmed).
   
   @MaxGekk I would love to know your thoughts on this. I would prefer to keep it enabled until a benchmark is produced that shows the regression.






[GitHub] [spark] sadikovi opened a new pull request, #38784: [SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646

2022-11-23 Thread GitBox


sadikovi opened a new pull request, #38784:
URL: https://github.com/apache/spark/pull/38784

   
   
   ### What changes were proposed in this pull request?
   
   This PR adds a SQL config `spark.sql.json.enablePartialResults` to control the SPARK-40646 change. This allows us to fall back to the behaviour before that change.
   
   It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with the OSS benchmarks (maybe we need to extend them; I can do that as a follow-up). Regardless, I propose to add a SQL config so the change can be disabled in case of performance degradation during JSON parsing.
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   The SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial-results parsing. Users can disable the feature if they find any performance regressions when reading JSON files.
   
   ### How was this patch tested?
   
   I extended the existing unit tests to run with the flag both enabled and disabled; a usage sketch follows below.
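
   A hedged usage sketch of the new flag (the config key is from this PR; the sample data, schema, and exact null-out behaviour are illustrative of SPARK-40646, not taken from the tests):

   ```scala
   import org.apache.spark.sql.SparkSession

   object PartialResultsSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").getOrCreate()
       import spark.implicits._

       // Opt in to the partial-results parsing introduced by SPARK-40646.
       spark.conf.set("spark.sql.json.enablePartialResults", "true")

       // "y" cannot be parsed as LONG; with partial results enabled, the
       // well-formed sibling fields ("a" and "s.x") are still populated.
       val df = spark.read
         .schema("a LONG, s STRUCT<x: LONG, y: LONG>")
         .json(Seq("""{"a": 1, "s": {"x": 2, "y": "oops"}}""").toDS())
       df.show()
     }
   }
   ```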





[GitHub] [spark] gengliangwang commented on pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


gengliangwang commented on PR #38779:
URL: https://github.com/apache/spark/pull/38779#issuecomment-1325963483

   Pending on https://github.com/apache/spark/pull/38783 





[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


gengliangwang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031060603


##
core/pom.xml:
##
@@ -532,7 +533,12 @@
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-crypto</artifactId>
     </dependency>
-
+    <dependency>
+      <groupId>com.google.protobuf</groupId>

Review Comment:
   Created https://github.com/apache/spark/pull/38783






[GitHub] [spark] gengliangwang opened a new pull request, #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector

2022-11-23 Thread GitBox


gengliangwang opened a new pull request, #38783:
URL: https://github.com/apache/spark/pull/38783

   
   
   ### What changes were proposed in this pull request?
   
   
   Unify the Protobuf versions in Spark connect and Protobuf connector. 
   
   ### Why are the changes needed?
   
   The Protobuf dependencies will have the same behavior. And future upgrades 
will be easier.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   ### How was this patch tested?
   
   GA tests





[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


gengliangwang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031058129


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <configuration>
+          <shadedArtifactAttached>false</shadedArtifactAttached>
+          <shadeTestJar>true</shadeTestJar>
+          <artifactSet>
+            <includes>
+              <include>com.google.protobuf:*</include>
+            </includes>
+          </artifactSet>
+          <relocations>
+            <relocation>
+              <pattern>com.google.protobuf</pattern>
+              <shadedPattern>${spark.shade.packageName}.spark-core.protobuf</shadedPattern>
+              <includes>
+                <include>com.google.protobuf.**</include>
+              </includes>
+            </relocation>
+          </relocations>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.github.os72</groupId>
+        <artifactId>protoc-jar-maven-plugin</artifactId>
+        <version>3.11.4</version>

Review Comment:
   The plugin is already the latest: 
   https://mvnrepository.com/artifact/com.github.os72/protoc-jar-maven-plugin
   
   But yes, we can unify the version with the one in the protobuf connector.






[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


gengliangwang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031056472


##
core/pom.xml:
##
@@ -532,7 +533,12 @@
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-crypto</artifactId>
     </dependency>
-
+    <dependency>
+      <groupId>com.google.protobuf</groupId>

Review Comment:
   This is for the core module. Do you mean putting the changes to the other pom 
files into an individual PR?






[GitHub] [spark] cloud-fan commented on pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on PR #38750:
URL: https://github.com/apache/spark/pull/38750#issuecomment-1325955535

   cc @MaxGekk @gengliangwang 





[GitHub] [spark] cloud-fan commented on pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on PR #38750:
URL: https://github.com/apache/spark/pull/38750#issuecomment-1325955422

   Nice refactor! We should have done this earlier, before adding the ANSI 
interval types and timestamp ntz. Now we can have more confidence in these new 
data types.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031053828


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.types
+
+import org.apache.spark.sql.types._
+
+sealed abstract class PhysicalDataType {}
+
+case class PhysicalArrayType(elementType: DataType, containsNull: Boolean)
+extends PhysicalDataType {}
+
+case class PhysicalBinaryType() extends PhysicalDataType {}
+
+case class PhysicalBooleanType() extends PhysicalDataType {}
+
+case class PhysicalByteType() extends PhysicalDataType {}
+
+case class PhysicalCalendarIntervalType() extends PhysicalDataType {}
+
+case class PhysicalDecimalType(precision: Int, scale: Int) extends 
PhysicalDataType {}
+
+case class PhysicalDoubleType() extends PhysicalDataType {}
+
+case class PhysicalFloatType() extends PhysicalDataType {}
+
+case class PhysicalIntegerType() extends PhysicalDataType {}
+
+case class PhysicalLongType() extends PhysicalDataType {}
+
+case class PhysicalMapType(keyType: DataType, valueType: DataType, 
valueContainsNull: Boolean)
+extends PhysicalDataType {}
+
+case class PhysicalNullType() extends PhysicalDataType {}
+
+case class PhysicalObjectType(cls: Class[_]) extends PhysicalDataType {}

Review Comment:
   do we really need this? I feel it's like `UserDefinedType` and we should 
never get its physical type.






[GitHub] [spark] ulysses-you commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale

2022-11-23 Thread GitBox


ulysses-you commented on code in PR #38739:
URL: https://github.com/apache/spark/pull/38739#discussion_r1031054191


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala:
##
@@ -276,9 +276,9 @@ class DecimalPrecisionSuite extends AnalysisTest with 
BeforeAndAfter {
   val a = AttributeReference("a", DecimalType(3, -10))()
   val b = AttributeReference("b", DecimalType(1, -1))()
   val c = AttributeReference("c", DecimalType(35, 1))()
-  checkType(Multiply(a, b), DecimalType(5, -11))
-  checkType(Multiply(a, c), DecimalType(38, -9))
-  checkType(Multiply(b, c), DecimalType(37, 0))
+  checkType(Multiply(a, b), DecimalType(16, 0))

Review Comment:
   Not sure what you found. I think the reason why Divide and IntegralDivide 
fail is simple: the SQL standard does not allow a negative scale, but we use its 
definition formula to calculate the result precision and scale, so the result 
precision can be negative, which is unexpected. Other binary arithmetic 
therefore should also not follow the formula if the scale is negative.






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031054150


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.types
+
+import org.apache.spark.sql.types._
+
+sealed abstract class PhysicalDataType {}
+
+case class PhysicalArrayType(elementType: DataType, containsNull: Boolean)
+extends PhysicalDataType {}
+
+case class PhysicalBinaryType() extends PhysicalDataType {}

Review Comment:
   Should they be Scala `object`s?
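   
   A minimal sketch of that alternative (illustrative, not the PR's code):
   
   ```scala
   // Stateless physical types carry no fields, so they can be singletons:
   case object PhysicalBinaryType extends PhysicalDataType
   case object PhysicalBooleanType extends PhysicalDataType
   
   // Parameterized ones stay as case classes:
   case class PhysicalDecimalType(precision: Int, scale: Int) extends PhysicalDataType
   ```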






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031053475


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.types
+
+import org.apache.spark.sql.types._
+
+sealed abstract class PhysicalDataType {}

Review Comment:
   nit: in scala we can omit `{}` if it's empty.






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031052882


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:
##
@@ -1901,24 +1904,27 @@ object CodeGenerator extends Logging {
* Returns the Java type for a DataType.
*/
   def javaType(dt: DataType): String = dt match {
-case BooleanType => JAVA_BOOLEAN
-case ByteType => JAVA_BYTE
-case ShortType => JAVA_SHORT
-case IntegerType | DateType | _: YearMonthIntervalType => JAVA_INT
-case LongType | TimestampType | TimestampNTZType | _: DayTimeIntervalType 
=> JAVA_LONG
-case FloatType => JAVA_FLOAT
-case DoubleType => JAVA_DOUBLE
-case _: DecimalType => "Decimal"
-case BinaryType => "byte[]"
-case StringType => "UTF8String"
-case CalendarIntervalType => "CalendarInterval"
-case _: StructType => "InternalRow"
-case _: ArrayType => "ArrayData"
-case _: MapType => "MapData"
 case udt: UserDefinedType[_] => javaType(udt.sqlType)
-case ObjectType(cls) if cls.isArray => 
s"${javaType(ObjectType(cls.getComponentType))}[]"
-case ObjectType(cls) => cls.getName
-case _ => "Object"
+case _ => dt.physicalDataType match {

Review Comment:
   nit:
   ```
   dt.physicalDataType match {
 ...
   }
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031051704


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala:
##
@@ -253,13 +255,16 @@ object RowEncoder {
   }
 case _: DayTimeIntervalType => ObjectType(classOf[java.time.Duration])
 case _: YearMonthIntervalType => ObjectType(classOf[java.time.Period])
-case _: DecimalType => ObjectType(classOf[java.math.BigDecimal])
-case StringType => ObjectType(classOf[java.lang.String])
-case _: ArrayType => ObjectType(classOf[scala.collection.Seq[_]])
-case _: MapType => ObjectType(classOf[scala.collection.Map[_, _]])
-case _: StructType => ObjectType(classOf[Row])
 case p: PythonUserDefinedType => externalDataTypeFor(p.sqlType)
 case udt: UserDefinedType[_] => ObjectType(udt.userClass)
+case _ => dt.physicalDataType match {
+  case _: PhysicalArrayType => ObjectType(classOf[scala.collection.Seq[_]])
+  case _: PhysicalDecimalType => ObjectType(classOf[java.math.BigDecimal])
+  case _: PhysicalMapType => ObjectType(classOf[scala.collection.Map[_, 
_]])
+  case _: PhysicalStringType => ObjectType(classOf[java.lang.String])
+  case _: PhysicalStructType => ObjectType(classOf[Row])
+  case _ => dt

Review Comment:
   ditto, add some comments



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala:
##
@@ -358,6 +363,7 @@ object RowEncoder {
   If(IsNull(input),
 Literal.create(null, externalDataTypeFor(input.dataType)),
 CreateExternalRow(convertedFields, schema))
+case _ => input

Review Comment:
   ditto






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031051302


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala:
##
@@ -214,6 +215,7 @@ object RowEncoder {
   } else {
 nonNullOutput
   }
+case _ => inputObject

Review Comment:
   can we add a comment? `For other data types, just return the internal 
catalyst value as it is.`






[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38750:
URL: https://github.com/apache/spark/pull/38750#discussion_r1031050991


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala:
##
@@ -129,24 +130,25 @@ object InternalRow {
*/
   def getAccessor(dt: DataType, nullable: Boolean = true): 
(SpecializedGetters, Int) => Any = {
 val getValueNullSafe: (SpecializedGetters, Int) => Any = dt match {

Review Comment:
   instead of
   ```
   dt match {
 case _ => dt.physicalDataType match ...
   }
   ```
   we can just do
   ```
   dt.physicalDataType match ...
   ```






[GitHub] [spark] ibuder opened a new pull request, #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH

2022-11-23 Thread GitBox


ibuder opened a new pull request, #38782:
URL: https://github.com/apache/spark/pull/38782

   ### What changes were proposed in this pull request?
   
   This adds a test for error class FAILED_RENAME_PATH in 
QueryExecutionErrorsSuite.
   
   
   ### Why are the changes needed?
   
   @MaxGekk this addresses https://issues.apache.org/jira/browse/SPARK-38728
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   The only change is the addition of the FAILED_RENAME_PATH test.
   
   ### License
   
   This contribution is my original work. I license the work to the project 
under the project’s open source license.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38760:
URL: https://github.com/apache/spark/pull/38760#discussion_r1031049024


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala:
##
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   // have overflowed our Long; in either case we must rescale dv to the 
new scale.
   dv = dv.setScale(scale, roundMode)
   if (dv.precision > precision) {

Review Comment:
   I think this makes sense. If we write `SELECT 0bd` in Spark, the returned 
decimal is also `decimal(1, 0)`. Maybe forbidding `decimal(0, 0)` is a better 
choice. What do you think @srielau ? 
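   
   A quick way to check (a sketch, assuming a spark-shell session):
   
   ```scala
   // The zero literal is typed as decimal(1, 0), not decimal(0, 0):
   spark.sql("SELECT 0bd").schema.head.dataType  // DecimalType(1,0)
   ```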






[GitHub] [spark] Yikun commented on pull request #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs

2022-11-23 Thread GitBox


Yikun commented on PR #38780:
URL: https://github.com/apache/spark/pull/38780#issuecomment-1325947902

   @wilfred-s Would you mind setting up your GitHub Actions according to the 
note at https://github.com/apache/spark/pull/38780/checks?check_run_id=9680278988.
   
   cc @dongjoon-hyun @yangwwei 





[GitHub] [spark] yorksity opened a new pull request, #38781: [SPARK-41246][core] Solve the problem of RddId negative

2022-11-23 Thread GitBox


yorksity opened a new pull request, #38781:
URL: https://github.com/apache/spark/pull/38781

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   Solve the problem that occurs in long-running tasks, such as streaming tasks 
(a sketch of the overflow follows below).
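   
   For context, a minimal sketch of how the RDD id counter can go negative in a 
long-running application (assuming ids come from an `AtomicInteger`, as in 
`SparkContext.newRddId`; the values are illustrative):
   
   ```scala
   import java.util.concurrent.atomic.AtomicInteger
   
   val nextRddId = new AtomicInteger(Int.MaxValue)
   nextRddId.getAndIncrement()  // returns 2147483647, then the counter wraps
   nextRddId.get()              // -2147483648: the next RDD id is negative
   ```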
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] cloud-fan commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites

2022-11-23 Thread GitBox


cloud-fan commented on PR #38640:
URL: https://github.com/apache/spark/pull/38640#issuecomment-1325947062

   > Actually, I'm happy to work on making parquet v2 tables available in a 
separate ticket/PR if you can give me some guidance.
   
   I tried to do it a long time ago but failed, as there are some design issues. 
We need to fully understand the use cases of `CREATE TABLE ... USING v1Source` 
and see how to make them work for v2 sources:
   1. Just a name mapping, so that people can use a table name instead of 
providing all the data source information every time. The JDBC data source is a 
good example of this (see the sketch below).
   2. Schema cache. The data source may have a large cost to infer the data 
schema and needs the Spark catalog to cache it. The file source is a good 
example here.
   
   We also need to think about the semantics of `ALTER TABLE`, `REFRESH TABLE`, 
`df.write.mode("overwrite").saveAsTable`, etc.
   
   Some code references: for v1 sources, we have a rule `FindDataSourceTable` to 
resolve tables backed by a v1 source. For v2 sources, we should probably have a 
similar rule to resolve the v2 source to a `TableProvider`.
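   
   A sketch of use case 1 (the table name and connection options are made up):
   
   ```scala
   // The catalog entry is just a name mapping over data source options, so
   // users don't have to repeat the connection info on every query.
   spark.sql("""
     CREATE TABLE users_jdbc
     USING jdbc
     OPTIONS (url 'jdbc:postgresql:prod', dbtable 'public.users')
   """)
   ```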





[GitHub] [spark] ulysses-you commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)

2022-11-23 Thread GitBox


ulysses-you commented on code in PR #38760:
URL: https://github.com/apache/spark/pull/38760#discussion_r1031044398


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala:
##
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   // have overflowed our Long; in either case we must rescale dv to the 
new scale.
   dv = dv.setScale(scale, roundMode)
   if (dv.precision > precision) {

Review Comment:
   It's 1; see the docs for `java.math.BigDecimal`:
   https://user-images.githubusercontent.com/12025282/203693808-37715386-1fda-4134-aafc-b558ee55b7fe.png
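   
   A one-liner that reproduces what the screenshot shows (plain JVM, no Spark 
needed):
   
   ```scala
   new java.math.BigDecimal("0").precision()  // 1: per the javadoc, zero has precision 1
   ```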
   






[GitHub] [spark] amaliujia commented on pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type

2022-11-23 Thread GitBox


amaliujia commented on PR #38768:
URL: https://github.com/apache/spark/pull/38768#issuecomment-1325935616

   @zhengruifeng @grundprinzip can you take another look?





[GitHub] [spark] wilfred-s opened a new pull request, #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs

2022-11-23 Thread GitBox


wilfred-s opened a new pull request, #38780:
URL: https://github.com/apache/spark/pull/38780

   ### What changes were proposed in this pull request?
   Remove the limitations section from the K8s documentation for YuniKorn.
   
   ### Why are the changes needed?
   The limitation section is not correct. YuniKorn is fully supported from 
release 1.1.0 onwards. YuniKorn 1.1.0 is the release that is referenced in the 
documentation.
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031036283


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <configuration>
+          <shadedArtifactAttached>false</shadedArtifactAttached>
+          <shadeTestJar>true</shadeTestJar>
+          <artifactSet>
+            <includes>
+              <include>com.google.protobuf:*</include>
+            </includes>
+          </artifactSet>
+          <relocations>
+            <relocation>
+              <pattern>com.google.protobuf</pattern>
+              <shadedPattern>${spark.shade.packageName}.spark-core.protobuf</shadedPattern>
+              <includes>
+                <include>com.google.protobuf.**</include>
+              </includes>
+            </relocation>
+          </relocations>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.github.os72</groupId>
+        <artifactId>protoc-jar-maven-plugin</artifactId>
+        <version>3.11.4</version>

Review Comment:
   This is shaded, I don't think the version matters to hadoop2?






[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38760:
URL: https://github.com/apache/spark/pull/38760#discussion_r1031035938


##
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala:
##
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   // have overflowed our Long; in either case we must rescale dv to the 
new scale.
   dv = dv.setScale(scale, roundMode)
   if (dv.precision > precision) {

Review Comment:
   what is the precision of `BigDecimal` in this case?






[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38760:
URL: https://github.com/apache/spark/pull/38760#discussion_r1031035738


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##
@@ -3537,6 +3537,12 @@ class DataFrameSuite extends QueryTest
   }.isEmpty)
 }
   }
+
+  test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") {
+val df = Seq("0.5944910").toDF("a")
+checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0))
+checkAnswer(df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as 
decimal(0,0))"), Row(0))

Review Comment:
   Now `cast(a as decimal(7,7)) div 100` fails and we want to fix it.






[GitHub] [spark] amaliujia commented on a diff in pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type

2022-11-23 Thread GitBox


amaliujia commented on code in PR #38768:
URL: https://github.com/apache/spark/pull/38768#discussion_r1031035663


##
python/pyspark/sql/connect/plan.py:
##
@@ -558,29 +557,19 @@ def _repr_html_(self) -> str:
 
 
 class Aggregate(LogicalPlan):
-MeasureType = Tuple["ExpressionOrString", str]
-MeasuresType = Sequence[MeasureType]
-OptMeasuresType = Optional[MeasuresType]
-
 def __init__(
 self,
 child: Optional["LogicalPlan"],
 grouping_cols: List[Column],
-measures: OptMeasuresType,

Review Comment:
   This is a sequence now, and I think it can have len=0, which would mean empty 
measures?






[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031033790


##
project/SparkBuild.scala:
##
@@ -607,9 +607,25 @@ object SparkParallelTestGrouping {
 
 object Core {
   import scala.sys.process.Process
+  import BuildCommons.protoVersion
   def buildenv = Process(Seq("uname")).!!.trim.replaceFirst("[^A-Za-z0-9].*", 
"").toLowerCase
   def bashpath = Process(Seq("where", 
"bash")).!!.split("[\r\n]+").head.replace('\\', '/')
   lazy val settings = Seq(
+// Setting version for the protobuf compiler. This has to be propagated to 
every sub-project
+// even if the project is not using it.
+PB.protocVersion := BuildCommons.protoVersion,
+// For some reason the resolution from the imported Maven build does not 
work for some
+// of these dependencies that we need to shade later on.
+libraryDependencies ++= {
+  Seq(
+"io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion 
asProtocPlugin(),

Review Comment:
   Is the grpc plugin required?
   
   






[GitHub] [spark] ulysses-you commented on pull request #38761: [SPARK-40988][SQL][TEST] Test case for insert partition should verify value

2022-11-23 Thread GitBox


ulysses-you commented on PR #38761:
URL: https://github.com/apache/spark/pull/38761#issuecomment-1325911082

   Thank you @rangareddy, it seems something went wrong with the GitHub action. 
Can you rebase your branch to retry?





[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


LuciferYang commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031027267


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <configuration>
+          <shadedArtifactAttached>false</shadedArtifactAttached>
+          <shadeTestJar>true</shadeTestJar>
+          <artifactSet>
+            <includes>
+              <include>com.google.protobuf:*</include>
+            </includes>
+          </artifactSet>
+          <relocations>
+            <relocation>
+              <pattern>com.google.protobuf</pattern>
+              <shadedPattern>${spark.shade.packageName}.spark-core.protobuf</shadedPattern>
+              <includes>
+                <include>com.google.protobuf.**</include>
+              </includes>
+            </relocation>
+          </relocations>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.github.os72</groupId>
+        <artifactId>protoc-jar-maven-plugin</artifactId>
+        <version>3.11.4</version>

Review Comment:
   We need to check the impact on hadoop-2; although Spark 3.4 will not release 
a hadoop-2 distribution, CI is still running on it.
   
   






[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031025029


##
core/pom.xml:
##
@@ -616,6 +621,50 @@
   
 
   
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-shade-plugin</artifactId>
+        <configuration>
+          <shadedArtifactAttached>false</shadedArtifactAttached>
+          <shadeTestJar>true</shadeTestJar>
+          <artifactSet>
+            <includes>
+              <include>com.google.protobuf:*</include>
+            </includes>
+          </artifactSet>
+          <relocations>
+            <relocation>
+              <pattern>com.google.protobuf</pattern>
+              <shadedPattern>${spark.shade.packageName}.spark-core.protobuf</shadedPattern>
+              <includes>
+                <include>com.google.protobuf.**</include>
+              </includes>
+            </relocation>
+          </relocations>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.github.os72</groupId>
+        <artifactId>protoc-jar-maven-plugin</artifactId>
+        <version>3.11.4</version>

Review Comment:
   shall we pick `3.21.9` as it was used in the protobuf module?






[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38779:
URL: https://github.com/apache/spark/pull/38779#discussion_r1031025176


##
core/pom.xml:
##
@@ -532,7 +533,12 @@
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-crypto</artifactId>
     </dependency>
-
+    <dependency>
+      <groupId>com.google.protobuf</groupId>

Review Comment:
   Can we have an individual PR for this?






[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1031023071


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {
+  // Get the schema
+  Schema schema = 1;
+
+  // Is local
+  IsLocal is_local = 2;
+
+  // Is Streaming
+  IsStreaming is_streaming = 3;
+
+  // Get the explain string of the plan.
+  Explain explain = 4;
+
+  // Get the tree string of the schema.
+  TreeString tree_string = 5;
+
+  // Get the input files.
+  InputFiles input_files = 6;
+
+  // Get the semantic hash
+  SemanticHash semantic_hash = 7;
+
+  // Check whether plans are equal.

Review Comment:
   Will remove semantic_hash and same_semantics since they are developer APIs, 
although they were also exposed in PySpark.



##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {
+  // Get the schema
+  Schema schema = 1;
+
+  // Is local
+  IsLocal is_local = 2;
+
+  // Is Streaming
+  IsStreaming is_streaming = 3;
+
+  // Get the explain string of the plan.
+  Explain explain = 4;
+
+  // Get the tree string of the schema.
+  TreeString tree_string = 5;
+
+  // Get the input files.
+  InputFiles input_files = 6;
+
+  // Get the semantic hash
+  SemanticHash semantic_hash = 7;

Review Comment:
   will remove it






[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1031022749


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {
+  // Get the schema
+  Schema schema = 1;
+
+  // Is local
+  IsLocal is_local = 2;
+
+  // Is Streaming
+  IsStreaming is_streaming = 3;
+
+  // Get the explain string of the plan.
+  Explain explain = 4;
+
+  // Get the tree string of the schema.
+  TreeString tree_string = 5;
+
+  // Get the input files.
+  InputFiles input_files = 6;
+
+  // Get the semantic hash
+  SemanticHash semantic_hash = 7;
+
+  // Check whether plans are equal.
+  SameSemantics same_semantics = 8;
+}
+  }
+
+  // Analyze the input plan and return the schema.
+  message Schema { }
+
+  // Returns true if the `collect` and `take` methods can be run locally.
+  message IsLocal { }
+
+  // Returns true if this Dataset contains one or more sources that 
continuously
+  // return data as it arrives.
+  message IsStreaming { }
+
+  // Explains the input plan based on a configurable mode.
+  message Explain {
+// Plan explanation mode.
+enum ExplainMode {
+  MODE_UNSPECIFIED = 0;
+
+  // Generates only physical plan.
+  SIMPLE = 1;
+
+  // Generates the parsed logical plan, analyzed logical plan, optimized
+  // logical plan and physical plan. The parsed logical plan is an unresolved
+  // plan extracted from the query. The analyzed logical plan transforms it,
+  // translating UnresolvedAttribute and UnresolvedRelation into fully typed
+  // objects. The optimized logical plan is transformed through a set of
+  // optimization rules, resulting in the physical plan.
+  EXTENDED = 2;
+
+  // Generates code for the statement, if any and a physical plan.
+  CODEGEN = 3;
+
+  // If plan node statistics are available, generates a logical plan and 
also the statistics.
+  COST = 4;
+
+  // Generates a physical plan outline and also node details.
+  FORMATTED = 5;
+}
+
+// (Required) For analyzePlan rpc calls, configure the mode to explain 
plan in strings.
+ExplainMode explain_mode = 1;
+  }
+
+  // Generate a string to express the schema in a nice tree format.
+  // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema')
+  // to compute the results.
+  message TreeString {
+
+// (Optional) The level to generate the string.
+optional int32 level = 1;
+  }
+
+  // Returns a best-effort snapshot of the files that compose this Dataset.
+  // It will invoke 'Dataset.inputFiles' to compute the results.
+  message InputFiles { }

Review Comment:
   it had some usages anyway






[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1031021916


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -797,6 +796,137 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+@property
+def isLocal(self) -> bool:
+"""Returns ``True`` if the :func:`collect` and :func:`take` methods 
can be run locally
+(without any Spark executors).
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+bool
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+return self._session.is_local(query)
+
+@property
+def isStreaming(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` contains one or more 
sources that
+continuously return data as it arrives. A :class:`DataFrame` that 
reads data from a
+streaming source must be executed as a :class:`StreamingQuery` using 
the :func:`start`
+method in :class:`DataStreamWriter`.  Methods that return a single 
answer, (e.g.,
+:func:`count` or :func:`collect`) will throw an 
:class:`AnalysisException` when there
+is a streaming source present.
+
+.. versionadded:: 3.4.0
+
+Notes
+-
+This API is evolving.
+
+Returns
+---
+bool
+Whether it's streaming DataFrame or not.
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+return self._session.is_streaming(query)
+
+def printSchema(self) -> None:
+"""Prints out the schema in the tree format.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+None
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+print(self._session.tree_string(query))
+
+def semanticHash(self) -> int:

Review Comment:
   Oh, I did not notice that; I am fine with removing `sameSemantics` and 
`semanticHash`.






[GitHub] [spark] LuciferYang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-23 Thread GitBox


LuciferYang commented on PR #38743:
URL: https://github.com/apache/spark/pull/38743#issuecomment-1325892309

   Congratulations @WolverineJiang 





[GitHub] [spark] WolverineJiang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-23 Thread GitBox


WolverineJiang commented on PR #38743:
URL: https://github.com/apache/spark/pull/38743#issuecomment-1325891951

   Thanks @HyukjinKwon @LuciferYang @AmplabJenkins~





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1031016264


##
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##
@@ -58,6 +58,12 @@ def test_join_condition(self):
 )._plan.to_proto(self.connect)
 self.assertIsNotNone(plan.root.join.join_condition)
 
+def test_crossjoin(self):
+left_input = self.connect.readTable(table_name=self.tbl_name)
+right_input = self.connect.readTable(table_name=self.tbl_name)
+plan = 
left_input.crossJoin(other=right_input)._plan.to_proto(self.connect)
+self.assertEqual(plan.root.join.join_type, 7)  # JOIN_TYPE_CROSS
+

Review Comment:
   After this PR, `DataFrame.join` also supports `cross`; I think we can also 
add a test for it.






[GitHub] [spark] HyukjinKwon closed pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-23 Thread GitBox


HyukjinKwon closed pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support 
user configurable protoc executables when building Spark Protobuf.
URL: https://github.com/apache/spark/pull/38743





[GitHub] [spark] HyukjinKwon commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.

2022-11-23 Thread GitBox


HyukjinKwon commented on PR #38743:
URL: https://github.com/apache/spark/pull/38743#issuecomment-1325885697

   Merged to master.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/sem

2022-11-23 Thread GitBox


cloud-fan commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1031014545


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -797,6 +796,137 @@ def schema(self) -> StructType:
 else:
 return self._schema
 
+@property
+def isLocal(self) -> bool:
+"""Returns ``True`` if the :func:`collect` and :func:`take` methods 
can be run locally
+(without any Spark executors).
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+bool
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+return self._session.is_local(query)
+
+@property
+def isStreaming(self) -> bool:
+"""Returns ``True`` if this :class:`DataFrame` contains one or more 
sources that
+continuously return data as it arrives. A :class:`DataFrame` that 
reads data from a
+streaming source must be executed as a :class:`StreamingQuery` using 
the :func:`start`
+method in :class:`DataStreamWriter`.  Methods that return a single 
answer, (e.g.,
+:func:`count` or :func:`collect`) will throw an 
:class:`AnalysisException` when there
+is a streaming source present.
+
+.. versionadded:: 3.4.0
+
+Notes
+-
+This API is evolving.
+
+Returns
+---
+bool
+Whether it's streaming DataFrame or not.
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+return self._session.is_streaming(query)
+
+def printSchema(self) -> None:
+"""Prints out the schema in the tree format.
+
+.. versionadded:: 3.4.0
+
+Returns
+---
+None
+"""
+if self._plan is None:
+raise Exception("Cannot analyze on empty plan.")
+query = self._plan.to_proto(self._session)
+print(self._session.tree_string(query))
+
+def semanticHash(self) -> int:

Review Comment:
   This is a developer API in Dataset; do we really need to provide it in Spark Connect?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI

2022-11-23 Thread GitBox


dongjoon-hyun commented on code in PR #38567:
URL: https://github.com/apache/spark/pull/38567#discussion_r1031012684


##
core/src/main/scala/org/apache/spark/status/AppStatusStore.scala:
##
@@ -769,7 +772,12 @@ private[spark] object AppStatusStore {
   def createLiveStore(
   conf: SparkConf,
   appStatusSource: Option[AppStatusSource] = None): AppStatusStore = {
-val store = new ElementTrackingStore(new InMemoryStore(), conf)
+val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_))
+// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now,
+// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with

Review Comment:
   Got it. Thank you for confirming that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify the typing definitions

2022-11-23 Thread GitBox


zhengruifeng commented on PR #38757:
URL: https://github.com/apache/spark/pull/38757#issuecomment-1325877024

   merged into master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify the typing definitions

2022-11-23 Thread GitBox


zhengruifeng closed pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify 
the typing definitions
URL: https://github.com/apache/spark/pull/38757


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1031003410


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {
+  // Get the schema
+  Schema schema = 1;
+
+  // Is local
+  IsLocal is_local = 2;
+
+  // Is Streaming
+  IsStreaming is_streaming = 3;
+
+  // Get the explain string of the plan.
+  Explain explain = 4;
+
+  // Get the tree string of the schema.
+  TreeString tree_string = 5;
+
+  // Get the input files.
+  InputFiles input_files = 6;
+
+  // Get the semantic hash
+  SemanticHash semantic_hash = 7;
+
+  // Check whether plans are equal.
+  SameSemantics same_semantics = 8;
+}
+  }
+
+  // Analyze the input plan and return the schema.
+  message Schema { }
+
+  // Returns true if the `collect` and `take` methods can be run locally.
+  message IsLocal { }
+
+  // Returns true if this Dataset contains one or more sources that continuously
+  // return data as it arrives.
+  message IsStreaming { }
+
+  // Explains the input plan based on a configurable mode.
+  message Explain {
+// Plan explanation mode.
+enum ExplainMode {
+  MODE_UNSPECIFIED = 0;
+
+  // Generates only physical plan.
+  SIMPLE = 1;
+
+  // Generates parsed logical plan, analyzed logical plan, optimized logical plan and physical plan.
+  // The parsed logical plan is an unresolved plan extracted from the query. The analyzed logical
+  // plan transforms unresolvedAttribute and unresolvedRelation into fully typed objects.
+  // The optimized logical plan is transformed through a set of optimization rules, resulting in the
+  // physical plan.
+  EXTENDED = 2;
+
+  // Generates code for the statement, if any and a physical plan.
+  CODEGEN = 3;
+
+  // If plan node statistics are available, generates a logical plan and also the statistics.
+  COST = 4;
+
+  // Generates a physical plan outline and also node details.
+  FORMATTED = 5;
+}
+
+// (Required) For analyzePlan rpc calls, configure the mode to explain plan in strings.
+ExplainMode explain_mode = 1;
+  }
+
+  // Generate a string to express the schema in a nice tree format.
+  // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema')
+  // to compute the results.
+  message TreeString {
+
+// (Optional) The level to generate the string.

Review Comment:
   will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yabola commented on a diff in pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-23 Thread GitBox


yabola commented on code in PR #38560:
URL: https://github.com/apache/spark/pull/38560#discussion_r1030998613


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -327,43 +327,52 @@ class BlockManagerMasterEndpoint(
   }
 }.toSeq
 
-// Find all shuffle blocks on executors that are no longer running
-val blocksToDeleteByShuffleService =
-  new mutable.HashMap[BlockManagerId, mutable.HashSet[BlockId]]
+var removeShuffleFromShuffleServicesFutures = Seq.empty[Future[Boolean]]
 if (externalShuffleServiceRemoveShuffleEnabled) {
-  mapOutputTracker.shuffleStatuses.get(shuffleId).foreach { shuffleStatus =>
-shuffleStatus.withMapStatuses { mapStatuses =>
-  mapStatuses.foreach { mapStatus =>
-// Check if the executor has been deallocated
-if (!blockManagerIdByExecutor.contains(mapStatus.location.executorId)) {
-  val blocksToDel =
-shuffleManager.shuffleBlockResolver.getBlocksForShuffle(shuffleId, mapStatus.mapId)
-  if (blocksToDel.nonEmpty) {
-val blocks = blocksToDeleteByShuffleService.getOrElseUpdate(mapStatus.location,
-  new mutable.HashSet[BlockId])
-blocks ++= blocksToDel
+  val shuffleClient = externalBlockStoreClient.get
+  // Find all shuffle blocks on executors that are no longer running
+  val blocksToDelete = new mutable.HashMap[BlockManagerId, mutable.HashSet[BlockId]]
+  mapOutputTracker.shuffleStatuses.get(shuffleId) match {
+case Some(shuffleStatus) =>
+  shuffleStatus.withMapStatuses { mapStatuses =>
+mapStatuses.foreach { mapStatus =>
+  // Check if the executor has been deallocated
+  if (!blockManagerIdByExecutor.contains(mapStatus.location.executorId)) {
+val blocksToDel = shuffleManager.shuffleBlockResolver
+  .getBlocksForShuffle(shuffleId, mapStatus.mapId)
+if (blocksToDel.nonEmpty) {
+  val blocks = blocksToDelete.getOrElseUpdate(mapStatus.location,
+new mutable.HashSet[BlockId])
+  blocks ++= blocksToDel
+}

Review Comment:
   > What if the shuffle statuses are not exists ?

   I think it will not happen; please see the [case match code](https://github.com/apache/spark/pull/38560/files#diff-21bb40987f7d21c8d3d3de3212f388c210fb04a636c96bdef53cab6070c6aa46R336)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


[GitHub] [spark] yabola commented on a diff in pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service

2022-11-23 Thread GitBox


yabola commented on code in PR #38560:
URL: https://github.com/apache/spark/pull/38560#discussion_r1023816561


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -654,8 +731,7 @@ public MergeStatuses finalizeShuffleMerge(FinalizeShuffleMerge msg) {
   // If no blocks pushed for the finalizeShuffleMerge shuffleMergeId then return
   // empty MergeStatuses but cleanup the older shuffleMergeId files.
   submitCleanupTask(() ->
-  closeAndDeleteOutdatedPartitions(
-  appAttemptShuffleMergeId, mergePartitionsInfo.shuffleMergePartitions));
+  deleteCurrentShufflePartitions(appShuffleInfo, shuffleId, appAttemptShuffleMergeId));

Review Comment:
   I checked the `appAttemptShuffleMergeId` in the code before.
   I think if we want to delete a partition's merged data, we should also delete the corresponding ShuffleMergeId in the DB (otherwise, an inconsistency will occur when restoring shuffle info from the DB).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092

2022-11-23 Thread GitBox


panbingkun commented on PR #38710:
URL: https://github.com/apache/spark/pull/38710#issuecomment-1325855794

   > @panbingkun Please, resolve conflicts.
   
   Done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/sem

2022-11-23 Thread GitBox


amaliujia commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1030992346


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;
+
+  message AnalysisTask {
+oneof task {
+  // Get the schema
+  Schema schema = 1;
+
+  // Is local
+  IsLocal is_local = 2;
+
+  // Is Streaming
+  IsStreaming is_streaming = 3;
+
+  // Get the explain string of the plan.
+  Explain explain = 4;
+
+  // Get the tree string of the schema.
+  TreeString tree_string = 5;
+
+  // Get the input files.
+  InputFiles input_files = 6;
+
+  // Get the semantic hash
+  SemanticHash semantic_hash = 7;
+
+  // Check whether plans are equal.
+  SameSemantics same_semantics = 8;
+}
+  }
+
+  // Analyze the input plan and return the schema.
+  message Schema { }
+
+  // Returns true if the `collect` and `take` methods can be run locally.
+  message IsLocal { }
+
+  // Returns true if this Dataset contains one or more sources that continuously
+  // return data as it arrives.
+  message IsStreaming { }
+
+  // Explains the input plan based on a configurable mode.
+  message Explain {
+// Plan explanation mode.
+enum ExplainMode {
+  MODE_UNSPECIFIED = 0;
+
+  // Generates only physical plan.
+  SIMPLE = 1;
+
+  // Generates parsed logical plan, analyzed logical plan, optimized logical plan and physical plan.
+  // The parsed logical plan is an unresolved plan extracted from the query. The analyzed logical
+  // plan transforms unresolvedAttribute and unresolvedRelation into fully typed objects.
+  // The optimized logical plan is transformed through a set of optimization rules, resulting in the
+  // physical plan.
+  EXTENDED = 2;
+
+  // Generates code for the statement, if any and a physical plan.
+  CODEGEN = 3;
+
+  // If plan node statistics are available, generates a logical plan and also the statistics.
+  COST = 4;
+
+  // Generates a physical plan outline and also node details.
+  FORMATTED = 5;
+}
+
+// (Required) For analyzePlan rpc calls, configure the mode to explain plan in strings.
+ExplainMode explain_mode = 1;
+  }
+
+  // Generate a string to express the schema in a nice tree format.
+  // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema')
+  // to compute the results.
+  message TreeString {
+
+// (Optional) The level to generate the string.

Review Comment:
   Document what is the default value?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/

2022-11-23 Thread GitBox


zhengruifeng commented on code in PR #38742:
URL: https://github.com/apache/spark/pull/38742#discussion_r1030992308


##
connector/connect/src/main/protobuf/spark/connect/base.proto:
##
@@ -100,18 +70,138 @@ message AnalyzePlanRequest {
   // logging purposes and will not be interpreted by the server.
   optional string client_type = 4;
 
-  // (Optional) Get the explain string of the plan.
-  Explain explain = 5;
+  repeated AnalysisTask tasks = 5;

Review Comment:
   Multiple analysis tasks are for this case: the user can get all the attributes in a single RPC and then cache them for reuse; see the sketch below.
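
   A minimal Python sketch of such a batched request, assuming the generated proto module is importable as `pyspark.sql.connect.proto.base_pb2` (the field names follow the proto quoted above; the module path and client wiring are assumptions, not the final API):

       # Hypothetical sketch: pack several analysis tasks into one request.
       from pyspark.sql.connect.proto import base_pb2 as pb

       req = pb.AnalyzePlanRequest()
       req.tasks.add().schema.SetInParent()        # task 1: fetch the schema
       req.tasks.add().is_local.SetInParent()      # task 2: isLocal
       req.tasks.add().is_streaming.SetInParent()  # task 3: isStreaming
       # A single AnalyzePlan RPC now answers all three, and the client can
       # cache the results instead of making three round trips.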



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #38777: [SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent

2022-11-23 Thread GitBox


HeartSaVioR commented on PR #38777:
URL: https://github.com/apache/spark/pull/38777#issuecomment-1325844004

   Ah OK, let's wait for feedback from @ala and make sure it's clear before merging it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yaohua628 commented on pull request #38777: [SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent

2022-11-23 Thread GitBox


Yaohua628 commented on PR #38777:
URL: https://github.com/apache/spark/pull/38777#issuecomment-1325843243

   Thank you, Jungtaek! I also want to confirm with @ala on the nullability of `row_index`.
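
   For context, the nullability in question can be inspected directly from the schema of the hidden `_metadata` column, e.g. (a rough sketch; the path is hypothetical and `row_index` availability depends on the file source):

       # Inspect the nullable flag of each built-in _metadata field.
       df = spark.read.format("parquet").load("/tmp/data")  # hypothetical path
       meta = df.select("_metadata").schema["_metadata"].dataType
       for f in meta.fields:
           print(f.name, f.nullable)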


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #38770: [SPARK-41238][CONNECT][PYTHON] Support more datatypes

2022-11-23 Thread GitBox


amaliujia commented on PR #38770:
URL: https://github.com/apache/spark/pull/38770#issuecomment-1325840723

   Thanks.

   Can you test both the `nullable=true` and `nullable=false` cases?
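
   For reference, a rough sketch of the shape of such a check, assuming a Spark Connect session is available as `spark` (illustrative only; not the PR's actual test helpers):

       from pyspark.sql.types import StructType, StructField, StringType

       # Round-trip a schema and verify the nullable flag survives the
       # proto conversion in both settings.
       for nullable in (True, False):
           schema = StructType([StructField("v", StringType(), nullable=nullable)])
           df = spark.createDataFrame([("a",)], schema)
           assert df.schema["v"].nullable == nullable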


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the name from injectPlanNormalizationRules to injectPlanNormalizationRule

2022-11-23 Thread GitBox


HyukjinKwon closed pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the 
name from injectPlanNormalizationRules to injectPlanNormalizationRule
URL: https://github.com/apache/spark/pull/38767


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the name from injectPlanNormalizationRules to injectPlanNormalizationRule

2022-11-23 Thread GitBox


HyukjinKwon commented on PR #38767:
URL: https://github.com/apache/spark/pull/38767#issuecomment-1325839678

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


