Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-03-10 Thread via GitHub


panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519100897


##
python/pyspark/sql/functions/builtin.py:
##
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
 |      2,Alice|
 +-------------+
 
-Example 2: Converting a complex StructType to a CSV string
-
->>> from pyspark.sql import Row, functions as sf
->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
->>> df = spark.createDataFrame(data, ("key", "value"))
->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-+-----------------------+
-|to_csv(value)          |
-+-----------------------+
-|2,Alice,"[100,200,300]"|

Review Comment:
   Let me update it.






Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-03-10 Thread via GitHub


LuciferYang commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519047021


##
python/pyspark/sql/functions/builtin.py:
##
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
 |      2,Alice|
 +-------------+
 
-Example 2: Converting a complex StructType to a CSV string
-
->>> from pyspark.sql import Row, functions as sf
->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
->>> df = spark.createDataFrame(data, ("key", "value"))
->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-+-----------------------+
-|to_csv(value)          |
-+-----------------------+
-|2,Alice,"[100,200,300]"|

Review Comment:
   +1






Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-03-10 Thread via GitHub


MaxGekk commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1518964720


##
python/pyspark/sql/functions/builtin.py:
##
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
 |      2,Alice|
 +-------------+
 
-Example 2: Converting a complex StructType to a CSV string
-
->>> from pyspark.sql import Row, functions as sf
->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
->>> df = spark.createDataFrame(data, ("key", "value"))
->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-+-----------------------+
-|to_csv(value)          |
-+-----------------------+
-|2,Alice,"[100,200,300]"|

Review Comment:
   This is in fact a breaking change. We need to at least update the SQL migration guide.
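
For readers following along, here is a minimal spark-shell sketch of the behavior in question, adapted from the removed Python doctest above. This is not code from the PR; the exact error class is whatever the PR finally introduces (the diff above suggests a `DATATYPE_MISMATCH` subclass).

```scala
// Before this change, to_csv on a struct containing an array field silently
// produced a CSV string; after it, the same query is expected to fail analysis
// with a DATATYPE_MISMATCH-style error instead.
import org.apache.spark.sql.functions.{struct, to_csv}
import spark.implicits._  // assumes a spark-shell session where `spark` exists

val df = Seq((2, "Alice", Seq(100, 200, 300))).toDF("age", "name", "scores")
// Previously printed something like: 2,Alice,"[100,200,300]"
// Now expected to raise an AnalysisException at analysis time.
df.select(to_csv(struct($"age", $"name", $"scores"))).show(truncate = false)
```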



##
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##
@@ -294,10 +294,19 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
   }
 
   test("to_csv with option (nullValue)") {

Review Comment:
   Could you explain, please, why you changed the test?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##
@@ -260,16 +263,36 @@ case class StructsToCsv(
   child = child,
   timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> ordinalNumber(0),
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType | BinaryType => false

Review Comment:
   Could you clarify this? Why don't you support these types?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##
@@ -260,16 +263,36 @@ case class StructsToCsv(
   child = child,
   timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> ordinalNumber(0),
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {

Review Comment:
   ```suggestion
  private def isSupportedDataType(dataType: DataType): Boolean = dataType match {
   ```
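
Applying the suggested rename, the helper might read roughly as below. This is only a sketch for discussion: apart from the quoted `VariantType | BinaryType` case, the other cases (rejecting nested complex types, unwrapping user-defined types) are assumptions about the PR's intent, not its actual code.

```scala
import scala.annotation.tailrec
import org.apache.spark.sql.types._

object CsvTypeCheckSketch {
  // Hypothetical standalone version of the helper; the real one lives in
  // StructsToCsv and its exact allow-list may differ from this sketch.
  @tailrec
  def isSupportedDataType(dataType: DataType): Boolean = dataType match {
    case _: VariantType | BinaryType => false                 // quoted from the diff above
    case _: ArrayType | _: MapType | _: StructType => false   // assumed: complex types rejected
    case udt: UserDefinedType[_] => isSupportedDataType(udt.sqlType)  // assumed: unwrap UDTs
    case _ => true                                            // remaining atomic types
  }
}
```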






Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-03-07 Thread via GitHub


panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1985106638

   Friendly ping @HyukjinKwon,
   When you have time, could you please continue reviewing this PR?





Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-02-23 Thread via GitHub


panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1961086132

   > Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4?

   Yes, I think we need to backport it to `branch-3.5/3.4`.





Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-02-22 Thread via GitHub


LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1960689067

   Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4?
   
   





Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-02-22 Thread via GitHub


panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1960618407

   > Would you please add `[PYTHON]` to the title, considering it has a user-facing Python change? Thanks!
   
   Sure, done. 





Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

2024-02-22 Thread via GitHub


panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1500137200


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##
@@ -260,16 +263,43 @@ case class StructsToCsv(
   child = child,
   timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> "1",
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType => false

Review Comment:
   Done.


