Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
panbingkun commented on code in PR #44665: URL: https://github.com/apache/spark/pull/44665#discussion_r1519100897 ## python/pyspark/sql/functions/builtin.py: ## @@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col | 2,Alice| +-+ -Example 2: Converting a complex StructType to a CSV string - ->>> from pyspark.sql import Row, functions as sf ->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP -+---+ -|to_csv(value) | -+---+ -|2,Alice,"[100,200,300]"| Review Comment: Let me update. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
LuciferYang commented on code in PR #44665: URL: https://github.com/apache/spark/pull/44665#discussion_r1519047021 ## python/pyspark/sql/functions/builtin.py: ## @@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col | 2,Alice| +-+ -Example 2: Converting a complex StructType to a CSV string - ->>> from pyspark.sql import Row, functions as sf ->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP -+---+ -|to_csv(value) | -+---+ -|2,Alice,"[100,200,300]"| Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
MaxGekk commented on code in PR #44665: URL: https://github.com/apache/spark/pull/44665#discussion_r1518964720 ## python/pyspark/sql/functions/builtin.py: ## @@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col | 2,Alice| +-+ -Example 2: Converting a complex StructType to a CSV string - ->>> from pyspark.sql import Row, functions as sf ->>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))] ->>> df = spark.createDataFrame(data, ("key", "value")) ->>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP -+---+ -|to_csv(value) | -+---+ -|2,Alice,"[100,200,300]"| Review Comment: This is a breaking change in fact. We need to update the SQL migration guide at least. ## sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala: ## @@ -294,10 +294,19 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession { } test("to_csv with option (nullValue)") { Review Comment: Could you explain, please, why did you change the test? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala: ## @@ -260,16 +263,36 @@ case class StructsToCsv( child = child, timeZoneId = None) + override def checkInputDataTypes(): TypeCheckResult = { +child.dataType match { + case schema: StructType if schema.map(_.dataType).forall( +dt => supportDataType(dt)) => TypeCheckSuccess + case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE") + case _ => DataTypeMismatch( +errorSubClass = "UNEXPECTED_INPUT_TYPE", +messageParameters = Map( + "paramIndex" -> ordinalNumber(0), + "requiredType" -> toSQLType(StructType), + "inputSql" -> toSQLExpr(child), + "inputType" -> toSQLType(child.dataType) +) + ) +} + } + + @tailrec + private def supportDataType(dataType: DataType): Boolean = dataType match { +case _: VariantType | BinaryType => false Review Comment: Could you clarify this? Why don't you support the types? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala: ## @@ -260,16 +263,36 @@ case class StructsToCsv( child = child, timeZoneId = None) + override def checkInputDataTypes(): TypeCheckResult = { +child.dataType match { + case schema: StructType if schema.map(_.dataType).forall( +dt => supportDataType(dt)) => TypeCheckSuccess + case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE") + case _ => DataTypeMismatch( +errorSubClass = "UNEXPECTED_INPUT_TYPE", +messageParameters = Map( + "paramIndex" -> ordinalNumber(0), + "requiredType" -> toSQLType(StructType), + "inputSql" -> toSQLExpr(child), + "inputType" -> toSQLType(child.dataType) +) + ) +} + } + + @tailrec + private def supportDataType(dataType: DataType): Boolean = dataType match { Review Comment: ```suggestion private def isSupportedDataType(dataType: DataType): Boolean = dataType match { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
panbingkun commented on PR #44665: URL: https://github.com/apache/spark/pull/44665#issuecomment-1985106638 friendly ping @HyukjinKwon, When you are not busy, can you please continue to help review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
panbingkun commented on PR #44665: URL: https://github.com/apache/spark/pull/44665#issuecomment-1961086132 > Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4 Yea, I think we need to backport it to `branch-3.5/3.4`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
LuciferYang commented on PR #44665: URL: https://github.com/apache/spark/pull/44665#issuecomment-1960689067 Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
panbingkun commented on PR #44665: URL: https://github.com/apache/spark/pull/44665#issuecomment-1960618407 > Would you please add `[PYTHON]` to title considering it has user-facing python change? Thanks! Sure, done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]
panbingkun commented on code in PR #44665: URL: https://github.com/apache/spark/pull/44665#discussion_r1500137200 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala: ## @@ -260,16 +263,43 @@ case class StructsToCsv( child = child, timeZoneId = None) + override def checkInputDataTypes(): TypeCheckResult = { +child.dataType match { + case schema: StructType if schema.map(_.dataType).forall( +dt => supportDataType(dt)) => TypeCheckSuccess + case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE") + case _ => DataTypeMismatch( +errorSubClass = "UNEXPECTED_INPUT_TYPE", +messageParameters = Map( + "paramIndex" -> "1", + "requiredType" -> toSQLType(StructType), + "inputSql" -> toSQLExpr(child), + "inputType" -> toSQLType(child.dataType) +) + ) +} + } + + @tailrec + private def supportDataType(dataType: DataType): Boolean = dataType match { +case _: VariantType => false Review Comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org