Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-06 Thread via GitHub


HyukjinKwon closed pull request #44177: [SPARK-46260][PYTHON][SQL] 
`DataFrame.withColumnsRenamed` should respect the dict ordering
URL: https://github.com/apache/spark/pull/44177


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-06 Thread via GitHub


HyukjinKwon commented on PR #44177:
URL: https://github.com/apache/spark/pull/44177#issuecomment-1842389525

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-05 Thread via GitHub


dongjoon-hyun commented on PR #44177:
URL: https://github.com/apache/spark/pull/44177#issuecomment-1840220510

   To @zhengruifeng , I believe we can proceed with the AS-IS status if this is 
only for Apache Spark 4.0.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


zhengruifeng commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414903099


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##
@@ -987,6 +987,12 @@ class DataFrameSuite extends QueryTest
   parameters = Map("columnName" -> "`age`"))
   }
 
+  test("SPARK-46260: withColumnsRenamed should respect the Map ordering") {
+val df = spark.range(10).toDF()
+assert(df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")).columns === 
Array("b"))
+assert(df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")).columns === 
Array("a"))
+  }

Review Comment:
   hmm, to avoid such incompatibility issues, I think I need to use `ListMap` 
instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


dongjoon-hyun commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414902050


##
python/pyspark/sql/tests/connect/test_parity_dataframe.py:
##
@@ -77,6 +77,11 @@ def test_to_pandas_from_mixed_dataframe(self):
 def test_toDF_with_string(self):
 super().test_toDF_with_string()
 
+# TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict 
ordering

Review Comment:
   Could you update this comment too?
   ```
   - # TODO(SPARK-46261): DataFrame.withColumnsRenamed should respect the dict 
ordering
   + # TODO(SPARK-46261): Python Client DataFrame.withColumnsRenamed should 
respect the dict ordering
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


dongjoon-hyun commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414901137


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:
##
@@ -987,6 +987,12 @@ class DataFrameSuite extends QueryTest
   parameters = Map("columnName" -> "`age`"))
   }
 
+  test("SPARK-46260: withColumnsRenamed should respect the Map ordering") {
+val df = spark.range(10).toDF()
+assert(df.withColumnsRenamed(Map("id" -> "a", "a" -> "b")).columns === 
Array("b"))
+assert(df.withColumnsRenamed(Map("a" -> "b", "id" -> "a")).columns === 
Array("a"))
+  }

Review Comment:
   Thank you for this addition. Actually, `Map` has many incompatibility issues 
across multiple Scala versions. For example, Scala 2.11/2.12/2.13.
   
   For now, since we have only Scala 2.13, I guess the behavior was consistent 
on master branch. And, this PR will help it more.
   
   Also, cc @LuciferYang , too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


zhengruifeng commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414898203


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql](
*/
   @throws[AnalysisException]
   def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin 
{
+val (colNames, newColNames) = colsMap.toSeq.unzip
+withColumnsRenamed(colNames, newColNames)
+  }
+
+  private def withColumnsRenamed(
+colNames: Seq[String],
+newColNames: Seq[String]): DataFrame = withOrigin {
+require(colNames.size == newColNames.size,
+  s"The size of existing column names: ${colNames.size} isn't equal to " +
+s"the size of new column names: ${newColNames.size}")
+
 val resolver = sparkSession.sessionState.analyzer.resolver
 val output: Seq[NamedExpression] = queryExecution.analyzed.output
 
-val projectList = colsMap.foldLeft(output) {
+val projectList = colNames.zip(newColNames).foldLeft(output) {

Review Comment:
   this is a python issue.
   to make sure the scala side is not changed, I add a scala test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


dongjoon-hyun commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414890990


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql](
*/
   @throws[AnalysisException]
   def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin 
{
+val (colNames, newColNames) = colsMap.toSeq.unzip
+withColumnsRenamed(colNames, newColNames)
+  }
+
+  private def withColumnsRenamed(
+colNames: Seq[String],
+newColNames: Seq[String]): DataFrame = withOrigin {
+require(colNames.size == newColNames.size,
+  s"The size of existing column names: ${colNames.size} isn't equal to " +
+s"the size of new column names: ${newColNames.size}")
+
 val resolver = sparkSession.sessionState.analyzer.resolver
 val output: Seq[NamedExpression] = queryExecution.analyzed.output
 
-val projectList = colsMap.foldLeft(output) {
+val projectList = colNames.zip(newColNames).foldLeft(output) {

Review Comment:
   We believe we need a Scala test case for this change, @zhengruifeng , 
because the PR description claims that Scala code ordering matters.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46260][PYTHON][SQL] `DataFrame.withColumnsRenamed` should respect the dict ordering [spark]

2023-12-04 Thread via GitHub


dongjoon-hyun commented on code in PR #44177:
URL: https://github.com/apache/spark/pull/44177#discussion_r1414891855


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2922,18 +2922,29 @@ class Dataset[T] private[sql](
*/
   @throws[AnalysisException]
   def withColumnsRenamed(colsMap: Map[String, String]): DataFrame = withOrigin 
{
+val (colNames, newColNames) = colsMap.toSeq.unzip
+withColumnsRenamed(colNames, newColNames)
+  }
+
+  private def withColumnsRenamed(
+colNames: Seq[String],
+newColNames: Seq[String]): DataFrame = withOrigin {
+require(colNames.size == newColNames.size,
+  s"The size of existing column names: ${colNames.size} isn't equal to " +
+s"the size of new column names: ${newColNames.size}")
+
 val resolver = sparkSession.sessionState.analyzer.resolver
 val output: Seq[NamedExpression] = queryExecution.analyzed.output
 
-val projectList = colsMap.foldLeft(output) {
+val projectList = colNames.zip(newColNames).foldLeft(output) {

Review Comment:
   Could you add a new simple test case which fails without this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org