Maciej Szymkiewicz created SPARK-33415:
------------------------------------------

             Summary: Column.__repr__ shouldn't encode JVM response
                 Key: SPARK-33415
                 URL: https://issues.apache.org/jira/browse/SPARK-33415
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.1.0
            Reporter: Maciej Szymkiewicz


At the moment, PySpark {{Column}} {{encodes}} the JVM response in its {{__repr__}} method. As a result, column names that use only ASCII characters get a {{b}} prefix

{code:python}
>>> from pyspark.sql.functions import col
>>> col("abc")
Column<b'abc'>
{code}

and names containing non-ASCII characters are rendered as ugly escaped byte strings

{code:python}
>>> col("wąż")
Column<b'w\xc4\x85\xc5\xbc'>
{code}

This behaviour is inconsistent with other parts of the API, for example:

{code:python}
>>> spark.createDataFrame([], "`wąż` long")
DataFrame[wąż: bigint]
{code}

and Scala

{code:scala}
scala> col("wąż")
res0: org.apache.spark.sql.Column = wąż
{code}

and R

{code:r}
> column("wąż")
Column wąż
{code}

The encoding was originally introduced with SPARK-5859, but it doesn't seem to be required.

Desired behaviour

{code:python}
>>> col("wąż")
Column<'wąż'>
{code}

or

{code:python}
>>> col("wąż")
Column<wąż>
{code}
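
The effect described in this issue can be reproduced without Spark. The following is a minimal sketch in plain Python 3, assuming the string returned by the JVM is UTF-8 encoded before being formatted into the representation. The helper names {{encoded_repr}} and {{plain_repr}} are hypothetical and are not part of PySpark; they only illustrate why encoding yields the {{b'...'}} output and what the first desired form would look like.

```python
def encoded_repr(name: str) -> str:
    # Hypothetical stand-in for the reported behaviour: the name is
    # encoded to bytes first, so %s formats the bytes object's own
    # representation, including the b prefix and \x escapes.
    return "Column<%s>" % name.encode("utf8")


def plain_repr(name: str) -> str:
    # Hypothetical stand-in for the first desired form: the str is
    # formatted directly, so non-ASCII characters stay readable.
    return "Column<'%s'>" % name


for column_name in ("abc", "wąż"):
    print(encoded_repr(column_name))
    print(plain_repr(column_name))
```

Under Python 3, formatting a {{bytes}} value with {{%s}} falls back to its {{repr}}, which is where the prefix and the escape sequences come from; skipping the encode step avoids both.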