Maciej Szymkiewicz created SPARK-33415:
------------------------------------------

             Summary: Column.__repr__ shouldn't encode JVM response
                 Key: SPARK-33415
                 URL: https://issues.apache.org/jira/browse/SPARK-33415
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.1.0
            Reporter: Maciej Szymkiewicz


At the moment, PySpark {{Column}} encodes the JVM response in its {{__repr__}} 
method.

As a result, column names that use only ASCII characters get a {{b}} prefix:

{code:python}
>>> from pyspark.sql.functions import col
>>> col("abc")
Column<b'abc'>
{code}

and all other names are shown as ugly escaped byte strings:

{code:python}
>>> col("wąż")
Column<b'w\xc4\x85\xc5\xbc'>
{code}
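
The effect can be reproduced without Spark. The sketch below assumes that the JVM's string form of the column reaches Python as a plain {{str}} and is encoded to UTF-8 before being formatted. {{encoded_repr}} is a hypothetical helper used only for illustration; it is not part of the PySpark API.

```python
def encoded_repr(jvm_string: str) -> str:
    # Assumption: mimics the reported behaviour by encoding the text
    # to UTF-8 bytes before interpolating it into the repr.
    # In Python 3, "%s" applied to bytes yields the bytes literal form,
    # which is where the b'' prefix and the \x escapes come from.
    return "Column<%s>" % jvm_string.encode("utf8")


print(encoded_repr("abc"))  # ASCII-only name gets the b prefix
print(encoded_repr("wąż"))  # non-ASCII name is shown as escaped bytes
```

Under Python 3, the two calls print the same representations as the sessions above.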

This behaviour is inconsistent with other parts of the API, for example:

{code:python}
>>> spark.createDataFrame([], "`wąż` long")
DataFrame[wąż: bigint]
{code}

and Scala

{code:scala}
scala> col("wąż")
res0: org.apache.spark.sql.Column = wąż
{code}

and R

{code:r}
> column("wąż")
Column wąż 
{code}

Encoding was originally introduced in SPARK-5859, but it does not seem 
to be required.

Desired behaviour:

{code:python}
>>> col("wąż")
Column<'wąż'>
{code}

or

{code:python}
>>> col("wąż")
Column<wąż>
{code}
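
Both desired forms follow from formatting the JVM string directly, without encoding it. A minimal sketch, again assuming the JVM response is already a Python {{str}}; {{quoted_repr}} and {{plain_repr}} are hypothetical names for illustration, not existing PySpark functions.

```python
def quoted_repr(jvm_string: str) -> str:
    # First desired form: keep the text as str and wrap it in quotes.
    return "Column<'%s'>" % jvm_string


def plain_repr(jvm_string: str) -> str:
    # Second desired form: keep the text as str, with no quotes.
    return "Column<%s>" % jvm_string


print(quoted_repr("wąż"))
print(plain_repr("wąż"))
```

The quoted form keeps a visible delimiter around the column expression, while the unquoted form is closer to the Scala and R output shown earlier.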



