David Lyness created SPARK-26200:
------------------------------------

             Summary: Column values are incorrectly transposed when a field in a PySpark Row requires serialization
                 Key: SPARK-26200
                 URL: https://issues.apache.org/jira/browse/SPARK-26200
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0
         Environment: {noformat}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch
Compiled by user on 2018-10-29T06:22:05Z{noformat}
The same issue is observed when PySpark is run on both macOS 10.13.6 and CentOS 7, so this appears to be a cross-platform issue.
            Reporter: David Lyness


h2. Description of issue

Whenever a field in a PySpark {{Row}} requires serialization (such as a {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below assigns column values *in alphabetical order*, rather than assigning each column value to its specified column.

h3. Code to reproduce:

{code:java}
import datetime

from pyspark.sql import Row
from pyspark.sql.session import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("date_column", DateType()),
    StructField("my_b_column", StringType()),
    StructField("my_a_column", StringType()),
])
spark.createDataFrame([Row(
    date_column=datetime.date.today(),
    my_b_column="my_b_value",
    my_a_column="my_a_value"
)], schema).show()
{code}

h3. Expected result:

{noformat}
+-----------+-----------+-----------+
|date_column|my_b_column|my_a_column|
+-----------+-----------+-----------+
| 2018-11-28| my_b_value| my_a_value|
+-----------+-----------+-----------+{noformat}

h3. Actual result:

{noformat}
+-----------+-----------+-----------+
|date_column|my_b_column|my_a_column|
+-----------+-----------+-----------+
| 2018-11-28| my_a_value| my_b_value|
+-----------+-----------+-----------+{noformat}

(Note that {{my_a_value}} and {{my_b_value}} are transposed.)

h2. Analysis of issue

Reviewing [the relevant code on GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], there are two relevant conditional blocks:

{code:java}
if self._needSerializeAnyField:
    # Block 1, does not work correctly
else:
    # Block 2, works correctly
{code}

{{Row}} is implemented as both a tuple of alphabetically sorted column values and a dictionary of named columns (a standalone demonstration of this reordering appears at the end of this report). In Block 2, there is a conditional that specifically handles serializing a {{Row}} object:

{code:java}
elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
    return tuple(obj[n] for n in self.names)
{code}

There is no such condition in Block 1, so we fall into this branch instead:

{code:java}
elif isinstance(obj, (tuple, list)):
    return tuple(f.toInternal(v) if c else v
                 for f, v, c in zip(self.fields, obj, self._needConversion))
{code}

The behaviour of the {{zip}} call is wrong here, since {{obj}} (the {{Row}}) returns its values in a different order than the schema declares its fields. So the zipped (field, value, needConversion) triples end up as:

{code:java}
(date, date, True), (b, a, False), (a, b, False)
{code}

h2. Workarounds

Correct behaviour is observed if you use a Python {{list}} or {{dict}} instead of PySpark's {{Row}} object:

{code:java}
# Using a list works
spark.createDataFrame([[
    datetime.date.today(),
    "my_b_value",
    "my_a_value"
]], schema)

# Using a dict also works
spark.createDataFrame([{
    "date_column": datetime.date.today(),
    "my_b_column": "my_b_value",
    "my_a_column": "my_a_value"
}], schema)
{code}

Correct behaviour is also observed if no field requires serialization; in this example, changing {{date_column}} to {{StringType}} avoids the correctness issue.
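To make the analysis above concrete, the alphabetical reordering can be reproduced on the {{Row}} object alone, without creating a DataFrame. This is a minimal sketch against the 2.4.0 {{Row}} API ({{__fields__}} is the attribute {{Row}} uses to track its column names):

{code:java}
from pyspark.sql import Row

# Row sorts keyword arguments alphabetically, so its positional (tuple)
# order differs from the order in which the schema declares the columns.
row = Row(date_column="2018-11-28", my_b_column="my_b_value", my_a_column="my_a_value")

print(row.__fields__)  # ['date_column', 'my_a_column', 'my_b_column']
print(tuple(row))      # ('2018-11-28', 'my_a_value', 'my_b_value')

# Block 1 zips the schema-ordered fields against this alphabetically
# ordered tuple, producing exactly the transposition shown above:
schema_names = ["date_column", "my_b_column", "my_a_column"]
print(list(zip(schema_names, row)))
# [('date_column', '2018-11-28'), ('my_b_column', 'my_a_value'), ('my_a_column', 'my_b_value')]
{code}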
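Finally, one possible direction for a fix (an untested sketch, not a proposed patch) would be to mirror Block 2's {{Row}}-aware branch in Block 1, so that each value is looked up by schema field name before the per-field conversion is applied:

{code:java}
elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
    # Look values up by schema field name (Row supports string indexing,
    # as Block 2 already relies on), then apply the per-field conversion,
    # instead of trusting the Row's alphabetical positional order.
    return tuple(f.toInternal(obj[n]) if c else obj[n]
                 for n, f, c in zip(self.names, self.fields, self._needConversion))
{code}

Any real fix would also need to preserve the existing behaviour for plain tuples and lists, whose positional order already matches the schema.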