[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    # Putting fields in alphabetical order masks the issue
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

was:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should
> be accessed by field name, not by position because {{Row.__new__}} sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     # Putting fields in alphabetical order masks the issue
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
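The mismatch described in the report can be illustrated without a Spark cluster. What follows is a minimal sketch under stated assumptions, not pyspark's actual code: `SortedRow` is a hypothetical stand-in that, like the `Row(**kwargs)` behaviour described above, stores its values in alphabetical field-name order. The names `schema_order`, `by_position`, and `by_name` are also made up for illustration. The sketch compares pairing a row's values with the schema by position against pairing them by name.

```python
# Hypothetical stand-in for the Row(**kwargs) behaviour described in the
# report. This is NOT pyspark's implementation, only an illustration.
class SortedRow(tuple):
    def __new__(cls, **kwargs):
        # Field names are sorted alphabetically, as the report says
        # Row.__new__ does for keyword arguments.
        names = sorted(kwargs)
        row = super().__new__(cls, (kwargs[name] for name in names))
        row.field_names = names
        return row

    def as_dict(self):
        # Name-based view of the row, independent of storage order.
        return dict(zip(self.field_names, self))


# Field order of the schema in the report: a, c, b (not alphabetical).
schema_order = ["a", "c", "b"]

row = SortedRow(a="a", c=3.0, b=2)

# Values are stored in alphabetical order of their names: a, b, c.
stored = tuple(row)

# Pairing by position assumes the row is already in schema order.
by_position = dict(zip(schema_order, row))

# Pairing by name looks each schema field up explicitly.
by_name = {name: row.as_dict()[name] for name in schema_order}

print(stored)       # ('a', 2, 3.0)
print(by_position)  # {'a': 'a', 'c': 2, 'b': 3.0}
print(by_name)      # {'a': 'a', 'c': 3.0, 'b': 2}
```

In this sketch the positional pairing hands the integer to field `c` (declared `FloatType` in the report's schema) and the float to field `b` (declared `IntegerType`), which is consistent with the kind of mismatch the report describes. The name-based pairing keeps each value with its own field.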
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

was:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should
> be accessed by field name, not by position because {{Row.__new__}} sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {code:python}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

was:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should
> be accessed by field name, not by position because {{Row.__new__}} sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

was:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should
> be accessed by field name, not by position because {{Row.__new__}} sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

was:

bq. The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should
> be accessed by field name, not by position because `Row.__new__` sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}

was:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)}}

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be
> accessed by field name, not by position because `Row.__new__` sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

{{from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)}}

was:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be
> accessed by field name, not by position because `Row.__new__` sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> {{from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)}}
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22232:

Description:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```

was:

The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be accessed by field name, not by position because `Row.__new__` sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```

> Row objects in pyspark using the `Row(**kwars)` syntax do not get
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be
> accessed by field name, not by position because `Row.__new__` sorts the
> fields alphabetically by name. It seems like this promise is not being
> honored when these Row objects are shuffled. I've included an example to help
> reproduce the issue.
> ```
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
> schema = StructType([
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> ```