[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwargs)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict of keyword arguments (i.e., 
{{Row(**kwargs)}}) should be accessed by field name, not by position, because 
{{Row.__new__}} sorts the fields alphabetically by name. It seems this promise 
is not honored when these Row objects are shuffled. I've included an example 
below to help reproduce the issue.



{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  # Putting the fields in alphabetical order masks the issue
  StructField("a", StringType(), False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

# `sc` is the SparkContext provided by the pyspark shell
rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle, things work fine.
print(rdd.toDF(schema).take(2))

# If we introduce a shuffle, we have issues.
print(rdd.repartition(3).toDF(schema).take(2))
{code}
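
To make the ordering problem concrete, here is a minimal sketch of the field 
sorting itself (an illustration added for clarity, assuming Spark 2.x behaviour 
where {{Row(**kwargs)}} sorts its fields alphabetically; no SparkContext is 
needed for this part):

{code:python}
from pyspark.sql import Row

# Fields passed as keyword arguments come back sorted by name: a, b, c
r = Row(a="a", c=3.0, b=2)
print(r)       # Row(a='a', b=2, c=3.0)
print(r[1])    # 2   -- position 1 holds field "b", not "c"
print(r[2])    # 3.0 -- position 2 holds field "c", not "b"
{code}

Because the schema above lists the fields in the order a, c, b, any code path 
that matches row values to schema fields by position (which appears to happen 
after a shuffle) would pair the integer with the FloatType field and the float 
with the IntegerType field.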


  was:
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}



> Row objects in pyspark using the `Row(**kwargs)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}
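
Two possible workarounds consistent with the behaviour described above (not 
part of the original report; a sketch assuming Spark 2.x field sorting):

{code:python}
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, FloatType)

# Option 1: declare the schema fields in alphabetical order so the positional
# layout matches the sorted field order produced by Row(**kwargs).
schema = StructType([
    StructField("a", StringType(), False),
    StructField("b", IntegerType(), False),
    StructField("c", FloatType(), False),
])

# Option 2: avoid kwargs and fix the field order explicitly with a Row "class".
MyRow = Row("a", "c", "b")   # field order is preserved as given
row = MyRow("a", 3.0, 2)     # values line up with a, c, b
{code}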



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


