[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-05-11 Thread BryanCutler
Github user BryanCutler closed the pull request at:

https://github.com/apache/spark/pull/20280


---




[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-04-18 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20280#discussion_r182569705
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
         self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

-        # field names can differ.
-        df = rdd.toDF(" a: int, b: string ")
--- End diff --

I still think this test is invalid because it only works because the `Row`s get 
serialized and lose the `__from_dict__` flag along the way.  The following should 
be the same thing, but it would fail:

```
df = spark.createDataFrame([Row(key=i, value=str(i)) for i in range(100)],
                           schema=" a: int, b: string ")
```
And it would also fail if the named args of the `Row` objects were not in 
the same alphabetical order as the schema:

```
data = [Row(z=i, y=str(i)) for i in range(100)]
rdd = self.sc.parallelize(data, 5)

df = rdd.toDF(" a: int, b: string ")
```
This fails because the `Row` fields get sorted to `('y', 'z')`, which swaps the 
value order relative to the `(int, string)` types in the schema.
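
For reference, a minimal sketch of the mechanism behind this (assuming the Spark 2.x 
behavior discussed in this thread, where a kwargs-built `Row` sorts its fields by name):

```
from pyspark.sql import Row

# A Row built from kwargs sorts its fields alphabetically, so the values
# end up in ('y', 'z') order regardless of how they were written.
r = Row(z=0, y="0")
print(r.__fields__)  # ['y', 'z']
print(tuple(r))      # ('0', 0) -- the string now sits where " a: int, ..." expects an int
```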


---




[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-01-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20280#discussion_r161964785
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
         self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

-        # field names can differ.
-        df = rdd.toDF(" a: int, b: string ")
--- End diff --

Yeah, I think you're right.  The schema should be able to override the names no 
matter what.  I was thinking it was flawed because it relied on the row fields 
being sorted internally, so just changing the kwarg names (not their order) could 
cause it to fail.  That seems a little strange, but maybe that is the intent?


---




[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-01-16 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20280#discussion_r161929011
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
         self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

-        # field names can differ.
-        df = rdd.toDF(" a: int, b: string ")
--- End diff --

But wasn't this testing `field names can differ.` by the user's schema input?


---




[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-01-16 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20280#discussion_r161865195
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2306,18 +2306,20 @@ def test_toDF_with_schema_string(self):
         self.assertEqual(df.schema.simpleString(), "struct<key:string,value:string>")
         self.assertEqual(df.collect(), [Row(key=str(i), value=str(i)) for i in range(100)])

-        # field names can differ.
-        df = rdd.toDF(" a: int, b: string ")
--- End diff --

This test was flawed: it only worked because `("a", "b")` happens to be in the 
same alphabetical order as `("key", "value")`.  If the field names were 
`("key", "aaa")` instead, it would fail.
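
Put differently, a quick illustration of the alphabetical coincidence (assuming the 
kwargs sorting described above):

```
# ("key", "value") is already in alphabetical order, so the sorted Row
# fields still line up positionally with the schema names ("a", "b").
sorted(["key", "value"])  # ['key', 'value']

# ("key", "aaa") is not, so sorting reorders the values and breaks the test.
sorted(["key", "aaa"])    # ['aaa', 'key']
```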


---




[GitHub] spark pull request #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to ...

2018-01-16 Thread BryanCutler
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/20280

[SPARK-22232][PYTHON][SQL] Fixed Row pickling to include __from_dict__ flag

## What changes were proposed in this pull request?

When a `Row` object is created using kwargs, the order of the keywords cannot be 
relied upon (except in Python 3.6+, where keyword arguments preserve their order).  
The fields are sorted in the constructor, and a flag `__from_dict__` is set to 
indicate that the object was created from kwargs so that other areas in Spark can 
access row data by field name instead of by position.  This change includes the 
`__from_dict__` flag in the pickled state, but only for a Row that was made from 
kwargs, so that the behavior is preserved when the Row is pickled.
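
A minimal sketch of the symptom, based on the internals described above (the flag is 
set by the constructor but, before this fix, not included in the pickled state):

```
import pickle
from pyspark.sql import Row

r = Row(key=0, value="0")
print(getattr(r, "__from_dict__", None))   # True -- built from kwargs

r2 = pickle.loads(pickle.dumps(r))
print(getattr(r2, "__from_dict__", None))  # None before this fix: the flag is dropped
```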

## How was this patch tested?

Fixed existing tests that relied on the fields and the schema being in the same 
alphabetical order.  Added a new test that creates a `Row` from positional 
arguments, where order matters.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark pyspark-Row-serialize-SPARK-22232

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20280






---
