[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread superbobry
Github user superbobry commented on the issue:

https://github.com/apache/spark/pull/23008
  
Interestingly, `cloudpickle` adds overhead even if the namedtuple is 
importable:

```bash
$ cat a.py 
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; 
print(len(cloudpickle.dumps(A(42, 24"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 
24"
20
```

If the namedtuple is not importable, the size of the result explodes 
because `cloudpickle` includes a full class definition along with all the 
docstrings with *every* pickled object:

```python
>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27
```

Note that the order of magnitude is incomparable to what PySpark does 
currently:

```python
>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
``` 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
BTW, let.s test them in end-to-end. For instance, 
`spark.range(1).rdd.map(lambda blabla).count()`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
If the perf diff is big, let's don't change but document that we can use 
`CloudPickleSerializer()` to avoid breaking change.

If the perf diff is rather trivial, let's check if we can keep this change. 
I will help to check the perf in this case as well.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
Nope, it should be manually done.. should be great to have it FWIW.

I am not yet sure how we're going to measure the performance. I think you 
can show the performance diff for namedtuple for now - that's going to at the 
very least show some numbers to compare.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-12 Thread superbobry
Github user superbobry commented on the issue:

https://github.com/apache/spark/pull/23008
  
Is there a benchmark suite for PySpark?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23008
  
**[Test build #98710 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)**
 for PR 23008 at commit 
[`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23008
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98710/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23008
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23008
  
**[Test build #98710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)**
 for PR 23008 at commit 
[`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23008
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23008
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23008
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-11-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23008
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org