[GitHub] spark issue #23008: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/23008

Interestingly, `cloudpickle` adds overhead even if the namedtuple is importable:

```bash
$ cat a.py
from collections import namedtuple
A = namedtuple("A", ["foo", "bar"])
$ python -c "from a import A; import cloudpickle; print(len(cloudpickle.dumps(A(42, 24))))"
30
$ python -c "from a import A; import pickle; print(len(pickle.dumps(A(42, 24))))"
20
```

If the namedtuple is not importable, the size of the result explodes because `cloudpickle` includes a full class definition, along with all the docstrings, with *every* pickled object:

```python
>>> from collections import namedtuple
>>> A = namedtuple("A", ["foo", "bar"])
>>> import cloudpickle
>>> len(cloudpickle.dumps(A(42, 24)))
3836
>>> import pickle
>>> len(pickle.dumps(A(42, 24)))
27
```

Note that this is far larger than what PySpark currently produces:

```python
>>> import pickle
>>> from collections import namedtuple
>>> import pyspark
>>> A = namedtuple("A", ["foo", "bar"])
>>> len(pickle.dumps(A(42, 24)))
79
```

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
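The size comparison above can be scripted so that it is easy to rerun. The following is only a sketch under a few assumptions: `pickled_size` is a hypothetical helper name, `difflib.Match` stands in for "an importable namedtuple" because it ships with the standard library, and `cloudpickle` is treated as optional and skipped when it is not installed. It reports sizes rather than asserting any particular numbers, since those depend on the Python version and pickle protocol.

```python
import pickle
from difflib import Match  # a namedtuple class defined in the standard library


def pickled_size(obj, dumps=pickle.dumps):
    """Return the number of bytes that `dumps` produces for `obj`."""
    return len(dumps(obj))


record = Match(a=1, b=2, size=3)
plain = (1, 2, 3)

sizes = {
    "plain tuple (pickle)": pickled_size(plain),
    "importable namedtuple (pickle)": pickled_size(record),
}

# cloudpickle is an optional third-party package; skip it if unavailable.
try:
    import cloudpickle
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    sizes["importable namedtuple (cloudpickle)"] = pickled_size(
        record, cloudpickle.dumps
    )

for label, size in sizes.items():
    print(f"{label}: {size} bytes")

# The stock pickler stores a reference to the class (module and name),
# not the class definition, which is why the payload stays small.
payload = pickle.dumps(record)
print("class referenced by name:", b"difflib" in payload and b"Match" in payload)
```

A namedtuple created interactively (not importable) would be the interesting third row, but the stock pickler cannot always serialize such a class, so it is left out of this sketch.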
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008

BTW, let's test them end-to-end. For instance, `spark.range(1).rdd.map(lambda blabla).count()`
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008

If the perf diff is big, let's not make this change, and instead document that `CloudPickleSerializer()` can be used, to avoid a breaking change. If the perf diff is rather trivial, let's check whether we can keep this change. I will help to check the perf in this case as well.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008

Nope, it has to be done manually. It would be great to have one, FWIW. I am not yet sure how we're going to measure the performance. I think you can show the performance diff for namedtuple for now; that will at the very least give us some numbers to compare.
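As a starting point for the namedtuple numbers suggested above, a micro-benchmark could look roughly like the sketch below. It is not an existing PySpark benchmark: `time_roundtrip` is a hypothetical helper, `difflib.Match` is used as a stand-in importable namedtuple, and only the standard-library `pickle` is timed. Any other `dumps`/`loads` pair (for example from `cloudpickle`, if installed) could be passed in the same way.

```python
import pickle
import timeit
from difflib import Match  # stand-in for an importable namedtuple


def time_roundtrip(dumps, loads, records, repeat=3, number=5):
    """Best observed time in seconds for one dumps + loads of `records`."""

    def roundtrip():
        loads(dumps(records))

    # timeit.repeat returns total times for `number` calls; take the best run.
    return min(timeit.repeat(roundtrip, repeat=repeat, number=number)) / number


records = [Match(i, i + 1, 1) for i in range(10_000)]
plain = [tuple(r) for r in records]

results = {
    "namedtuple": time_roundtrip(pickle.dumps, pickle.loads, records),
    "plain tuple": time_roundtrip(pickle.dumps, pickle.loads, plain),
}

for label, seconds in results.items():
    print(f"{label}: {seconds * 1e3:.2f} ms per round trip")
```

This only measures serialization in a single process; it says nothing about the end-to-end cost inside a Spark job, which would still need to be checked separately.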
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/23008

Is there a benchmark suite for PySpark?
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23008

**[Test build #98710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23008

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98710/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23008

Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23008

**[Test build #98710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98710/testReport)** for PR 23008 at commit [`9a81879`](https://github.com/apache/spark/commit/9a818797603f5804b32202d28474493c80966f58).
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23008

ok to test
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23008

Can one of the admins verify this patch?