[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52385422 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52385516 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18666/consoleFull) for PR 1912 at commit

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52386383 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18666/consoleFull) for PR 1912 at commit

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16327343 --- Diff: python/pyspark/broadcast.py --- @@ -52,17 +50,38 @@ class Broadcast(object): Access its value through C{.value}. -

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52409331 I guess we don't necessarily want to expose `destroy()` to the end-user, since it's private in the Scala APIs. I suppose we might still be leaking broadcast variables

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52409343 Actually, I'm just going to merge this now and I'll add the docstring as part of a subsequent documentation-improvement PR (I also want to edit some Scala / Java docs,

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52409390 I've merged this into `master` and `branch-1.1`. Thanks!

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-16 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1912

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52277037 I have added Broadcast.unpersist(blocking=False). Because we have a copy on disk, we can read it from there when the user wants to access it in the driver, then we can keep the
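
The idea in the comment above (drop the in-memory value, reload it from the on-disk copy on the next access) can be modelled without Spark. This is only a sketch: `DiskBackedValue` and its `cleanup()` method are hypothetical names, the class is not PySpark's `Broadcast`, and the PR's actual implementation is not shown in this thread.

```python
import os
import pickle
import tempfile


class DiskBackedValue(object):
    """Hypothetical stand-in for a broadcast handle; NOT PySpark's Broadcast."""

    def __init__(self, value):
        # Keep a pickled copy on disk so the in-memory copy can be dropped later.
        fd, self._path = tempfile.mkstemp(suffix=".pkl")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
        self._value = value
        self._loaded = True

    @property
    def value(self):
        # Reload lazily from the on-disk copy if the value was unpersisted.
        if not self._loaded:
            with open(self._path, "rb") as f:
                self._value = pickle.load(f)
            self._loaded = True
        return self._value

    def unpersist(self):
        # Release the in-memory copy; the file stays, so .value still works.
        self._value = None
        self._loaded = False

    def cleanup(self):
        # Assumed helper: drop both the in-memory copy and the file.
        self.unpersist()
        if os.path.exists(self._path):
            os.remove(self._path)
```

In this model, `unpersist()` trades memory for a reload cost on the next `.value` access, which seems to be the trade-off the comment describes.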

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52377800 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52377931 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18647/consoleFull) for PR 1912 at commit

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52379131 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18647/consoleFull) for PR 1912 at commit

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52248668 @frol After fixing your local test, are you still noticing any broadcast performance issues? If you still see any odd behavior, could you post a small script or set

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread frol
Github user frol commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52251712 @JoshRosen No, I'm not noticing any broadcast performance issues now. PySpark works like a charm again. Thank you!

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16279926 --- Diff: python/pyspark/broadcast.py --- @@ -52,17 +47,31 @@ class Broadcast(object): Access its value through C{.value}. -

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52273858 It occurs to me: what if we had .value retrieve and depickle the value from the JVM? Also, won't we still experience memory leaks in the JVM if we iteratively create

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52276573 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18598/consoleFull) for PR 1912 at commit

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16159823 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -315,6 +315,15 @@ private[spark] object PythonRDD extends Logging {

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread andrewor14
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52015135 test this please

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52016367 QA tests have started for PR 1912. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18430/consoleFull

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread andrewor14
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52016718 I was talking to Jenkins when I said test this please, but thanks @davies for adding tests too.

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52017380 LoL, I realized this just after pushing the commit :)

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52019343 QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52076888 QA tests have started for PR 1912. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18447/consoleFull

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52076907 QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread frol
Github user frol commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52079674 @davies I am about to test it again with CompressedSerializer. Am I right that I don't need to change anything in my project, but just rebuild Spark?

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52094561 @frol , Yes, thanks again!

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52095557 QA tests have started for PR 1912. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18456/consoleFull

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16199114 --- Diff: python/pyspark/broadcast.py --- @@ -19,18 +19,13 @@ from pyspark.context import SparkContext sc = SparkContext('local', 'test')

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16199423 --- Diff: python/pyspark/rdd.py --- @@ -1809,7 +1809,8 @@ def _jrdd(self): self._jrdd_deserializer = NoOpSerializer() command

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16199737 --- Diff: python/pyspark/context.py --- @@ -562,17 +562,24 @@ def union(self, rdds): rest = ListConverter().convert(rest,

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52099498 This looks good to me and I'm really glad to read the [JIRA

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16200173 --- Diff: python/pyspark/context.py --- @@ -562,17 +562,24 @@ def union(self, rdds): rest = ListConverter().convert(rest,

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52101740 QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52103597 QA tests have started for PR 1912. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18460/consoleFull

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52110417 QA results for PR 1912:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread frol
Github user frol commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52119630 @davies Compression improved things, but my tasks have heavy computations inside, so it saved only 10 seconds on a 4.5-minute task and also about 10-20 seconds on a

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52121865 @frol, the big win of compression may be the memory saved in the JVM. It's also a win if it does not increase the runtime. In the future, we could try LZ4; it may help a
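
To make the memory argument concrete, here is a minimal sketch of compressing a pickled payload before handing it off. It uses `zlib` from the standard library as an assumption; the thread does not show how `CompressedSerializer` is implemented, and LZ4 is only mentioned as a future option. The helper names `dumps_compressed` and `loads_compressed` and the sample `payload` are hypothetical.

```python
import pickle
import zlib


def dumps_compressed(obj, level=1):
    # Pickle first, then compress; a low level favours speed over ratio.
    return zlib.compress(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL), level)


def loads_compressed(data):
    # Reverse the two steps: decompress, then unpickle.
    return pickle.loads(zlib.decompress(data))


# Illustrative, highly repetitive payload; real broadcast data will compress less.
payload = {"weights": [0.0] * 100000}
raw = pickle.dumps(payload, pickle.HIGHEST_PROTOCOL)
packed = dumps_compressed(payload)
print(len(raw), len(packed))
```

How much is saved depends entirely on the data, and the compression step costs CPU time, which fits the observation earlier in the thread that the runtime gain was small for compute-heavy tasks.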

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-13 Thread frol
Github user frol commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52122539 @davies I'm talking about memory in Python workers and it is my issue. (I figured out that my local test had a mistake and after I fixed it, the local test and Spark Python

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-12 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1912 [SPARK-1065] [PySpark] improve supporting for large broadcast Passing a large object through py4j is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()).
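
The file-based handoff described in the PR summary can be sketched in plain Python: the producer pickles the object to a temporary file and passes only the path, and the consumer reads the file back. This illustrates the general technique and is not the PR's code; `dump_to_tempfile` and `load_from_file` are hypothetical names, and the JVM side that would receive the path is left out.

```python
import os
import pickle
import tempfile


def dump_to_tempfile(obj, directory=None):
    # Write the pickled object to a temp file and return only its path,
    # so the large payload never crosses the RPC bridge itself.
    fd, path = tempfile.mkstemp(suffix=".pkl", dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return path


def load_from_file(path):
    # The receiving side reads the payload straight from disk.
    with open(path, "rb") as f:
        return pickle.load(f)


big = list(range(1000000))
path = dump_to_tempfile(big)
restored = load_from_file(path)
os.remove(path)  # the caller is responsible for deleting the temp file
```

Only a short path string crosses the process boundary here, so the cost of the handoff no longer grows with the object size on that channel.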

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-51999152 QA results for PR 1912:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52001814 failed tests were not related to this PR