[3/6] spark git commit: [SPARK-4897] [PySpark] Python 3 support

2015-04-16 Thread joshrosen
http://git-wip-us.apache.org/repos/asf/spark/blob/04e44b37/python/pyspark/sql/_types.py -- diff --git a/python/pyspark/sql/_types.py b/python/pyspark/sql/_types.py new file mode 100644 index 000..492c0cb --- /dev/null +++ b/pyt

spark git commit: [SPARK-6886] [PySpark] fix big closure with shuffle

2015-04-15 Thread joshrosen
in Python may be GCed, then the broadcast will be destroyed in the JVM before the PythonRDD. This PR changes to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy. cc JoshRosen
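A minimal repro sketch of the failure mode, assuming a live SparkContext `sc` (PySpark of this era shipped large task closures through an implicit broadcast; the sizes and names here are illustrative):

```python
# the lambda closes over a large list, so the serialized task is sent
# via an implicit broadcast; if Python GCs that broadcast while the RDD
# is still alive in the JVM, pre-fix jobs could fail
data = [float(i) for i in range(10 * 1024 * 1024)]
rdd = sc.parallelize(range(1), 1).map(lambda x: len(data))
print(rdd.sum())  # expected: 10485760
```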

spark git commit: Revert "[SPARK-5634] [core] Show correct message in HS when no incomplete apps f..."

2015-04-15 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 964f54478 -> 8e9fc27aa Revert "[SPARK-5634] [core] Show correct message in HS when no incomplete apps f..." This reverts commit 5845a62361c39eb97df5de01c982821c8858de76. This was reverted because it broke compilation for branch-1.2.

spark git commit: [SPARK-6886] [PySpark] fix big closure with shuffle

2015-04-15 Thread joshrosen
Python may be GCed, then the broadcast will be destroyed in the JVM before the PythonRDD. This PR changes to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy. cc JoshRosen Aut

spark git commit: Revert "[SPARK-6352] [SQL] Add DirectParquetOutputCommitter"

2015-04-14 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 4d4b24927 -> a76b921a9 Revert "[SPARK-6352] [SQL] Add DirectParquetOutputCommitter" This reverts commit b29663eeea440b1d1a288d41b5ddf67e77c5bd54. I'm reverting this because it broke test compilation for the Hadoop 1.x profiles. Project:

spark git commit: [SPARK-6905] Upgrade to snappy-java 1.1.1.7

2015-04-14 Thread joshrosen
https://github.com/xerial/snappy-java/issues/100). Author: Josh Rosen Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits: f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7. (cherry picked from commit 6adb8bcbf0a1a7bfe2990de18c59c66cd7a0aeb8) Signed-off-by: Josh Rosen Confli

spark git commit: [SPARK-6905] Upgrade to snappy-java 1.1.1.7

2015-04-14 Thread joshrosen
https://github.com/xerial/snappy-java/issues/100). Author: Josh Rosen Closes #5512 from JoshRosen/snappy-1.1.1.7 and squashes the following commits: f1ac0f8 [Josh Rosen] Upgrade to snappy-java 1.1.1.7. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/

spark git commit: [SPARK-6677] [SQL] [PySpark] fix cached classes

2015-04-11 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 ea13948b9 -> 8d4176132 [SPARK-6677] [SQL] [PySpark] fix cached classes It's possible to have two DataType objects with the same id (memory address) at different times, so we should check the cached classes to verify that each was generated by the given

spark git commit: [SPARK-6677] [SQL] [PySpark] fix cached classes

2015-04-11 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0cc8fcb4c -> 5d8f7b9e8 [SPARK-6677] [SQL] [PySpark] fix cached classes It's possible to have two DataType objects with the same id (memory address) at different times, so we should check the cached classes to verify that each was generated by the given da
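The underlying CPython behavior — `id()` is just a memory address, so a cache keyed only by id can hand back a class built for a different object — can be seen in isolation (the result is typical, not guaranteed):

```python
a = object()
stale_id = id(a)
del a                      # the address can now be recycled
b = object()
print(id(b) == stale_id)   # often True in CPython: same address, different object
```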

spark git commit: [HOTFIX] Add explicit return types to fix lint errors

2015-04-11 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 5c2844c51 -> dea5dacc5 [HOTFIX] Add explicit return types to fix lint errors Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dea5dacc Tree: http://git-wip-us.apache.org

spark git commit: [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey.

2015-04-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 7a1583917 -> daec1c635 [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved i

spark git commit: [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey.

2015-04-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 ec3e76f1e -> 48321b83d [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved i

spark git commit: [SPARK-6216] [PySpark] check the python version in worker

2015-04-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0375134f4 -> 4740d6a15 [SPARK-6216] [PySpark] check the python version in worker Author: Davies Liu Closes #5404 from davies/check_version and squashes the following commits: e559248 [Davies Liu] add tests ec33b5f [Davies Liu] check the
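A sketch of the kind of guard this adds; the environment variable name below is hypothetical, not the patch's actual transport for the version:

```python
import os
import sys

def check_worker_python_version():
    expected = os.environ.get("DRIVER_PYTHON_VERSION")   # hypothetical name
    actual = "%d.%d" % sys.version_info[:2]
    if expected and expected != actual:
        raise RuntimeError("Python in worker (%s) does not match driver (%s)"
                           % (actual, expected))
```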

spark git commit: [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey.

2015-04-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master b9baa4cd9 -> 0375134f4 [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved in ra
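The constraint the fix enforces, in isolation: `bisect.bisect_left` assumes an ascending list, so sampled range-partition bounds must stay ascending even for a descending sort, and the descending order is applied to the partition index afterwards. A standalone sketch:

```python
import bisect

bounds = [10, 20, 30]   # sampled split points, MUST stay ascending

def partition_for(key, num_partitions=4, ascending=True):
    p = bisect.bisect_left(bounds, key)
    # for a descending sort, mirror the partition index instead of
    # reversing the bounds (which would break bisect_left)
    return p if ascending else num_partitions - 1 - p

print(partition_for(15))                   # 1
print(partition_for(15, ascending=False))  # 2
```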

spark git commit: [SPARK-6753] Clone SparkConf in ShuffleSuite tests

2015-04-08 Thread joshrosen
that subclass ShuffleSuite.scala. This commit fixes that problem. JoshRosen, it would be great if you could take a look at this, since you wrote this test originally. Author: Kay Ousterhout Closes #5401 from kayousterhout/SPARK-6753 and squashes the following commits: 368c540 [Kay Ousterhout] [SPARK-6

spark git commit: [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed...

2015-04-08 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 cdef7d080 -> e967ecaca [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed... In particular, the current code makes pyspark in yarn-cluster mode fail unless SPARK_HOME is set, even though it's not really needed. Author: Marcelo

spark git commit: [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed...

2015-04-08 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 15e0d2bd1 -> f7e21dd1e [SPARK-6506] [pyspark] Do not try to retrieve SPARK_HOME when not needed... In particular, the current code makes pyspark in yarn-cluster mode fail unless SPARK_HOME is set, even though it's not really needed. Author: Marcelo Van
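The shape of the fix, sketched with invented names: resolve SPARK_HOME only on the code path that actually needs it, rather than unconditionally.

```python
import os

def launch_local_gateway():
    # only a local launch needs SPARK_HOME; yarn-cluster workers skip this path
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home is None:
        raise RuntimeError("SPARK_HOME must be set to launch a local gateway")
    return os.path.join(spark_home, "bin", "spark-submit")
```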

spark git commit: [SPARK-6737] Fix memory leak in OutputCommitCoordinator

2015-04-07 Thread joshrosen
data structures. Author: Josh Rosen Closes #5397 from JoshRosen/SPARK-6737 and squashes the following commits: af3b02f [Josh Rosen] Consolidate stage completion handling code in a single method. e96ce3a [Josh Rosen] Consolidate stage completion handling code in a single method. 3052aea [Josh Rosen] C

spark git commit: [SPARK-6737] Fix memory leak in OutputCommitCoordinator

2015-04-07 Thread joshrosen
data structures. Author: Josh Rosen Closes #5397 from JoshRosen/SPARK-6737 and squashes the following commits: af3b02f [Josh Rosen] Consolidate stage completion handling code in a single method. e96ce3a [Josh Rosen] Consolidate stage completion handling code in a single method. 3052aea [Josh Rosen] C

spark git commit: [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py

2015-04-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 1cde04f21 -> ab1b8edb8 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address

spark git commit: [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py

2015-04-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master a0846c4b6 -> 6f0d55d76 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of

spark git commit: [SPARK-6716] Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'

2015-04-06 Thread joshrosen
PI by metrics users, but it's probably okay to do this in a major release as long as we document it in the release notes. Author: Josh Rosen Closes #5372 from JoshRosen/driver-id-fix and squashes the following commits: 42d3c10 [Josh Rosen] Clarify comment 0c5d04b [Josh Rosen] Add backward

spark git commit: [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (branch-1.2)

2015-04-05 Thread joshrosen
ng a bug reproduction. This patch fixes this issue by ensuring proper cleanup of these resources. It also adds logging for unexpected error cases. (See #4944 for the corresponding PR for 1.3/1.4). Author: Josh Rosen Closes #5174 from JoshRosen/executorclassloaderleak-branch-1.2 and squa

spark git commit: [SPARK-6621][Core] Fix the bug that calling EventLoop.stop in EventLoop.onReceive/onError/onStart doesn't call onStop

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 6e1c1ec67 -> 440ea31b7 [SPARK-6621][Core] Fix the bug that calling EventLoop.stop in EventLoop.onReceive/onError/onStart doesn't call onStop Author: zsxwing Closes #5280 from zsxwing/SPARK-6621 and squashes the following commits: 521125

spark git commit: [SPARK-6621][Core] Fix the bug that calling EventLoop.stop in EventLoop.onReceive/onError/onStart doesn't call onStop

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 d21f77988 -> ac705aa83 [SPARK-6621][Core] Fix the bug that calling EventLoop.stop in EventLoop.onReceive/onError/onStart doesn't call onStop Author: zsxwing Closes #5280 from zsxwing/SPARK-6621 and squashes the following commits: 52

spark git commit: SPARK-6414: Spark driver failed with NPE on job cancelation

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 a73055f7f -> 8fa09a480 SPARK-6414: Spark driver failed with NPE on job cancelation Use Option for ActiveJob.properties to avoid NPE bug Author: Hung Lin Closes #5124 from hunglin/SPARK-6414 and squashes the following commits: 2290b6

spark git commit: SPARK-6414: Spark driver failed with NPE on job cancelation

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 a6664dcd8 -> 58e2b3fcd SPARK-6414: Spark driver failed with NPE on job cancelation Use Option for ActiveJob.properties to avoid NPE bug Author: Hung Lin Closes #5124 from hunglin/SPARK-6414 and squashes the following commits: 2290b6

spark git commit: SPARK-6414: Spark driver failed with NPE on job cancelation

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0cce5451a -> e3202aa2e SPARK-6414: Spark driver failed with NPE on job cancelation Use Option for ActiveJob.properties to avoid NPE bug Author: Hung Lin Closes #5124 from hunglin/SPARK-6414 and squashes the following commits: 2290b6b [H

spark git commit: [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup()

2015-04-02 Thread joshrosen
expensive operation if there are many (e.g. thousands) of retained jobs. This patch adds a new map to `JobProgressListener` in order to speed up these lookups. Author: Josh Rosen Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits: e39c5c7 [Josh Rosen] Addr
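The indexing idea, sketched independently of JobProgressListener's real fields: keep a group-to-job-ids map in sync on job start so the group lookup is a dictionary access rather than a scan of all retained jobs.

```python
from collections import defaultdict

class JobIndex:
    def __init__(self):
        self.jobs = {}                      # job_id -> group_id
        self.by_group = defaultdict(set)    # group_id -> {job_id, ...}

    def on_job_start(self, job_id, group_id):
        self.jobs[job_id] = group_id
        self.by_group[group_id].add(job_id)   # keep the index in sync

    def job_ids_for_group(self, group_id):
        return set(self.by_group.get(group_id, ()))   # O(1) lookup, no scan
```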

spark git commit: [SPARK-6667] [PySpark] remove setReuseAddress

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 758ebf77d -> a73055f7f [SPARK-6667] [PySpark] remove setReuseAddress The reused address on the server side caused the server to fail to acknowledge connected connections, so remove it. This PR will retry once after a timeout; it also add

spark git commit: [SPARK-6667] [PySpark] remove setReuseAddress

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 1160cc9e1 -> ee2bd70a4 [SPARK-6667] [PySpark] remove setReuseAddress The reused address on the server side caused the server to fail to acknowledge connected connections, so remove it. This PR will retry once after a timeout; it also add

spark git commit: [SPARK-6667] [PySpark] remove setReuseAddress

2015-04-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 424e987df -> 0cce5451a [SPARK-6667] [PySpark] remove setReuseAddress The reused address on the server side caused the server to fail to acknowledge connected connections, so remove it. This PR will retry once after a timeout; it also adds a ti
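A standalone sketch of the server-side pattern after the fix: no SO_REUSEADDR, a timeout on accept, and a single retry (a client is expected to connect shortly).

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# SO_REUSEADDR deliberately not set, per the fix
server.bind(("127.0.0.1", 0))
server.listen(1)
server.settimeout(3)
try:
    conn, addr = server.accept()
except socket.timeout:
    conn, addr = server.accept()   # retry once after the timeout
```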

spark git commit: [SPARK-6553] [pyspark] Support functools.partial as UDF

2015-04-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 86b439935 -> 757b2e917 [SPARK-6553] [pyspark] Support functools.partial as UDF Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used. Author: ksonj Closes #5206 from ks

spark git commit: [SPARK-6553] [pyspark] Support functools.partial as UDF

2015-04-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 bc04fa2e2 -> 98f72dfc1 [SPARK-6553] [pyspark] Support functools.partial as UDF Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used. Author: ksonj Closes #5206 fro
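Why the change was needed, shown standalone: `functools.partial` objects don't carry a `__name__`, while `repr()` works for any callable.

```python
import functools

def add(a, b):
    return a + b

inc = functools.partial(add, 1)
print(repr(inc))            # fine for any callable
try:
    print(inc.__name__)     # partial objects carry no __name__
except AttributeError as e:
    print(e)
```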

spark git commit: [SPARK-6614] OutputCommitCoordinator should clear authorized committer only after authorized committer fails, not after any failure

2015-03-31 Thread joshrosen
spotting this issue. Author: Josh Rosen Closes #5276 from JoshRosen/SPARK-6614 and squashes the following commits: d532ba7 [Josh Rosen] Check whether failed task was authorized committer cbb3784 [Josh Rosen] Add regression test for SPARK-6614 Project: http://git-wip-us.apache.org/repos/asf/spark

spark git commit: [SPARK-6614] OutputCommitCoordinator should clear authorized committer only after authorized committer fails, not after any failure

2015-03-31 Thread joshrosen
spotting this issue. Author: Josh Rosen Closes #5276 from JoshRosen/SPARK-6614 and squashes the following commits: d532ba7 [Josh Rosen] Check whether failed task was authorized committer cbb3784 [Josh Rosen] Add regression test for SPARK-6614 Project: http://git-wip-us.apache.org/repos/asf/spark/repo
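A toy sketch of the corrected logic (field names invented): the committer lock for a partition is cleared only when the failing attempt is the one that held it.

```python
class CommitCoordinator:
    def __init__(self):
        self.authorized = {}   # (stage, partition) -> authorized attempt id

    def on_task_failed(self, stage, partition, attempt):
        key = (stage, partition)
        # only the authorized committer's failure releases the lock;
        # a failing speculative copy must not evict the real committer
        if self.authorized.get(key) == attempt:
            del self.authorized[key]
```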

spark git commit: [SPARK-3266] Use intermediate abstract classes to fix type erasure issues in Java APIs

2015-03-24 Thread joshrosen
to this bug. Author: Josh Rosen Closes #5050 from JoshRosen/javardd-si-8905-fix and squashes the following commits: 2feb068 [Josh Rosen] Use intermediate abstract classes to work around SPARK-3266 d5f3e5d [Josh Rosen] Add failing regression tests for SPARK-3266 (cherry picked from com

spark git commit: [SPARK-6219] [Build] Check that Python code compiles

2015-03-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 3b5aaa6a5 -> f17d43b03 [SPARK-6219] [Build] Check that Python code compiles This PR expands the Python lint checks so that they check for obvious compilation errors in our Python code. For example: ``` $ ./dev/lint-python Python lint che
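An equivalent check using only the standard library (the target path is illustrative):

```python
import py_compile
import sys

try:
    py_compile.compile("python/pyspark/rdd.py", doraise=True)
except py_compile.PyCompileError as err:
    print(err)     # report the syntax error the way a lint run would
    sys.exit(1)
```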

spark git commit: [SPARK-6394][Core] cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler

2015-03-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 3db138742 -> 540b2a4ea [SPARK-6394][Core] cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler The current implementation includes searching a HashMap many times; we can avoid this. Actually, if you look

spark git commit: [SPARK-6313] Add config option to disable file locks/fetchFile cache to ...

2015-03-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 29e39e178 -> febb12308 [SPARK-6313] Add config option to disable file locks/fetchFile cache to ... ...support NFS mounts. This is a workaround for now, with the goal of finding a more permanent solution. https://issues.apache.org/jira/bro

spark git commit: [SPARK-6313] Add config option to disable file locks/fetchFile cache to ...

2015-03-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 9ebd6f12e -> a2a94a154 [SPARK-6313] Add config option to disable file locks/fetchFile cache to ... ...support NFS mounts. This is a workaround for now, with the goal of finding a more permanent solution. https://issues.apache.org/jira/bro

spark git commit: [SPARK-6313] Add config option to disable file locks/fetchFile cache to ...

2015-03-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0f673c21f -> 4cca3917d [SPARK-6313] Add config option to disable file locks/fetchFile cache to ... ...support NFS mounts. This is a workaround for now, with the goal of finding a more permanent solution. https://issues.apache.org/jira/browse/
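If memory serves, the option this added is `spark.files.useFetchCache`; treat the name as an assumption and verify it against the configuration docs for your Spark version.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("nfs-friendly")
# assumed flag name: disables the shared fetch-file cache and its file locks
conf.set("spark.files.useFetchCache", "false")
sc = SparkContext(conf=conf)
```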

spark git commit: [SPARK-3266] Use intermediate abstract classes to fix type erasure issues in Java APIs

2015-03-17 Thread joshrosen
to this bug. Author: Josh Rosen Closes #5050 from JoshRosen/javardd-si-8905-fix and squashes the following commits: 2feb068 [Josh Rosen] Use intermediate abstract classes to work around SPARK-3266 d5f3e5d [Josh Rosen] Add failing regression tests for SPARK-3266 (cherry picked from com

spark git commit: [SPARK-3266] Use intermediate abstract classes to fix type erasure issues in Java APIs

2015-03-17 Thread joshrosen
his bug. Author: Josh Rosen Closes #5050 from JoshRosen/javardd-si-8905-fix and squashes the following commits: 2feb068 [Josh Rosen] Use intermediate abstract classes to work around SPARK-3266 d5f3e5d [Josh Rosen] Add failing regression tests for SPARK-3266 Project: http://git-wip-us.apache.

spark git commit: [SPARK-6327] [PySpark] fix launch spark-submit from python

2015-03-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master f149b8b5e -> e3f315ac3 [SPARK-6327] [PySpark] fix launch spark-submit from python SparkSubmit should be launched without setting PYSPARK_SUBMIT_ARGS. cc JoshRosen, this mode is actually used by the python unit tests, so I will not add m
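The launch-side idea, sketched with an illustrative app name: strip PYSPARK_SUBMIT_ARGS from the child environment before exec'ing spark-submit, so the subprocess isn't configured twice.

```python
import os
import subprocess

env = dict(os.environ)
env.pop("PYSPARK_SUBMIT_ARGS", None)   # don't leak driver-side submit args
subprocess.Popen(["spark-submit", "my_app.py"], env=env)
```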

spark git commit: [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect()

2015-03-13 Thread joshrosen
memory leak in collect(), which may consume lots of memory in the JVM. This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk IO during collect and also avoids any referrers to Java objects in Python. cc JoshRosen Author: Davies Liu Clo

spark git commit: [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect()

2015-03-09 Thread joshrosen
memory leak in collect(), which may consume lots of memory in the JVM. This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk IO during collect and also avoids any referrers to Java objects in Python. cc JoshRosen Author: Davies Liu Clo

spark git commit: [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect()

2015-03-09 Thread joshrosen
memory leak in collect(), which may consume lots of memory in the JVM. This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk IO during collect and also avoids any referrers to Java objects in Python. cc JoshRosen Author: Davies Liu Closes #4

spark git commit: [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used

2015-03-05 Thread joshrosen
changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests). Author: Josh Rosen Closes #4903 from JoshRosen/SPARK-6175 and squashes the foll

spark git commit: [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used

2015-03-05 Thread joshrosen
SPARK_PUBLIC_DNS, I changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests). Author: Josh Rosen Closes #4903 from JoshRosen/SPARK-6175 and squashes
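The test pattern described — route env access through a method so a subclass can fake it — in a minimal, framework-free form:

```python
import os

class Conf:
    def getenv(self, name):
        return os.environ.get(name)

class FakeEnvConf(Conf):
    def __init__(self, env):
        self.env = env
    def getenv(self, name):   # override: the test controls the "environment"
        return self.env.get(name)

assert FakeEnvConf({"SPARK_PUBLIC_DNS": "example.com"}).getenv("SPARK_PUBLIC_DNS") == "example.com"
```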

spark git commit: [SPARK-6075] Fix bug in that caused lost accumulator updates: do not store WeakReferences in localAccums map

2015-02-28 Thread joshrosen
weak references here anyway, since this map is cleared at the end of each task. Author: Josh Rosen Closes #4835 from JoshRosen/SPARK-6075 and squashes the following commits: 4f4b5b2 [Josh Rosen] Remove defensive assertions that caused test failures in code unrelated to this change 120c7b0 [Josh Rose
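Why weak references lose updates when nothing else holds the object — a self-contained demonstration:

```python
import weakref

class Accumulator:
    def __init__(self):
        self.value = 0

acc = Accumulator()
ref = weakref.ref(acc)
del acc           # no strong reference remains
print(ref())      # None: the update stored behind this ref is gone
```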

spark git commit: [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType

2015-02-27 Thread joshrosen
inferSchema (avoiding the unnecessary converter of objects). cc pwendell JoshRosen Author: Davies Liu Closes #4808 from davies/leak and squashes the following commits: 6a322a4 [Davies Liu] tests refactor 3da44fc [Davies Liu] fix __eq__ of Singleton 534ac90 [Davies Liu] add more checks 46999dc [Davies

spark git commit: [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType

2015-02-27 Thread joshrosen
inferSchema (avoiding the unnecessary converter of objects). cc pwendell JoshRosen Author: Davies Liu Closes #4808 from davies/leak and squashes the following commits: 6a322a4 [Davies Liu] tests refactor 3da44fc [Davies Liu] fix __eq__ of Singleton 534ac90 [Davies Liu] add more checks 46999dc [Da
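A generic value-based `__eq__` of the kind such a fix moves toward — compare type and contents rather than relying on identity or a cached singleton; purely illustrative, not the patch's code:

```python
class DataType:
    def __eq__(self, other):
        return type(self) is type(other) and self.__dict__ == other.__dict__

    def __ne__(self, other):   # needed explicitly on Python 2
        return not self.__eq__(other)

    def __hash__(self):        # keep hashability consistent with __eq__;
        # assumes hashable field values
        return hash((type(self), tuple(sorted(self.__dict__.items()))))
```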

spark git commit: [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.1)

2015-02-27 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 814934da6 -> 91d0effb3 [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.1) The __eq__ of DataType is not correct, and the class cache is not used correctly (a created class cannot be found by its dataType), so it will create lots of classes

spark git commit: [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.2)

2015-02-27 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 17b7cc733 -> 576fc54e5 [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.2) The __eq__ of DataType is not correct, and the class cache is not used correctly (a created class cannot be found by its dataType), so it will create lots of classes

spark git commit: [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe

2015-02-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master cfff397f0 -> 7fa960e65 [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe Removing elements from a mutable HashSet while iterating over it can cause the iteration to incorrectly skip over entries that were not removed.

spark git commit: [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe

2015-02-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 015895ab5 -> cc7313d09 [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe Removing elements from a mutable HashSet while iterating over it can cause the iteration to incorrectly skip over entries that were not remov

spark git commit: [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe

2015-02-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 dafb3d210 -> 5d309ad6c [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe Removing elements from a mutable HashSet while iterating over it can cause the iteration to incorrectly skip over entries that were not remov
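Scala's mutable HashSet silently skips entries when you remove during iteration; CPython at least makes the same mistake loud, which shows why the pattern is unsafe in either language:

```python
s = {1, 2, 3}
try:
    for x in s:
        s.discard(x)           # mutating the set we're iterating over
except RuntimeError as e:
    print(e)                   # "Set changed size during iteration"
```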

spark git commit: [SPARK-3885] Provide mechanism to remove accumulators once they are no longer used

2015-02-22 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e4f9d03d7 -> 95cd643aa [SPARK-3885] Provide mechanism to remove accumulators once they are no longer used Instead of storing a strong reference to accumulators, I've replaced this with a weak reference and updated any code that uses these

spark git commit: [SPARK-911] allow efficient queries for a range if RDD is partitioned wi...

2015-02-22 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 275b1bef8 -> e4f9d03d7 [SPARK-911] allow efficient queries for a range if RDD is partitioned wi... ...th RangePartitioner Author: Aaron Josephs Closes #1381 from aaronjosephs/PLAT-911 and squashes the following commits: e30ade5 [Aaron J

spark git commit: [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs()

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 07a401a7b -> 7e5e4d82b [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs() This method is performance-sensitive and this change wasn't necessary. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: ht

spark git commit: [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs()

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master d46d6246d -> a51fc7ef9 [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs() This method is performance-sensitive and this change wasn't necessary. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http:/

spark git commit: [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 07d8ef9e7 -> 81202350a [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs have the same partitioner, so another unnecessary shuffle s

spark git commit: [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 117121a4e -> c3d2b90bd [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs have the same partitioner, so another unnecessary shuffle stage
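What the optimization keys on, assuming a live SparkContext `sc`: when both sides were partitioned with the same partitioner and partition count, the join can be planned as a narrow dependency.

```python
a = sc.parallelize(range(100)).map(lambda x: (x, x)).partitionBy(4)
b = sc.parallelize(range(100)).map(lambda x: (x, x * x)).partitionBy(4)
# same partitioner on both sides: post-fix, this join avoids another shuffle
joined = a.join(b)
print(joined.getNumPartitions())   # 4
```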

spark git commit: [SPARK-4172] [PySpark] Progress API in Python

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master de4836f8f -> 445a755b8 [SPARK-4172] [PySpark] Progress API in Python This patch brings the pull-based progress API into Python, along with an example in Python. Author: Davies Liu Closes #3027 from davies/progress_api and squashes the following

spark git commit: [SPARK-4172] [PySpark] Progress API in Python

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 e65dc1fd5 -> 35e23ff14 [SPARK-4172] [PySpark] Progress API in Python This patch brings the pull-based progress API into Python, along with an example in Python. Author: Davies Liu Closes #3027 from davies/progress_api and squashes the follo
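The Python side of the API, roughly as it landed (method names from memory — verify against your PySpark version); assumes a live SparkContext `sc`:

```python
tracker = sc.statusTracker()
for job_id in tracker.getActiveJobsIds():
    job = tracker.getJobInfo(job_id)
    if job is None:
        continue                      # the job may have been dropped already
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)
        if stage is not None:
            print(stage_id, stage.numCompletedTasks, "/", stage.numTasks)
```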

spark git commit: Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 b8da5c390 -> aeb85cdee Revert "[SPARK-5363] [PySpark] check ending mark in non-block way" This reverts commits ac6fe67e1d8bf01ee565f9cc09ad48d88a275829 and c06e42f2c1e5fcf123b466efd27ee4cb53bbed3f. Project: http://git-wip-us.apache.o

spark git commit: Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 432ceca2a -> 6be36d5a8 Revert "[SPARK-5363] [PySpark] check ending mark in non-block way" This reverts commits ac6fe67e1d8bf01ee565f9cc09ad48d88a275829 and c06e42f2c1e5fcf123b466efd27ee4cb53bbed3f. Project: http://git-wip-us.apache.o

spark git commit: Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"

2015-02-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master a65766bf0 -> ee6e3eff0 Revert "[SPARK-5363] [PySpark] check ending mark in non-block way" This reverts commits ac6fe67e1d8bf01ee565f9cc09ad48d88a275829 and c06e42f2c1e5fcf123b466efd27ee4cb53bbed3f. Project: http://git-wip-us.apache.org/r

spark git commit: [SPARK-5363] [PySpark] check ending mark in non-block way

2015-02-16 Thread joshrosen
the ending mark from Python in a non-blocking way, so it will not be blocked by the Python process. There is a small chance that the ending mark has been sent by the Python process but is not available right now; in that case the Python process will not be used. cc JoshRosen pwendell Author: Davies Liu Closes #4601 from davies/fre
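The non-blocking check, reduced to its core with the standard library: poll the socket with a zero timeout and only read the ending mark if it is already there.

```python
import select

def try_read_end_mark(sock):
    ready, _, _ = select.select([sock], [], [], 0)   # poll, never block
    if ready:
        return sock.recv(1)   # the one-byte ending mark
    return None               # not there yet; don't reuse this worker
```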

spark git commit: [SPARK-5395] [PySpark] fix python process leak while coalesce()

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 f468688f1 -> a39da171c [SPARK-5395] [PySpark] fix python process leak while coalesce() Currently, the Python process is released into the pool only after the task has finished, which causes many processes to be forked if coalesce() is called. This PR

spark git commit: [SPARK-5363] [PySpark] check ending mark in non-block way

2015-02-16 Thread joshrosen
the ending mark from Python in a non-blocking way, so it will not be blocked by the Python process. There is a small chance that the ending mark has been sent by the Python process but is not available right now; in that case the Python process will not be used. cc JoshRosen pwendell Author: Davies Liu Closes #4601 from davies/freeze

spark git commit: [SPARK-5788] [PySpark] capture the exception in python write thread

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 6f47114d9 -> f468688f1 [SPARK-5788] [PySpark] capture the exception in python write thread An exception in the Python writer thread will shut down the executor. Author: Davies Liu Closes #4577 from davies/exception and squashes the following

spark git commit: [SPARK-5788] [PySpark] capture the exception in python write thread

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 1294a6e01 -> b1bd1dd32 [SPARK-5788] [PySpark] capture the exception in python write thread An exception in the Python writer thread will shut down the executor. Author: Davies Liu Closes #4577 from davies/exception and squashes the following com

spark git commit: [SPARK-5788] [PySpark] capture the exception in python write thread

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 52994d83b -> c2a9a6176 [SPARK-5788] [PySpark] capture the exception in python write thread An exception in the Python writer thread will shut down the executor. Author: Davies Liu Closes #4577 from davies/exception and squashes the following

spark git commit: [SPARK-5361]Multiple Java RDD <-> Python RDD conversions not working correctly

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 1af7ca15f -> 6f47114d9 [SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly This was found through reading an RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark. It tu

spark git commit: [SPARK-5441][pyspark] Make SerDeUtil PairRDD to Python conversions more robust

2015-02-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 7f19c7c1b -> 1af7ca15f [SPARK-5441][pyspark] Make SerDeUtil PairRDD to Python conversions more robust SerDeUtil.pairRDDToPython and SerDeUtil.pythonToPairRDD now both support empty RDDs by checking the result of take(1) instead of call
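The empty-RDD behavior the fix relies on, assuming a live SparkContext `sc`:

```python
empty = sc.parallelize([])
print(empty.take(1))      # []  -- safe on an empty RDD
try:
    empty.first()         # raises ValueError on an empty RDD
except ValueError as e:
    print(e)
```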

spark git commit: [SPARK-1600] Refactor FileInputStream tests to remove Thread.sleep() calls and SystemClock usage (branch-1.2 backport)

2015-02-16 Thread joshrosen
are manually set based off of ManualClock; this eliminates many Thread.sleep calls. - Update these tests to use the withStreamingContext fixture. Author: Josh Rosen Closes #4633 from JoshRosen/spark-1600-b12-backport and squashes the following commits: e5d3dc4 [Josh Rosen] [SPARK-1600] Refacto

spark git commit: [SPARK-2313] Use socket to communicate GatewayServer port back to Python driver

2015-02-16 Thread joshrosen
Author: Josh Rosen Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits: 6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed 0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies. 9bdb4b6 [Josh Rosen] Handle case where getListening

spark git commit: [SPARK-2313] Use socket to communicate GatewayServer port back to Python driver

2015-02-16 Thread joshrosen
Author: Josh Rosen Closes #4603 from JoshRosen/SPARK-2313 and squashes the following commits: 6a7740b [Josh Rosen] Remove EchoOutputThread since it's no longer needed 0db501f [Josh Rosen] Use select() so that we don't block if GatewayServer dies. 9bdb4b6 [Josh Rosen] Handle case where getL

spark git commit: [SPARK-4905][STREAMING] FlumeStreamSuite fix.

2015-02-09 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 97541b22e -> 63eee523e [SPARK-4905][STREAMING] FlumeStreamSuite fix. Using String constructor instead of CharsetDecoder to see if it fixes the issue of empty strings in Flume test output. Author: Hari Shreedharan Closes #4371 from h

spark git commit: [SPARK-4905][STREAMING] FlumeStreamSuite fix.

2015-02-09 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 6fe70d843 -> 0765af9b2 [SPARK-4905][STREAMING] FlumeStreamSuite fix. Using String constructor instead of CharsetDecoder to see if it fixes the issue of empty strings in Flume test output. Author: Hari Shreedharan Closes #4371 from haris

spark git commit: [SPARK-4905][STREAMING] FlumeStreamSuite fix.

2015-02-09 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 6a0144c63 -> 18c5a999b [SPARK-4905][STREAMING] FlumeStreamSuite fix. Using String constructor instead of CharsetDecoder to see if it fixes the issue of empty strings in Flume test output. Author: Hari Shreedharan Closes #4371 from h

spark git commit: [HOTFIX] use --driver-java-options instead of --conf for branch-1.0

2015-02-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 a1425db96 -> 4b9234905 [HOTFIX] use --driver-java-options instead of --conf for branch-1.0 This fixes a build-break caused by b78422ae170b89fa09e8910e247cbfecc23442f8, a previous hotfix. Project: http://git-wip-us.apache.org/repos/asf

spark git commit: SPARK-5425: Use synchronised methods in system properties to create SparkConf

2015-02-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 d89964f86 -> 4bad85485 SPARK-5425: Use synchronised methods in system properties to create SparkConf SPARK-5425: Fixed usages of system properties This patch fixes a few problems caused by the fact that the Scala wrapper over system pro

spark git commit: [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and 2.4 profiles

2015-02-07 Thread joshrosen
the hadoop-2.3 or hadoop-2.4 profiles. The jets3t release notes can be found at http://www.jets3t.org/RELEASE_NOTES.html Author: Josh Rosen Closes #4454 from JoshRosen/SPARK-5671 and squashes the following commits: fa6cb3e [Josh Rosen] [SPARK-5671] Upgrade jets3t to 0.9.2 in hadoop-2.3 and

spark git commit: SPARK-5403: Ignore UserKnownHostsFile in SSH calls

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0e23ca9f8 -> e772b4e4e SPARK-5403: Ignore UserKnownHostsFile in SSH calls See https://issues.apache.org/jira/browse/SPARK-5403 Author: Grzegorz Dubicki Closes #4196 from grzegorz-dubicki/SPARK-5403 and squashes the following commits: a

spark git commit: SPARK-5403: Ignore UserKnownHostsFile in SSH calls

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 11b28b9b4 -> 3d99741b2 SPARK-5403: Ignore UserKnownHostsFile in SSH calls See https://issues.apache.org/jira/browse/SPARK-5403 Author: Grzegorz Dubicki Closes #4196 from grzegorz-dubicki/SPARK-5403 and squashes the following commits
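The SSH invocation pattern the fix amounts to — skip recording and checking host keys for freshly launched, frequently recycled instances (command assembled for illustration):

```python
ssh_command = [
    "ssh",
    "-o", "StrictHostKeyChecking=no",
    "-o", "UserKnownHostsFile=/dev/null",   # don't pollute ~/.ssh/known_hosts
    "root@ec2-host.example.com",
]
```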

spark git commit: SPARK-5633 pyspark saveAsTextFile support for compression codec

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 87e0f0dc6 -> 1d3234165 SPARK-5633 pyspark saveAsTextFile support for compression codec See https://issues.apache.org/jira/browse/SPARK-5633 for details Author: Vladimir Vladimirov Closes #4403 from smartkiwi/master and squashes the f

spark git commit: SPARK-5633 pyspark saveAsTextFile support for compression codec

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 65181b751 -> b3872e00d SPARK-5633 pyspark saveAsTextFile support for compression codec See https://issues.apache.org/jira/browse/SPARK-5633 for details Author: Vladimir Vladimirov Closes #4403 from smartkiwi/master and squashes the follo
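Usage as added by the PR, assuming a live SparkContext `sc`:

```python
rdd = sc.parallelize(["line one", "line two"])
rdd.saveAsTextFile(
    "/tmp/out-gz",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
```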

spark git commit: [SPARK-4983] Insert waiting time before tagging EC2 instances

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 3d3ecd774 -> 0f3a36071 [SPARK-4983] Insert waiting time before tagging EC2 instances The boto API doesn't support tagging EC2 instances in the same call that launches them. We add a five-second wait so EC2 has enough time to propagate the info
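The shape of the workaround (boto2-era calls, shown as an assumption; `instances` stands in for the launch call's result):

```python
import time

time.sleep(5)   # give EC2 time to propagate the new instances
for instance in instances:
    instance.add_tag("Name", "my-cluster-node")   # boto2 Instance.add_tag
```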

spark git commit: [SPARK-4983] Insert waiting time before tagging EC2 instances

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.2 09da688b0 -> 36f70de83 [SPARK-4983] Insert waiting time before tagging EC2 instances The boto API doesn't support tagging EC2 instances in the same call that launches them. We add a five-second wait so EC2 has enough time to propagate the

spark git commit: [SPARK-4983] Insert waiting time before tagging EC2 instances

2015-02-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.3 2ef9853e7 -> 2872d8344 [SPARK-4983] Insert waiting time before tagging EC2 instances The boto API doesn't support tagging EC2 instances in the same call that launches them. We add a five-second wait so EC2 has enough time to propagate the
