git commit: [SPARK-2547]: The clustering documentation example provided for spark 0.9....

2014-07-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-0.9 c37db1537 - 7e4a0e1a0 [SPARK-2547]: The clustering documentation example provided for spark 0.9 I fixed a trivial mistake in the MLlib documentation. I checked that the Python sample code for k-means clustering can correctly

git commit: [SPARK-2580] [PySpark] keep silent in worker if the JVM closes the socket

2014-07-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 1a0a2f81a - 2693035ba [SPARK-2580] [PySpark] keep silent in worker if the JVM closes the socket During rdd.take(n), the JVM closes the socket once it has received enough data, so the Python worker should keep silent in this case. At the same time,
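The fix described above amounts to suppressing write errors once the peer has deliberately closed its end of the socket. A minimal sketch of that pattern in plain Python (the names here are illustrative, not PySpark's actual worker code):

```python
import errno

def write_quietly(write, data):
    """Write data, staying silent if the peer already closed the socket."""
    try:
        write(data)
        return True
    except (IOError, OSError) as e:
        # EPIPE / ECONNRESET mean the other side hung up on purpose.
        if e.errno in (errno.EPIPE, errno.ECONNRESET):
            return False
        raise
```

Any other error still propagates, so genuine failures are not masked.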

git commit: [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle

2014-07-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 2693035ba - e0bc72eb7 [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle Fix the problem with pickling operator.itemgetter with multiple indices. Author: Davies Liu davies@gmail.com Closes #1627 from davies/itemgetter and

[2/2] git commit: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-30 Thread joshrosen
[SPARK-2024] Add saveAsSequenceFile to PySpark JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024 This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats. * Added RDD methods ```saveAsSequenceFile```,

[1/2] [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-30 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 437dc8c5b - 94d1f46fc http://git-wip-us.apache.org/repos/asf/spark/blob/94d1f46f/python/pyspark/tests.py -- diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py

git commit: [SPARK-2737] Add retag() method for changing RDDs' ClassTags.

2014-07-30 Thread joshrosen
with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers. Author: Josh Rosen joshro...@apache.org Closes #1639 from JoshRosen/SPARK

git commit: Improvements to merge_spark_pr.py

2014-07-31 Thread joshrosen
merged. Both of these fixes are useful when backporting changes. Author: Josh Rosen joshro...@apache.org Closes #1668 from JoshRosen/pr-script-improvements and squashes the following commits: ff4f33a [Josh Rosen] Default SPARK_HOME to cwd(); detect missing JIRA credentials. ed5bc57 [Josh Rosen

git commit: Docs: monitoring, streaming programming guide

2014-07-31 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e02136214 - cc820502f Docs: monitoring, streaming programming guide Fix several awkward wordings and grammatical issues in the following documents: * docs/monitoring.md * docs/streaming-programming-guide.md Author: kballou

git commit: [SPARK-1740] [PySpark] kill the python worker

2014-08-03 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e139e2be6 - 55349f9fe [SPARK-1740] [PySpark] kill the python worker Kill only the python worker related to cancelled tasks. The daemon will start a background thread to monitor all the opened sockets for all workers. If the socket is

git commit: [SPARK-1687] [PySpark] picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e053c5581 - 59f84a953 [SPARK-1687] [PySpark] picklable namedtuple Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs. PS: pyspark should be imported BEFORE from collections import

git commit: [SPARK-1687] [PySpark] picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 3823f6d25 - bfd2f3958 [SPARK-1687] [PySpark] picklable namedtuple Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs. PS: pyspark should be imported BEFORE from collections import
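Classes produced by namedtuple at runtime normally cannot be pickled, because pickle looks classes up by module path. The hook described above works around this by giving such classes a `__reduce__` that rebuilds the class on the fly. A minimal sketch of the idea (simplified; not PySpark's actual `_hijack_namedtuple` implementation):

```python
import collections

def _restore(name, fields, values):
    # Rebuild the namedtuple class on the receiving side, then the instance.
    cls = collections.namedtuple(name, fields)
    return cls(*values)

def _hack_namedtuple(cls):
    """Make instances of a dynamically created namedtuple picklable."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

# Usage: wrap a namedtuple class so its instances survive pickling.
Point = _hack_namedtuple(collections.namedtuple("Point", "x y"))
```

The real patch replaces `collections.namedtuple` itself, which is why pyspark must be imported before modules that create namedtuples.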

git commit: [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 aa7a48ee9 - 2225d18a7 [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple The serializers module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times. Author:

git commit: [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 8e7d5ba1a - 9fd82dbbc [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple The serializers module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times. Author:

git commit: [SPARK-2898] [PySpark] fix bugs in daemon.py

2014-08-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 1d03a26a4 - 28dcbb531 [SPARK-2898] [PySpark] fix bugs in daemon.py 1. do not use a signal handler for SIGCHLD; it can easily cause deadlock 2. handle EINTR during accept() 3. pass errno into the JVM 4. handle EAGAIN during fork() Now, it can

git commit: [SPARK-2898] [PySpark] fix bugs in daemon.py

2014-08-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 bb23b118e - 92daffed4 [SPARK-2898] [PySpark] fix bugs in daemon.py 1. do not use a signal handler for SIGCHLD; it can easily cause deadlock 2. handle EINTR during accept() 3. pass errno into the JVM 4. handle EAGAIN during fork() Now, it
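Item 2 above (handling EINTR during accept()) is a classic pattern on Python 2: a blocking system call interrupted by a signal raises an error with errno set to EINTR and simply needs to be retried. A generic sketch of the retry loop (illustrative; not the actual daemon.py code, and unnecessary on Python 3.5+ where PEP 475 retries EINTR automatically):

```python
import errno

def retry_on_eintr(call, *args):
    """Retry a blocking call that may be interrupted by a signal (EINTR)."""
    while True:
        try:
            return call(*args)
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise  # any other error is a real failure
```

For example, `retry_on_eintr(server_socket.accept)` keeps the daemon's accept loop alive across signal deliveries.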

git commit: [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes

2014-08-11 Thread joshrosen
TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed. - Fix MLlib _deserialize_double on Python 2.6. Closes #1868. Closes #1042. Author: Josh Rosen joshro...@apache.org Closes #1874 from JoshRosen/python2.6 and squashes the following commits: 983d259 [Josh Rosen] [SPARK-2954] Fix

git commit: [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes

2014-08-11 Thread joshrosen
TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed. - Fix MLlib _deserialize_double on Python 2.6. Closes #1868. Closes #1042. Author: Josh Rosen joshro...@apache.org Closes #1874 from JoshRosen/python2.6 and squashes the following commits: 983d259 [Josh Rosen] [SPARK-2954] Fix

git commit: [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels

2014-08-11 Thread joshrosen
is to reset currentLocalityIndex after recomputing the locality levels. Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this. Author: Josh Rosen joshro...@apache.org Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits: 48b60b5 [Josh Rosen] Move

git commit: [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels

2014-08-11 Thread joshrosen
here is to reset currentLocalityIndex after recomputing the locality levels. Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this. Author: Josh Rosen joshro...@apache.org Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits: 48b60b5 [Josh Rosen

git commit: [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager

2014-08-16 Thread joshrosen
an opportunity to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor. Author: Josh Rosen joshro...@apache.org Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits: a9cd1e1 [Josh

git commit: [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager

2014-08-16 Thread joshrosen
to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor. Author: Josh Rosen joshro...@apache.org Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits: a9cd1e1 [Josh Rosen

git commit: [SPARK-2677] BasicBlockFetchIterator#next can wait forever

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 8c7957446 - bd3ce2ffb [SPARK-2677] BasicBlockFetchIterator#next can wait forever Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #1632 from sarutak/SPARK-2677 and squashes the following commits: cddbc7b [Kousuke Saruta]

git commit: [SPARK-3035] Wrong example with SparkContext.addFile

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 a12d3ae32 - 721f2fdc9 [SPARK-3035] Wrong example with SparkContext.addFile https://issues.apache.org/jira/browse/SPARK-3035 fix for wrong documentation. Author: iAmGhost kdh7...@gmail.com Closes #1942 from iAmGhost/master and squashes

git commit: [SPARK-3035] Wrong example with SparkContext.addFile

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master ac6411c6e - 379e7585c [SPARK-3035] Wrong example with SparkContext.addFile https://issues.apache.org/jira/browse/SPARK-3035 fix for wrong documentation. Author: iAmGhost kdh7...@gmail.com Closes #1942 from iAmGhost/master and squashes the

git commit: [SPARK-1065] [PySpark] improve support for large broadcasts

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 721f2fdc9 - 5dd571c29 [SPARK-1065] [PySpark] improve support for large broadcasts Passing large objects through Py4J is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()). Add an option to keep

git commit: Cancel the ackTimeoutMonitor in the stop method of ConnectionManager

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 2fc8aca08 - bc95fe08d Cancel the ackTimeoutMonitor in the stop method of ConnectionManager cc JoshRosen sarutak Author: GuoQiang Li wi...@qq.com Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits

git commit: Cancel the ackTimeoutMonitor in the stop method of ConnectionManager

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 5dd571c29 - f02e327f0 Cancel the ackTimeoutMonitor in the stop method of ConnectionManager cc JoshRosen sarutak Author: GuoQiang Li wi...@qq.com Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits

git commit: [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8

2014-08-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 3a5962f0f - d1d0ee41c [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 bugfix: it raised an exception when trying to encode non-ASCII strings into unicode. It should only encode unicode strings as utf-8. Author: Davies Liu

git commit: [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8

2014-08-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 cc4015d2f - e08333463 [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 bugfix: it raised an exception when trying to encode non-ASCII strings into unicode. It should only encode unicode strings as utf-8. Author: Davies Liu
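The bug above came from encoding values that were already byte strings. The fix is to encode only text (unicode) values and pass byte strings through untouched. A Python 3 analogue of that check (a sketch; the original patch targets Python 2's str/unicode split):

```python
def to_utf8(value):
    """Encode text to UTF-8 bytes; pass existing byte strings through."""
    if isinstance(value, bytes):
        return value  # already encoded, leave alone
    return value.encode("utf-8")
```

This way non-ASCII byte strings are written verbatim instead of triggering a decode-then-encode round trip.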

git commit: [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL.

2014-08-18 Thread joshrosen
tests, irrespective of whether SparkSQL itself has been modified. It also includes Davies' fix for the bug. Closes #2026. Author: Josh Rosen joshro...@apache.org Author: Davies Liu davies@gmail.com Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits: 9af2708

git commit: [SPARK-3089] Fix meaningless error message in ConnectionManager

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 7eb9cbc27 - cbfc26ba4 [SPARK-3089] Fix meaningless error message in ConnectionManager Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #2000 from sarutak/SPARK-3089 and squashes the following commits: 02dfdea [Kousuke Saruta]

git commit: SPARK-2333 - spark_ec2 script should allow option for existing security group

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 31f0b071e - 94053a7b7 SPARK-2333 - spark_ec2 script should allow option for existing security group - Uses the name tag to identify machines in a cluster. - Allows overriding the security group name so it doesn't need to coincide

git commit: SPARK-2333 - spark_ec2 script should allow option for existing security group

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 04a320862 - c3952b092 SPARK-2333 - spark_ec2 script should allow option for existing security group - Uses the name tag to identify machines in a cluster. - Allows overriding the security group name so it doesn't need to

git commit: Move a bracket in validateSettings of SparkConf

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 c3952b092 - f6b4ab83c Move a bracket in validateSettings of SparkConf Move a bracket in validateSettings of SparkConf Author: hzw19900416 carlmartin...@gmail.com Closes #2012 from hzw19900416/codereading and squashes the following

git commit: Move a bracket in validateSettings of SparkConf

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 94053a7b7 - 76eaeb452 Move a bracket in validateSettings of SparkConf Move a bracket in validateSettings of SparkConf Author: hzw19900416 carlmartin...@gmail.com Closes #2012 from hzw19900416/codereading and squashes the following

git commit: [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes.

2014-08-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 76eaeb452 - d7e80c259 [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. If two RDDs have different batch sizes in their serializers, it will try to re-serialize the one with the smaller batch size, then call
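Re-serializing to match batch sizes boils down to re-chunking one stream of batches to the other's batch size before zipping them pairwise. A minimal generator sketch of that re-chunking step (illustrative names; not PySpark's serializer code):

```python
def rebatch(batches, batch_size):
    """Re-chunk a stream of batches so it matches the given batch size."""
    buf = []
    for batch in batches:
        for item in batch:
            buf.append(item)
            if len(buf) == batch_size:
                yield buf
                buf = []
    if buf:
        yield buf  # flush the final partial batch
```

Once both sides yield batches of the same size, their elements line up one-to-one for zip().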

git commit: [Minor] fix typo

2014-08-23 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master f3d65cd0b - 76bb044b9 [Minor] fix typo Fix a typo in comment. Author: Liang-Chi Hsieh vii...@gmail.com Closes #2105 from viirya/fix_typo and squashes the following commits: 6596a80 [Liang-Chi Hsieh] fix typo. Project:

git commit: [SPARK-2871] [PySpark] add approx API for RDD

2014-08-23 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master db436e36c - 8df4dad49 [SPARK-2871] [PySpark] add approx API for RDD RDD.countApprox(self, timeout, confidence=0.95) :: Experimental :: Approximate version of count() that returns a potentially incomplete result

git commit: [FIX] fix error message in sendMessageReliably

2014-08-25 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master cc40a709c - fd8ace2d9 [FIX] fix error message in sendMessageReliably rxin Author: Xiangrui Meng m...@databricks.com Closes #2120 from mengxr/sendMessageReliably and squashes the following commits: b14400c [Xiangrui Meng] fix error

git commit: [SPARK-2871] [PySpark] add histogram() API

2014-08-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 8856c3d86 - 3cedc4f4d [SPARK-2871] [PySpark] add histogram() API RDD.histogram(buckets) Compute a histogram using the provided buckets. The buckets are all open to the right except for the last, which is closed. e.g.

git commit: [SPARK-2871] [PySpark] add histogram() API

2014-08-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 3a9d874d7 - 83d273023 [SPARK-2871] [PySpark] add histogram() API RDD.histogram(buckets) Compute a histogram using the provided buckets. The buckets are all open to the right except for the last, which is closed.
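The bucket semantics quoted above (half-open intervals [a, b), with the last bucket [a, b] closed on both ends) can be sketched per-partition in plain Python. This is an illustrative local version, not the distributed implementation:

```python
from bisect import bisect_right

def histogram(values, buckets):
    """Count values into [b0,b1), [b1,b2), ...; the last bucket is closed."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v == buckets[-1]:
            counts[-1] += 1  # the last bucket includes its right edge
            continue
        i = bisect_right(buckets, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1   # values outside all buckets are dropped
    return counts
```

For example, with buckets [0, 10, 20], the value 10 falls in the second bucket and 20 (the right edge) also counts in the second bucket.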

git commit: Fix unclosed HTML tag in Yarn docs.

2014-08-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master be043e3f2 - d8345471c Fix unclosed HTML tag in Yarn docs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8345471 Tree:

git commit: [SPARK-2871] [PySpark] add RDD.lookup(key)

2014-08-27 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 48f42781d - 4fa2fda88 [SPARK-2871] [PySpark] add RDD.lookup(key) RDD.lookup(key) Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only
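The efficiency claim in this entry — lookup scans only one partition when the RDD has a known partitioner — can be sketched with plain lists standing in for partitions (hypothetical names; not PySpark's RDD.lookup implementation):

```python
def lookup(partitions, key, partitioner=None):
    """Return all values for `key`; with a partitioner, scan one partition."""
    if partitioner is not None:
        # The partitioner tells us exactly which partition can hold the key.
        candidates = [partitions[partitioner(key)]]
    else:
        candidates = partitions  # no partitioner: must scan everything
    return [v for part in candidates for k, v in part if k == key]
```

With a hash partitioner over N partitions, this turns a full scan into a single-partition scan.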

git commit: Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with Launch More like this

2014-08-27 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 4fa2fda88 - 7faf755ae Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with Launch More like this ... copy the spark_cluster_tag from spot instance requests over to the instances. Author: Vida Ha v...@databricks.com

git commit: [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init()

2014-08-28 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 edea1efe0 - 31de05b08 [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init() The issue happens when Spark is run standalone on a cluster. When master and driver fall

git commit: [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init()

2014-08-28 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 f8f7a0c9d - fd98020a9 [SPARK-3150] Fix NullPointerException in Spark recovery: Add initializing default values in DriverInfo.init() The issue happens when Spark is run standalone on a cluster. When master and driver fall

git commit: [SPARK-3190] Avoid overflow in VertexRDD.count()

2014-08-28 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 39012452d - 96df92906 [SPARK-3190] Avoid overflow in VertexRDD.count() VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting

git commit: [SPARK-3190] Avoid overflow in VertexRDD.count()

2014-08-28 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 31de05b08 - 5481196ab [SPARK-3190] Avoid overflow in VertexRDD.count() VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting

git commit: [SPARK-3279] Remove useless field variable in ApplicationMaster

2014-08-28 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 665e71d14 - 27df6ce6a [SPARK-3279] Remove useless field variable in ApplicationMaster Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #2177 from sarutak/SPARK-3279 and squashes the following commits: 2955edc [Kousuke Saruta]

git commit: [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()

2014-08-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 27df6ce6a - e248328b3 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast() remove invalid docs Author: Davies Liu davies@gmail.com Closes #2202 from davies/keep and squashes the following commits: aa3b44f [Davies Liu]

git commit: [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()

2014-08-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 c71b5c6db - 98d0716a1 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast() remove invalid docs Author: Davies Liu davies@gmail.com Closes #2202 from davies/keep and squashes the following commits: aa3b44f [Davies

git commit: SPARK-3331 [BUILD] PEP8 tests fail because they check unzipped py4j code

2014-09-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0f16b23cd - 32ec0a8cd SPARK-3331 [BUILD] PEP8 tests fail because they check unzipped py4j code PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail

git commit: [SPARK-3332] Revert spark-ec2 patch that identifies clusters using tags

2014-09-02 Thread joshrosen
instances and logging warnings, or maybe using another mechanism to group instances into clusters. For the 1.1.0 release, though, I propose that we just revert this patch. Author: Josh Rosen joshro...@apache.org Closes #2225 from JoshRosen/revert-ec2-cluster-naming and squashes the following

git commit: [SPARK-2435] Add shutdown hook to pyspark

2014-09-03 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master c5cbc4923 - 7c6e71f05 [SPARK-2435] Add shutdown hook to pyspark Author: Matthew Farrellee m...@redhat.com Closes #2183 from mattf/SPARK-2435 and squashes the following commits: ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook

git commit: [SPARK-3399][PySpark] Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 62c557609 - 7ff8c45d7 [SPARK-3399][PySpark] Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #2270 from sarutak/SPARK-3399 and squashes the following commits:

git commit: Spark-3406 add a default storage level to python RDD persist API

2014-09-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master baff7e936 - da35330e8 Spark-3406 add a default storage level to python RDD persist API Author: Holden Karau hol...@pigscanfly.ca Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and

git commit: [SPARK-2334] fix AttributeError when calling PipelinedRDD.id()

2014-09-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 21a1e1bb8 - 110fb8b24 [SPARK-2334] fix AttributeError when calling PipelinedRDD.id() The underlying JavaRDD for a PipelinedRDD is created lazily; it is delayed until _jrdd is accessed. The id of the JavaRDD is cached as `_id`, which saves an RPC call through Py4J
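The lazy-creation-plus-caching pattern described above is easy to get wrong: if the cache slot `_id` is never initialized, reading it raises AttributeError before the lazy value is ever computed. A small self-contained sketch of the fixed shape (illustrative; `fetch_id` stands in for the Py4J call behind `_jrdd`):

```python
class LazyIdRDD:
    """Sketch of the fix: initialize the cache slot up front, fill it lazily."""

    def __init__(self, fetch_id):
        self._fetch_id = fetch_id  # stands in for the expensive JVM call
        self._id = None            # missing initialization caused AttributeError

    def id(self):
        if self._id is None:
            self._id = self._fetch_id()  # computed once, cached thereafter
        return self._id
```

Repeated calls to id() hit the cache, so only one round trip to the JVM is ever made.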

git commit: [SPARK-3415] [PySpark] removes SerializingAdapter code

2014-09-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e2614038e - ecfa76cdf [SPARK-3415] [PySpark] removes SerializingAdapter code This code removes the SerializingAdapter code that was copied from PiCloud Author: Ward Viaene ward.via...@bigdatapartnership.com Closes #2287 from

git commit: Provide a default PYSPARK_PYTHON for python/run_tests

2014-09-08 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 16a73c247 - 386bc24eb Provide a default PYSPARK_PYTHON for python/run_tests Without this the version of python used in the test is not recorded. The error is, Testing with Python version: ./run-tests: line 57: --version: command not

git commit: [HOTFIX] Fix scala style issue introduced by #2276.

2014-09-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master f0c87dc86 - 26503fdf2 [HOTFIX] Fix scala style issue introduced by #2276. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26503fdf Tree:

git commit: [SPARK-3047] [PySpark] add an option to use str in textFileRDD

2014-09-11 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master ed1980ffa - 1ef656ea8 [SPARK-3047] [PySpark] add an option to use str in textFileRDD str is much more efficient than unicode (for both CPU and memory), so it's better to use str in textFileRDD. In order to keep compatibility, use unicode by default.

git commit: [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd

2014-09-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 71af030b4 - 885d1621b [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python; there is no way to

git commit: [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd

2014-09-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 6cbf83c05 - 9c06c7230 [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python; there is no way to

git commit: [SPARK-3030] [PySpark] Reuse Python worker

2014-09-13 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0f8c4edf4 - 2aea0da84 [SPARK-3030] [PySpark] Reuse Python worker Reuse Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts for each worker, avoiding re-sending repeated broadcasts. This

git commit: [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python

2014-09-13 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 2aea0da84 - 4e3fbe8cd [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python Aggregate the number of bytes spilled into disks during aggregation or sorting, show them in Web UI.

git commit: [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.

2014-09-15 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master da33acb8b - 60050f428 [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. Also made some cosmetic cleanups. Author: Aaron Staple aaron.sta...@gmail.com Closes #2385 from staple/SPARK-1087 and squashes the

git commit: [Docs] Correct spark.files.fetchTimeout default value

2014-09-17 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 008a5ed48 - 983609a4d [Docs] Correct spark.files.fetchTimeout default value change the value of spark.files.fetchTimeout Author: viper-kun xukun...@huawei.com Closes #2406 from viper-kun/master and squashes the following commits:

git commit: [SPARK-3554] [PySpark] use broadcast automatically for large closure

2014-09-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 9306297d1 - e77fa81a6 [SPARK-3554] [PySpark] use broadcast automatically for large closures Py4J cannot handle large strings efficiently, so we should use broadcast for large closures automatically. (Broadcast uses the local filesystem to pass
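Choosing between the slow Py4J path and the broadcast path comes down to a size check on the serialized closure. A sketch of that decision (the threshold and names here are hypothetical, not the values the patch uses):

```python
ONE_MB = 1 << 20  # hypothetical cutoff; the real threshold is an assumption

def ship_command(payload, broadcast):
    """Send small serialized closures inline; broadcast large ones instead."""
    if len(payload) > ONE_MB:
        # Large: register a broadcast (backed by files) and send its handle.
        return ("broadcast", broadcast(payload))
    # Small: the bytes go straight through Py4J.
    return ("inline", payload)
```

Callers then unpack either the inline bytes or the broadcast handle on the worker side.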

git commit: [SPARK-1701] Clarify slice vs partition in the programming guide

2014-09-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master a48956f58 - be0c7563e [SPARK-1701] Clarify slice vs partition in the programming guide This is a partial solution to SPARK-1701, only addressing the documentation confusion. Additional work can be to actually change the numSlices

git commit: [SPARK-1701] [PySpark] remove slice terminology from python examples

2014-09-19 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master be0c7563e - a03e5b81e [SPARK-1701] [PySpark] remove slice terminology from python examples Author: Matthew Farrellee m...@redhat.com Closes #2304 from mattf/SPARK-1701-partition-over-slice-for-python-examples and squashes the following

git commit: Fix Java example in Streaming Programming Guide

2014-09-20 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 78d4220fa - c32c8538e Fix Java example in Streaming Programming Guide `val conf` was used instead of `SparkConf conf` in the Java snippet. Author: Santiago M. Mola sa...@mola.io Closes #2472 from smola/patch-1 and squashes the following commits:

git commit: [PySpark] remove unnecessary use of numSlices from pyspark tests

2014-09-20 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master c32c8538e - 5f8833c67 [PySpark] remove unnecessary use of numSlices from pyspark tests Author: Matthew Farrellee m...@redhat.com Closes #2467 from mattf/master-pyspark-remove-numslices-from-tests and squashes the following commits:

git commit: [SPARK-3634] [PySpark] User's module should take precedence over system modules

2014-09-24 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 50f863365 - c854b9fcb [SPARK-3634] [PySpark] User's module should take precedence over system modules Python modules added through addPyFile should take precedence over system modules. This patch put the path for user added module in the

git commit: [SPARK-3690] When closing shuffle writers we swallow a more important exception

2014-09-25 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master c3f2a8588 - 9b56e249e [SPARK-3690] When closing shuffle writers we swallow a more important exception Author: epahomov pahomov.e...@gmail.com Closes #2537 from epahomov/SPARK-3690 and squashes the following commits: a0b7de4 [epahomov]

git commit: SPARK-3745 - fix check-license to properly download and check jar

2014-09-30 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 a8c6e82de - 06b96d4a3 SPARK-3745 - fix check-license to properly download and check jar for details, see: https://issues.apache.org/jira/browse/SPARK-3745 Author: shane knapp incompl...@gmail.com Closes #2596 from

git commit: [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD

2014-10-01 Thread joshrosen
any more. cc JoshRosen , sorry for these stupid bugs. Author: Davies Liu davies@gmail.com Closes #2603 from davies/fix_broadcast and squashes the following commits: 080a743 [Davies Liu] fix bugs in broadcast large closure of RDD Project: http://git-wip-us.apache.org/repos/asf/spark/repo

git commit: SPARK-2626 [DOCS] Stop SparkContext in all examples

2014-10-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master abf588f47 - dcb2f73f1 SPARK-2626 [DOCS] Stop SparkContext in all examples Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it) Author: Sean Owen so...@cloudera.com Closes #2575 from

git commit: SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile

2014-10-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 24ee61625 - c52c231c7 SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile This patch forces use of commons http client 4.2 in Kinesis-asl profile so that the AWS SDK does not run into dependency conflicts

git commit: [SPARK-3446] Expose underlying job ids in FutureAction.

2014-10-01 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 93861a5e8 - 29c351320 [SPARK-3446] Expose underlying job ids in FutureAction. FutureAction is the only type exposed through the async APIs, so for job IDs to be useful they need to be exposed there. The complication is that some async jobs

git commit: [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset

2014-10-02 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 6e27cb630 - 5b4a5b1ac [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset ### Problem The section Using the shell in Spark Programming Guide

git commit: [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel

2014-10-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master c9ae79fba - 20ea54cc7 [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel Add a toString method to GeneralizedLinearModel, also change `__str__` to `__repr__` for some classes, to provide better message in repr. This

git commit: [SPARK-3786] [PySpark] speedup tests

2014-10-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 20ea54cc7 - 4f01265f7 [SPARK-3786] [PySpark] speedup tests This patch tries to speed up the PySpark tests by reusing the SparkContext in tests.py and mllib/tests.py to reduce the overhead of creating a SparkContext, and removes some test cases, which

git commit: [SPARK-3479] [Build] Report failed test category

2014-10-06 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 2300eb58a - 69c3f441a [SPARK-3479] [Build] Report failed test category This PR allows SparkQA (i.e. Jenkins) to report in its posts to GitHub what category of test failed, if one can be determined. The failure categories are: * general

git commit: [SPARK-3827] Very long RDD names are not rendered properly in web UI

2014-10-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 70e824f75 - d65fd554b [SPARK-3827] Very long RDD names are not rendered properly in web UI With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI. This PR fixes the rendering issue.

git commit: [SPARK-3827] Very long RDD names are not rendered properly in web UI

2014-10-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 964e3aa48 - 82ab4a796 [SPARK-3827] Very long RDD names are not rendered properly in web UI With Spark SQL we generate very long RDD names. These names are not properly rendered in the web UI. This PR fixes the rendering issue.

git commit: [SPARK-3731] [PySpark] fix memory leak in PythonRDD

2014-10-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 267c7be3b - 553183024 [SPARK-3731] [PySpark] fix memory leak in PythonRDD PythonRDD's parent.getOrCompute() is executed in a separate thread, so it should release the memory reserved for shuffle and unrolling in a finally block. Author:
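The pattern behind this fix can be sketched in plain Python (this is an illustrative mock, not Spark's code; `reserve`, `release_all`, and `compute_in_thread` are hypothetical names): when work runs in a separate thread, any per-thread resources it reserves must be released in a finally block so they are not leaked on failure.

```python
import threading

_reserved = threading.local()

def reserve(n):
    # Hypothetical bookkeeping for memory reserved by the current thread.
    _reserved.bytes = getattr(_reserved, "bytes", 0) + n

def release_all():
    # Release everything the current thread reserved.
    _reserved.bytes = 0

def compute_in_thread(work):
    """Illustrative sketch of the fix's pattern: the computation thread
    reserves memory, and a finally block guarantees it is released
    whether or not the work succeeds."""
    result = {}
    def run():
        try:
            reserve(1 << 20)          # the computation reserves memory
            result["value"] = work()
        finally:
            release_all()             # always runs, even on error
            result["leaked"] = getattr(_reserved, "bytes", 0)
    t = threading.Thread(target=run)
    t.start()
    t.join()
    return result
```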

git commit: [SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states

2014-10-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master b32bb72e8 - 5912ca671 [SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states Instead of waiting arbitrary amounts of time for the cluster to reach a specific state, this patch lets `spark-ec2` explicitly wait for
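The wait-for-state idea described above can be sketched as a generic polling loop (an illustrative mock, not spark-ec2's actual code; `describe_instances` is a hypothetical callback standing in for an EC2 status query): instead of sleeping a fixed amount, poll until every instance reports the desired state or a timeout expires.

```python
import time

def wait_for_cluster_state(describe_instances, desired_state,
                           poll_interval_s=1.0, timeout_s=600.0):
    """Poll until all instances reach desired_state, returning True on
    success and False if the timeout elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        states = describe_instances()  # hypothetical: returns a list of states
        if states and all(s == desired_state for s in states):
            return True                # stop as soon as the state is reached
        time.sleep(poll_interval_s)
    return False
```

Compared with a fixed sleep, this returns as soon as the cluster is ready and fails loudly (returns False) when it never gets there.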

git commit: [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs

2014-10-07 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master b69c9fb6f - 798ed22c2 [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs Retire Epydoc, use Sphinx to generate API docs. Refine Sphinx docs, also convert some docstrings into Sphinx style. It looks like: ![api

git commit: Fetch from branch v4 in Spark EC2 script.

2014-10-08 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master bcb1ae049 - f706823b7 Fetch from branch v4 in Spark EC2 script. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f706823b Tree:

git commit: [SPARK-3741] Make ConnectionManager propagate errors properly and add mo...

2014-10-09 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 1e0aa4deb - 73bf3f2e0 [SPARK-3741] Make ConnectionManager propagate errors properly and add mo... ...re logs to avoid Executors swallowing errors This PR made the following changes: * Register a callback to `Connection` so that the error

git commit: [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log

2014-10-09 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 2c8851343 - e7edb723d [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log The ./python/run-tests script displays messages about which test it is currently running on stdout but does not write them to unit-tests.log.

git commit: [SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython support improvements:

2014-10-09 Thread joshrosen
(to avoid breaking existing example programs). There are more details in a block comment in `bin/pyspark`. Author: Josh Rosen joshro...@apache.org Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits: 7b8eb86 [Josh Rosen] More changes to PySpark python executable

git commit: [SPARK-3886] [PySpark] use AutoBatchedSerializer by default

2014-10-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 90f73fcc4 - 72f36ee57 [SPARK-3886] [PySpark] use AutoBatchedSerializer by default Use AutoBatchedSerializer by default, which chooses the proper batch size based on the size of serialized objects, letting the size of each serialized batch fall in
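The adaptive batching described above can be sketched in plain Python (a simplified illustration inspired by PySpark's AutoBatchedSerializer; the function name and thresholds here are assumptions, not Spark's exact values): grow the batch while serialized batches stay small, and back off when they get too large, so serialized batch size hovers near a target.

```python
import pickle

def auto_batched_dump(iterator, write, best_size=1 << 16):
    """Serialize items from iterator in adaptively sized batches,
    passing each pickled batch to write()."""
    iterator = iter(iterator)
    batch = 1
    while True:
        # Take up to `batch` items; range first so zip never over-consumes.
        chunk = [x for _, x in zip(range(batch), iterator)]
        if not chunk:
            break
        data = pickle.dumps(chunk)
        write(data)
        if len(data) < best_size:
            batch *= 2           # batches are small: serialize more per call
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2          # batches got huge: back off
```

With tiny elements the batch size doubles on every call, so the per-call serialization overhead shrinks without risking enormous batches for large objects.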

git commit: [SPARK-3867][PySpark] ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed

2014-10-11 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 0e8203f4f - 81015a2ba [SPARK-3867][PySpark] ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed ./python/run-tests searches for a Python 2.6 executable on the PATH and uses it if available. When using Python 2.6, it

git commit: [SPARK-3121] Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master c86c97603 - fc616d51a [SPARK-3121] Wrong implementation of implicit bytesWritableConverter val path = ... //path to seq file with BytesWritable as type of both key and value val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)

git commit: [SPARK-3121] Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 5a21e3e7e - 0e3257906 [SPARK-3121] Wrong implementation of implicit bytesWritableConverter val path = ... //path to seq file with BytesWritable as type of both key and value val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)

git commit: [SPARK-3121] Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 b539b0e98 - dc18167ee [SPARK-3121] Wrong implementation of implicit bytesWritableConverter val path = ... //path to seq file with BytesWritable as type of both key and value val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)

git commit: [SPARK-3905][Web UI] The keys for sorting the columns of the Executor page, Stage page and Storage page are incorrect

2014-10-12 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 0e3257906 - a36116c19 [SPARK-3905][Web UI] The keys for sorting the columns of the Executor page, Stage page and Storage page are incorrect Author: GuoQiang Li wi...@qq.com Closes #2763 from witgo/SPARK-3905 and squashes the following

git commit: Add echo Run streaming tests ...

2014-10-13 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master b4a7fa7a6 - d8b8c2107 Add echo Run streaming tests ... Author: Ken Takagiwa ugw.gi.wo...@gmail.com Closes #2778 from giwa/patch-2 and squashes the following commits: a59f9a1 [Ken Takagiwa] Add echo Run streaming tests ... Project:

git commit: [SPARK-3741] Add afterExecute for handleConnectExecutor

2014-10-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e7f4ea8a5 - 56fd34af5 [SPARK-3741] Add afterExecute for handleConnectExecutor Sorry. I found that I forgot to add `afterExecute` for `handleConnectExecutor` in #2593. Author: zsxwing zsxw...@gmail.com Closes #2794 from
