git commit: [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8

2014-08-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 3a5962f0f -> d1d0ee41c [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 bugfix: It raises an exception when it tries to encode non-ASCII strings into unicode. It should only encode unicode strings as UTF-8. Author: Davies Liu
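The fix is easiest to see as code. Below is a minimal sketch of the corrected encoding rule under Python 2 semantics (which PySpark used at the time); the helper name is illustrative, not Spark's actual internal function.

```python
# Sketch of the corrected saveAsTextFile() encoding rule (Python 2 era).
# prepare_line is an illustrative name, not the actual PySpark internal.
def prepare_line(x):
    if not isinstance(x, basestring):  # non-strings become unicode first
        x = unicode(x)
    if isinstance(x, unicode):         # only unicode is encoded as UTF-8
        x = x.encode("utf-8")
    return x                           # byte strings pass through untouched
```

The earlier code also tried to encode byte strings, which is what raised the exception on non-ASCII input.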

git commit: [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8

2014-08-18 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 cc4015d2f -> e08333463 [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8 bugfix: It raises an exception when it tries to encode non-ASCII strings into unicode. It should only encode unicode strings as UTF-8. Author: Davies Liu

git commit: [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL.

2014-08-18 Thread joshrosen
tests, irrespective of whether SparkSQL itself has been modified. It also includes Davies' fix for the bug. Closes #2026. Author: Josh Rosen joshro...@apache.org Author: Davies Liu davies@gmail.com Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits: 9af2708

git commit: [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager

2014-08-16 Thread joshrosen
an opportunity to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor. Author: Josh Rosen joshro...@apache.org Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits: a9cd1e1 [Josh
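For illustration, a tiny sketch of the constructor-injection idea the message alludes to (plain Python, not Spark's Scala code): creating the ShuffleManager first and handing it to the BlockManager makes the initialization order explicit.

```python
class ShuffleManager(object):
    pass

class BlockManager(object):
    def __init__(self, shuffle_manager):
        # the dependency arrives fully constructed instead of being looked
        # up lazily, so it can never be observed half-initialized
        self.shuffle_manager = shuffle_manager

# explicit ordering: the ShuffleManager exists before BlockManager needs it
block_manager = BlockManager(ShuffleManager())
```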

git commit: [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager

2014-08-16 Thread joshrosen
to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor. Author: Josh Rosen joshro...@apache.org Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits: a9cd1e1 [Josh Rosen

git commit: [SPARK-2677] BasicBlockFetchIterator#next can wait forever

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 8c7957446 -> bd3ce2ffb [SPARK-2677] BasicBlockFetchIterator#next can wait forever Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #1632 from sarutak/SPARK-2677 and squashes the following commits: cddbc7b [Kousuke Saruta]

git commit: [SPARK-3035] Wrong example with SparkContext.addFile

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 a12d3ae32 -> 721f2fdc9 [SPARK-3035] Wrong example with SparkContext.addFile https://issues.apache.org/jira/browse/SPARK-3035 A fix for the incorrect documentation example. Author: iAmGhost kdh7...@gmail.com Closes #1942 from iAmGhost/master and squashes

git commit: [SPARK-3035] Wrong example with SparkContext.addFile

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master ac6411c6e -> 379e7585c [SPARK-3035] Wrong example with SparkContext.addFile https://issues.apache.org/jira/browse/SPARK-3035 A fix for the incorrect documentation example. Author: iAmGhost kdh7...@gmail.com Closes #1942 from iAmGhost/master and squashes the
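For reference, the usage pattern the corrected example demonstrates looks roughly like this (the file path is made up; workers must resolve the file through SparkFiles.get rather than the driver-side path):

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "addFile example")
sc.addFile("/tmp/numbers.txt")  # illustrative path on the driver

def read_on_worker(_):
    # workers locate their local copy via SparkFiles.get(filename)
    with open(SparkFiles.get("numbers.txt")) as f:
        return [f.read()]

print(sc.parallelize([0]).flatMap(read_on_worker).first())
```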

git commit: [SPARK-1065] [PySpark] improve support for large broadcasts

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 721f2fdc9 -> 5dd571c29 [SPARK-1065] [PySpark] improve support for large broadcasts Passing large objects through Py4J is very slow (and costs a lot of memory), so broadcast objects are passed via files instead (similar to parallelize()). Add an option to keep
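The user-facing broadcast API is unchanged by this; only the transport moved from the Py4J socket to files. A usage sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local", "broadcast example")
# a largish lookup table: now shipped to executors via a file rather than
# being pushed through Py4J, but created and used exactly as before
lookup = sc.broadcast(dict((i, i * i) for i in range(100000)))
print(sc.parallelize([1, 2, 3]).map(lambda k: lookup.value[k]).collect())
```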

git commit: Cancel the ackTimeoutMonitor in the stop method of ConnectionManager

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 2fc8aca08 -> bc95fe08d Cancel the ackTimeoutMonitor in the stop method of ConnectionManager. cc JoshRosen sarutak Author: GuoQiang Li wi...@qq.com Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits
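The shape of the fix, sketched in plain Python (the real code is Scala; class and field names here are stand-ins):

```python
import threading

class ConnectionManagerLike(object):
    """Stand-in for the Scala ConnectionManager, for illustration only."""
    def __init__(self):
        self.ack_timeout_monitor = threading.Timer(60.0, self._on_ack_timeout)
        self.ack_timeout_monitor.start()

    def _on_ack_timeout(self):
        print("ack timed out")

    def stop(self):
        # the fix: cancel the pending timeout task on shutdown, so the
        # monitor thread does not outlive the manager
        self.ack_timeout_monitor.cancel()
```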

git commit: Cancel the ackTimeoutMonitor in the stop method of ConnectionManager

2014-08-16 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 5dd571c29 -> f02e327f0 Cancel the ackTimeoutMonitor in the stop method of ConnectionManager. cc JoshRosen sarutak Author: GuoQiang Li wi...@qq.com Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits

git commit: [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes

2014-08-11 Thread joshrosen
TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed. - Fix MLlib _deserialize_double on Python 2.6. Closes #1868. Closes #1042. Author: Josh Rosen joshro...@apache.org Closes #1874 from JoshRosen/python2.6 and squashes the following commits: 983d259 [Josh Rosen] [SPARK-2954] Fix

git commit: [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes

2014-08-11 Thread joshrosen
TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed. - Fix MLlib _deserialize_double on Python 2.6. Closes #1868. Closes #1042. Author: Josh Rosen joshro...@apache.org Closes #1874 from JoshRosen/python2.6 and squashes the following commits: 983d259 [Josh Rosen] [SPARK-2954] Fix

git commit: [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels

2014-08-11 Thread joshrosen
is to reset currentLocalityIndex after recomputing the locality levels. Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this. Author: Josh Rosen joshro...@apache.org Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits: 48b60b5 [Josh Rosen] Move
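A sketch of the idea in plain Python (the real TaskSetManager is Scala; this is illustrative): a saved index into the old list of locality levels can point at the wrong entry once the list is recomputed, so the fix resets it.

```python
class TaskSetManagerLike(object):
    """Illustrative stand-in, not Spark's TaskSetManager."""
    def __init__(self):
        self.locality_levels = ["PROCESS_LOCAL", "NODE_LOCAL", "ANY"]
        self.current_locality_index = 0

    def recompute_locality_levels(self, new_levels):
        self.locality_levels = new_levels
        # the fix: an index into the old list is meaningless for the new
        # one, so start again from the most local level
        self.current_locality_index = 0
```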

git commit: [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels

2014-08-11 Thread joshrosen
here is to reset currentLocalityIndex after recomputing the locality levels. Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this. Author: Josh Rosen joshro...@apache.org Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits: 48b60b5 [Josh Rosen

git commit: [SPARK-2898] [PySpark] fix bugs in daemon.py

2014-08-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 1d03a26a4 -> 28dcbb531 [SPARK-2898] [PySpark] fix bugs in daemon.py 1. do not use a signal handler for SIGCHLD, since it can easily cause deadlock; 2. handle EINTR during accept(); 3. pass errno into the JVM; 4. handle EAGAIN during fork(). Now, it can
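Fix 2 (retrying accept() on EINTR) follows a standard pattern; a minimal sketch, not the daemon's exact code:

```python
import errno
import socket

def accept_retrying_eintr(server_sock):
    # retry accept() when a signal interrupts it (EINTR) instead of letting
    # the daemon fall over; any other error still propagates
    while True:
        try:
            return server_sock.accept()
        except socket.error as e:
            if e.errno != errno.EINTR:
                raise
```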

git commit: [SPARK-2898] [PySpark] fix bugs in daemon.py

2014-08-10 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 bb23b118e -> 92daffed4 [SPARK-2898] [PySpark] fix bugs in daemon.py 1. do not use a signal handler for SIGCHLD, since it can easily cause deadlock; 2. handle EINTR during accept(); 3. pass errno into the JVM; 4. handle EAGAIN during fork(). Now, it

git commit: [SPARK-1687] [PySpark] picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e053c5581 -> 59f84a953 [SPARK-1687] [PySpark] picklable namedtuple Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs. PS: pyspark should be imported BEFORE from collections import
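The hook works because pickle normally serializes classes by reference, which fails for namedtuple classes created at runtime. A condensed sketch of the idea (PySpark's real implementation lives in its serializers module and differs in detail):

```python
import collections

_old_namedtuple = collections.namedtuple

def _rebuild(name, fields, values):
    # recreate the class at unpickling time, then the instance
    return _old_namedtuple(name, fields)(*values)

def _picklable_namedtuple(name, fields, **kwargs):
    cls = _old_namedtuple(name, fields, **kwargs)
    # instances now carry enough information to rebuild their own class
    cls.__reduce__ = lambda self: (_rebuild, (name, cls._fields, tuple(self)))
    return cls

collections.namedtuple = _picklable_namedtuple  # the "hijack"
```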

git commit: [SPARK-1687] [PySpark] picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 3823f6d25 -> bfd2f3958 [SPARK-1687] [PySpark] picklable namedtuple Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs. PS: pyspark should be imported BEFORE from collections import

git commit: [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.1 aa7a48ee9 -> 2225d18a7 [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple The serializer is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times. Author:

git commit: [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple

2014-08-04 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 8e7d5ba1a -> 9fd82dbbc [SPARK-1687] [PySpark] fix unit tests related to picklable namedtuple The serializer is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times. Author:
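Making the hook idempotent is just a sentinel check; a sketch (names illustrative, pickling details elided):

```python
import collections

def _hijack_namedtuple_once():
    if getattr(collections.namedtuple, "_hijacked", False):
        return  # already replaced; repeated imports become no-ops
    original = collections.namedtuple

    def hijacked(name, fields, **kwargs):
        return original(name, fields, **kwargs)  # plus pickle hook, elided

    hijacked._hijacked = True
    collections.namedtuple = hijacked
```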

git commit: [SPARK-1740] [PySpark] kill the python worker

2014-08-03 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e139e2be6 -> 55349f9fe [SPARK-1740] [PySpark] kill the python worker Kill only the Python workers related to cancelled tasks. The daemon will start a background thread to monitor all the open sockets for all workers. If the socket is
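A sketch of the daemon-side idea (illustrative, not Spark's actual daemon code): one watcher per worker socket; an empty read means the JVM closed its end, i.e. the task was cancelled, so the worker is killed.

```python
import threading

def watch_worker(sock, worker_process):
    # worker_process could be e.g. a multiprocessing.Process
    def watch():
        if sock.recv(1) == b"":   # empty read: JVM closed the socket
            worker_process.terminate()
    t = threading.Thread(target=watch)
    t.daemon = True
    t.start()
```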

git commit: Improvements to merge_spark_pr.py

2014-07-31 Thread joshrosen
merged. Both of these fixes are useful when backporting changes. Author: Josh Rosen joshro...@apache.org Closes #1668 from JoshRosen/pr-script-improvements and squashes the following commits: ff4f33a [Josh Rosen] Default SPARK_HOME to cwd(); detect missing JIRA credentials. ed5bc57 [Josh Rosen
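The two improvements named in the preview would look roughly like this (a sketch; variable names are assumptions, not necessarily the script's):

```python
import os
import sys

# default SPARK_HOME to the current working directory
SPARK_HOME = os.environ.get("SPARK_HOME", os.getcwd())

# detect missing JIRA credentials up front rather than failing mid-merge
JIRA_USERNAME = os.environ.get("JIRA_USERNAME", "")
JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD", "")
if not (JIRA_USERNAME and JIRA_PASSWORD):
    sys.exit("Set JIRA_USERNAME and JIRA_PASSWORD to close JIRA issues")
```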

git commit: Docs: monitoring, streaming programming guide

2014-07-31 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master e02136214 -> cc820502f Docs: monitoring, streaming programming guide Fix several instances of awkward wording and grammatical issues in the following documents: * docs/monitoring.md * docs/streaming-programming-guide.md Author: kballou

[2/2] git commit: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-30 Thread joshrosen
[SPARK-2024] Add saveAsSequenceFile to PySpark JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024 This PR is a follow-up to #455 and adds the ability to save PySpark RDDs using SequenceFile or any Hadoop OutputFormat. * Added RDD methods ```saveAsSequenceFile```,
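Usage of the new method is a one-liner; a sketch (the output path is made up, and reading support came from the earlier #455 work):

```python
from pyspark import SparkContext

sc = SparkContext("local", "sequencefile example")
rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
rdd.saveAsSequenceFile("/tmp/seqfile-demo")            # write as SequenceFile
print(sc.sequenceFile("/tmp/seqfile-demo").collect())  # round-trip back
```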

[1/2] [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-30 Thread joshrosen
Repository: spark Updated Branches: refs/heads/master 437dc8c5b -> 94d1f46fc http://git-wip-us.apache.org/repos/asf/spark/blob/94d1f46f/python/pyspark/tests.py diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py

git commit: [SPARK-2737] Add retag() method for changing RDDs' ClassTags.

2014-07-30 Thread joshrosen
with an incorrect ClassTag by wrapping it and overriding its ClassTag. This should be okay for cases where the Scala code that calls collect() knows what type of array should be allocated, which is the case in the MLlib wrappers. Author: Josh Rosen joshro...@apache.org Closes #1639 from JoshRosen/SPARK

git commit: [SPARK-2580] [PySpark] keep silent in worker if the JVM closes the socket

2014-07-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 1a0a2f81a -> 2693035ba [SPARK-2580] [PySpark] keep silent in worker if the JVM closes the socket During rdd.take(n), the JVM will close the socket once it has received enough data, so the Python worker should keep silent in this case. At the same time,
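The worker-side behaviour described above, sketched (illustrative, not the exact PySpark worker code): treat a write failure on the result socket as a deliberate hang-up and exit without noise.

```python
import sys

def write_results(results, outfile):
    try:
        for r in results:
            outfile.write(r)
        outfile.flush()
    except IOError:
        sys.exit(0)  # the JVM hung up on purpose (take(n)); stay silent
```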

git commit: [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle

2014-07-29 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-1.0 2693035ba -> e0bc72eb7 [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle Fix the problem of pickling operator.itemgetter with multiple indices. Author: Davies Liu davies@gmail.com Closes #1627 from davies/itemgetter and
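The trick behind serializing an itemgetter is to recover the indices it was constructed with by probing it; a sketch of the idea (cloudpickle's real code differs in detail):

```python
import operator

class _IndexProbe(object):
    """Records every index an itemgetter asks for."""
    def __init__(self):
        self.items = []
    def __getitem__(self, item):
        self.items.append(item)
        return item

def itemgetter_args(getter):
    probe = _IndexProbe()
    getter(probe)              # touches each index exactly once
    return tuple(probe.items)  # enough to rebuild: operator.itemgetter(*args)

assert itemgetter_args(operator.itemgetter(1, 3)) == (1, 3)
```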

git commit: [SPARK-2547]: The clustering documentation example provided for Spark 0.9....

2014-07-26 Thread joshrosen
Repository: spark Updated Branches: refs/heads/branch-0.9 c37db1537 -> 7e4a0e1a0 [SPARK-2547]: The clustering documentation example provided for Spark 0.9. I fixed a trivial mistake in the MLlib documentation. I checked that the Python sample code for k-means clustering can correctly
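For context, a minimal k-means example in the spirit of the corrected documentation, against the MLlib Python API of that era (toy data; the docs load points from a file instead):

```python
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "kmeans example")
data = sc.parallelize([array([0.0, 0.0]), array([1.0, 1.0]),
                       array([9.0, 8.0]), array([8.0, 9.0])])
model = KMeans.train(data, 2, maxIterations=10)
print(model.predict(array([0.5, 0.5])))  # cluster id of a new point
```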
