[jira] [Updated] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Idan Zalzberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Idan Zalzberg updated SPARK-26097:
--
Attachment: image (8).png

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Major
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows that the problem exists and which *stage* it belongs to, it's hard to 
> map it back to the original SQL query (e.g. which specific join operation is 
> actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent. This is a trivial change in code (less than 
> one line) that we believe can greatly help the investigation of performance 
> issues.






[jira] [Created] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Idan Zalzberg (JIRA)
Idan Zalzberg created SPARK-26097:
-

 Summary: Show partitioning details in DAG UI
 Key: SPARK-26097
 URL: https://issues.apache.org/jira/browse/SPARK-26097
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
Reporter: Idan Zalzberg


We run complex SQL queries using Spark SQL, and we often have to tackle join 
skew or an incorrect partition count. The problem is that while the Spark UI 
shows that the problem exists and which *stage* it belongs to, it's hard to 
map it back to the original SQL query (e.g. which specific join operation is 
actually skewed).
One way to resolve this is to relate the Exchange nodes in the DAG to the 
partitioning that they represent. This is a trivial change in code (less than 
one line) that we believe can greatly help the investigation of performance 
issues.
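For illustration, here is a minimal PySpark sketch (assuming Spark 2.x with a 
local session; the DataFrame and column names are made up) that forces a 
shuffle and prints the physical plan, where each Exchange already carries the 
partitioning detail this ticket proposes to surface on the DAG UI node:

{code}
# Hypothetical example: a tiny aggregation that forces a shuffle, so the
# physical plan contains an Exchange node together with its partitioning.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("exchange-partitioning-demo")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])

# groupBy repartitions by the grouping key, so explain() prints a line such as
# "Exchange hashpartitioning(id#0L, 200)"; the proposal is to show that same
# partitioning on the corresponding Exchange node in the DAG UI.
df.groupBy("id").count().explain()

spark.stop()
{code}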






[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification

2015-05-25 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558349#comment-14558349
 ] 

Idan Zalzberg commented on SPARK-5363:
--

Can't prove it's related to the same issue, but we have been experiencing hangs 
with BroadcastHashJoin, even though we use the Scala API.

I was unable to create a simple repro, but with a complicated SQL statement 
that joins multiple tables via BroadcastHashJoin, calling collect on the 
resulting RDD causes the Spark context to hang.

 Spark 1.2 freeze without error notification
 ---

 Key: SPARK-5363
 URL: https://issues.apache.org/jira/browse/SPARK-5363
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Tassilo Klein
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.2.2, 1.3.0, 1.4.0


 After a number of calls to a map().collect() statement Spark freezes without 
 reporting any error. Within the map a large broadcast variable is used.
 The freezing can be avoided by setting 'spark.python.worker.reuse = false' 
 (Spark 1.2) or using an earlier version, however, at the price of lower speed.
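For reference, a minimal sketch (assuming Spark 1.2+, where this setting 
exists) of applying the workaround quoted above when building the context:

{code}
# Sketch of the workaround from the description: disable Python worker reuse
# so every task gets a fresh worker process (slower, but avoids the freeze).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("worker-reuse-workaround")
        .set("spark.python.worker.reuse", "false"))
sc = SparkContext(conf=conf)
{code}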






[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2015-02-28 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341922#comment-14341922
 ] 

Idan Zalzberg commented on SPARK-3889:
--

Hi,
I am still getting the same error with Spark 1.2.1 (sporadically):
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
# 
#  SIGBUS (0x7) at pc=0x7ff5ed042220, pid=3694, tid=140692916811520
#
# JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build 1.7.0_55-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# v  ~StubRoutines::jint_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try ulimit -c unlimited before starting Java again
{noformat}

Should we re-open this one, or open a new ticket?

 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
 ---

 Key: SPARK-3889
 URL: https://issues.apache.org/jira/browse/SPARK-3889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical
 Fix For: 1.2.0


 Here's the first part of the core dump, possibly caused by a job which 
 shuffles a lot of very small partitions.
 {code}
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
 #
 # JRE version: 7.0_25-b30
 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 
 compressed oops)
 # Problematic frame:
 # v  ~StubRoutines::jbyte_disjoint_arraycopy
 #
 # Failed to write core dump. Core dumps have been disabled. To enable core 
 dumping, try ulimit -c unlimited before starting Java again
 #
 # If you would like to submit a bug report, please include
 # instructions on how to reproduce the bug and visit:
 #   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
 #
 ---  T H R E A D  ---
 Current thread (0x7fa4b0631000):  JavaThread Executor task launch 
 worker-170 daemon [_thread_in_Java, id=6783, 
 stack(0x7fa4448ef000,0x7fa4449f)]
 siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
 si_addr=0x7fa428f79000
 {code}
 Here is the only useful content I can find related to JVM and SIGBUS from 
 Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664
 It appears it may be related to disposing byte buffers, which we do in the 
 ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
 them in BufferMessage.






[jira] [Created] (SPARK-5319) Choosing partition size instead of count

2015-01-19 Thread Idan Zalzberg (JIRA)
Idan Zalzberg created SPARK-5319:


 Summary: Choosing partition size instead of count
 Key: SPARK-5319
 URL: https://issues.apache.org/jira/browse/SPARK-5319
 Project: Spark
  Issue Type: Brainstorming
Reporter: Idan Zalzberg


With the current API, there are multiple places where you can set the 
partition count when reading from sources.

However, in my experience it is sometimes more useful to set the partition 
size (in MB) and infer the count from that.
In my experience, Spark is sensitive to the partition size: if partitions are 
too big, the amount of memory needed per core goes up, and if they are too 
small, the stage times increase significantly. So I'd like to stay in the 
sweet spot of partition size without fiddling with the partition count until I 
find it.
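As a rough PySpark sketch of the logic one currently writes by hand (the HDFS 
path is hypothetical, and the size lookup goes through Spark's internal JVM 
gateway, so treat it as illustrative only):

{code}
# Derive a partition count from a target partition size instead of guessing
# the count directly -- what this ticket would like the API to support.
from pyspark import SparkContext

sc = SparkContext(appName="partition-size-demo")

input_path = "hdfs:///data/events"        # hypothetical input path
target_bytes = 128 * 1024 * 1024          # aim for roughly 128 MB per partition

# Total input size via the Hadoop FileSystem API (through the py4j gateway).
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
total_bytes = fs.getContentSummary(hadoop_fs.Path(input_path)).getLength()

num_partitions = max(1, int(-(-total_bytes // target_bytes)))  # ceil division
rdd = sc.textFile(input_path, minPartitions=num_partitions)
print("%d bytes -> %d partitions" % (total_bytes, num_partitions))
{code}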






[jira] [Created] (SPARK-5318) Add ability to control partition count in SparkSql

2015-01-19 Thread Idan Zalzberg (JIRA)
Idan Zalzberg created SPARK-5318:


 Summary: Add ability to control partition count in SparkSql
 Key: SPARK-5318
 URL: https://issues.apache.org/jira/browse/SPARK-5318
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Idan Zalzberg


When using Spark SQL, e.g. sqlContext.sql(...), Spark might need to read 
Hadoop files.
However, unlike the hadoopFile API, there is no documented way to set the 
minimum partition count when reading.
There is an undocumented way, though: setting mapred.map.tasks in the Hive conf.

I suggest we add a documented way to do the same thing (possibly with a better 
name).
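For the record, a rough sketch of the undocumented workaround (assuming a 
HiveContext and a hypothetical Hive table named events; whether the setting is 
honoured depends on the input format):

{code}
# Undocumented workaround described above: raise mapred.map.tasks so that
# Spark SQL's Hadoop reads produce at least that many splits/partitions.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="sql-partition-count")
hive_ctx = HiveContext(sc)

hive_ctx.sql("SET mapred.map.tasks=200")       # hint the minimum split count
result = hive_ctx.sql("SELECT * FROM events")  # 'events' is a hypothetical table
print(result.getNumPartitions())               # SchemaRDD is an RDD in Spark 1.2
{code}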







[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-29 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984070#comment-13984070
 ] 

Idan Zalzberg commented on SPARK-1394:
--

If you have an __init__.py that you are sure to go through, you can add the 
following code to it:

{code}
# PySpark adds a SIGCHLD signal handler, but that breaks other packages,
# so we remove it and restore the default disposition.
try:
    import signal
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
except Exception:
    pass
{code}

It's a workaround; it would be better to have a smarter signal handler that 
only handles the processes that are direct descendants of the daemon. I might 
try to get something like that out.
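A rough standalone sketch of that idea (not a patch against daemon.py; 
worker_pids is a placeholder for the set of pids the daemon actually forks):

{code}
# Sketch of a narrower handler: only reap pids the daemon itself forked, and
# leave children created by user code or libraries (e.g. via Popen) alone.
import os
import signal

worker_pids = set()   # the daemon would add each forked worker pid here

def handle_sigchld(signum, frame):
    for pid in list(worker_pids):
        try:
            reaped, _ = os.waitpid(pid, os.WNOHANG)
            if reaped != 0:
                worker_pids.discard(pid)
        except OSError:               # child already reaped or gone
            worker_pids.discard(pid)

signal.signal(signal.SIGCHLD, handle_sigchld)
{code}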

 calling system.platform on worker raises IOError
 

 Key: SPARK-1394
 URL: https://issues.apache.org/jira/browse/SPARK-1394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Tested on Ubuntu and Linux, local and remote master, 
 python 2.7.*
Reporter: Idan Zalzberg
  Labels: pyspark

 A simple program that calls system.platform() on the worker fails most of the 
 time (it works sometimes, but very rarely).
 This is critical since many libraries call that method (e.g. boto).
 Here is the trace of the attempt to call that method:
 $ /usr/local/spark/bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type help, copyright, credits or license for more information.
 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
 address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 18:18:38 INFO Remoting: Starting remoting
 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140402181839-919f
 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
 MB.
 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
 = ConnectionManagerId(10.33.102.46,43357)
 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
 block manager 10.33.102.46:43357 with 294.6 MB RAM
 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
 http://10.33.102.46:51803
 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
 http://10.33.102.46:4040
 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 0.9.0
   /_/
 Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
 Spark context available as sc.
 >>> import platform
 >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at stdin:1
 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at stdin:1) with 1 
 output partitions (allowLocal=false)
 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
 stdin:1)
 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
 collect at stdin:1), which has no missing parents
 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[1] at collect at stdin:1)
 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
 12 ms
 14/04/02 18:19:17 INFO Executor: Running task ID 0
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File 

[jira] [Created] (SPARK-1526) Running spark driver program from my local machine

2014-04-17 Thread Idan Zalzberg (JIRA)
Idan Zalzberg created SPARK-1526:


 Summary: Running spark driver program from my local machine
 Key: SPARK-1526
 URL: https://issues.apache.org/jira/browse/SPARK-1526
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Reporter: Idan Zalzberg


Currently it seems that the design choice is that the driver program should be 
close, network-wise, to the workers, and that connections can be created from 
either side.

This makes using Spark somewhat harder, since when I develop locally I not 
only have to package my program, but also all of its local dependencies.
Let's say I have a local DB with names of files in Hadoop that I want to 
process with Spark; now I need my local DB to be accessible from the cluster 
so it can fetch the file names at runtime.

The driver program is an awesome thing, but it loses some of its strength if 
you can't really run it from anywhere.

It seems to me that the problem is with the DAGScheduler that needs to be 
close to the workers; maybe it shouldn't be embedded in the driver then?





[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-04 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959735#comment-13959735
 ] 

Idan Zalzberg commented on SPARK-1394:
--

This seems to be related to the way the handle_sigchld method in daemon.py 
works.
In order to reap zombie processes, the worker calls os.waitpid on SIGCHLD. 
However, since Popen also eventually tries to do that itself, you end up with 
a closed handle.

Since platform.py is a standard Python library, I would guess we should find a 
solution in PySpark (i.e. change the way handle_sigchld works, or maybe limit 
the processes it waits on).
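A small standalone illustration of that collision (not Spark code; the 
behaviour of the final wait() differs between Python versions, but either way 
the real exit status is lost):

{code}
# A global SIGCHLD handler that reaps *any* child (roughly what daemon.py
# does) steals the exit status that subprocess.Popen expects to collect.
import os
import signal
import subprocess
import time

def reap_everything(signum, frame):
    try:
        os.waitpid(-1, os.WNOHANG)
    except OSError:
        pass

signal.signal(signal.SIGCHLD, reap_everything)

p = subprocess.Popen(["sleep", "0.1"])
time.sleep(0.5)                            # the child exits and gets reaped here
try:
    print("exit status: %s" % p.wait())    # Python 3 may report a bogus status
except OSError as e:
    print("Popen.wait() failed: %s" % e)   # Python 2 raises ECHILD instead
{code}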

 calling system.platform on worker raises IOError
 

 Key: SPARK-1394
 URL: https://issues.apache.org/jira/browse/SPARK-1394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Tested on Ubuntu and Linux, local and remote master, 
 python 2.7.*
Reporter: Idan Zalzberg
  Labels: pyspark

 A simple program that calls system.platform() on the worker fails most of the 
 time (it works sometimes, but very rarely).
 This is critical since many libraries call that method (e.g. boto).
 Here is the trace of the attempt to call that method:
 $ /usr/local/spark/bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type help, copyright, credits or license for more information.
 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
 address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 18:18:38 INFO Remoting: Starting remoting
 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140402181839-919f
 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
 MB.
 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
 = ConnectionManagerId(10.33.102.46,43357)
 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
 block manager 10.33.102.46:43357 with 294.6 MB RAM
 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
 http://10.33.102.46:51803
 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
 http://10.33.102.46:4040
 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 0.9.0
   /_/
 Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
 Spark context available as sc.
 >>> import platform
 >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at stdin:1
 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at stdin:1) with 1 
 output partitions (allowLocal=false)
 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
 stdin:1)
 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
 collect at stdin:1), which has no missing parents
 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[1] at collect at stdin:1)
 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
 12 ms
 14/04/02 18:19:17 INFO Executor: Running task ID 0
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /usr/local/spark/python/pyspark/worker.py, line 77, in main
 

[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-03 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959661#comment-13959661
 ] 

Idan Zalzberg commented on SPARK-1394:
--

It seems that the problem originates from pyspark capturing SIGCHLD in daemon.py
as described here: http://stackoverflow.com/a/3837851


 calling system.platform on worker raises IOError
 

 Key: SPARK-1394
 URL: https://issues.apache.org/jira/browse/SPARK-1394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Tested on Ubuntu and Linux, local and remote master, 
 python 2.7.*
Reporter: Idan Zalzberg
  Labels: pyspark

 A simple program that calls system.platform() on the worker fails most of the 
 time (it works sometimes, but very rarely).
 This is critical since many libraries call that method (e.g. boto).
 Here is the trace of the attempt to call that method:
 $ /usr/local/spark/bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type help, copyright, credits or license for more information.
 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
 address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 18:18:38 INFO Remoting: Starting remoting
 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140402181839-919f
 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
 MB.
 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
 = ConnectionManagerId(10.33.102.46,43357)
 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
 block manager 10.33.102.46:43357 with 294.6 MB RAM
 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
 http://10.33.102.46:51803
 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
 http://10.33.102.46:4040
 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/__ / .__/\_,_/_/ /_/\_\   version 0.9.0
   /_/
 Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
 Spark context available as sc.
 >>> import platform
 >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at stdin:1
 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at stdin:1) with 1 
 output partitions (allowLocal=false)
 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
 stdin:1)
 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
 collect at stdin:1), which has no missing parents
 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[1] at collect at stdin:1)
 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
 12 ms
 14/04/02 18:19:17 INFO Executor: Running task ID 0
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /usr/local/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /usr/local/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /usr/local/spark/python/pyspark/serializers.py, line 117, in 
 dump_stream
 for