[jira] [Updated] (SPARK-26097) Show partitioning details in DAG UI
[ https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Idan Zalzberg updated SPARK-26097:
----------------------------------
    Attachment: image (8).png

> Show partitioning details in DAG UI
> -----------------------------------
>
>                 Key: SPARK-26097
>                 URL: https://issues.apache.org/jira/browse/SPARK-26097
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Idan Zalzberg
>            Priority: Major
>         Attachments: image (8).png
>
> We run complex SQL queries with Spark SQL, and we often have to tackle a join skew or an incorrect partition count. The problem is that while the Spark UI shows that the problem exists and which *stage* it belongs to, it is hard to trace it back to the original SQL query (e.g. which specific join operation is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the partitioning they represent. This is a trivial change in code (less than one line) that we believe can greatly help the investigation of performance issues.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26097) Show partitioning details in DAG UI
Idan Zalzberg created SPARK-26097:
-------------------------------------

             Summary: Show partitioning details in DAG UI
                 Key: SPARK-26097
                 URL: https://issues.apache.org/jira/browse/SPARK-26097
             Project: Spark
          Issue Type: Improvement
          Components: Web UI
    Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
            Reporter: Idan Zalzberg

We run complex SQL queries with Spark SQL, and we often have to tackle a join skew or an incorrect partition count. The problem is that while the Spark UI shows that the problem exists and which *stage* it belongs to, it is hard to trace it back to the original SQL query (e.g. which specific join operation is actually skewed).
One way to resolve this is to relate the Exchange nodes in the DAG to the partitioning they represent. This is a trivial change in code (less than one line) that we believe can greatly help the investigation of performance issues.
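[Editor's note] The proposal amounts to including the partitioning in the Exchange node's display name, much as the textual plan already does. The actual change would live in Spark's Scala Exchange operator; the following is only an illustrative Python sketch of the label format, with all names hypothetical:

```python
def exchange_label(partitioning_kind, expressions, num_partitions):
    """Build a DAG-UI node label exposing an Exchange's partitioning.

    Mirrors the style of Spark's plan strings, e.g.
    'Exchange hashpartitioning(user_id, 200)'. Hypothetical helper,
    not a Spark API.
    """
    # Join the partitioning expressions with the partition count,
    # the way hashpartitioning(...) is rendered in EXPLAIN output.
    parts = ", ".join(list(expressions) + [str(num_partitions)])
    return "Exchange %s(%s)" % (partitioning_kind, parts)

print(exchange_label("hashpartitioning", ["user_id"], 200))
```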
[jira] [Commented] (SPARK-5363) Spark 1.2 freeze without error notification
[ https://issues.apache.org/jira/browse/SPARK-5363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558349#comment-14558349 ]

Idan Zalzberg commented on SPARK-5363:
--------------------------------------

I can't prove it's related to the same issue, but we have been experiencing hangs with BroadcastHashJoin even though we use the Scala API. I was unable to create a simple repro, but in a complicated SQL statement that joins multiple tables with BroadcastHashJoin, calling collect on the RDD causes the Spark context to hang.

> Spark 1.2 freeze without error notification
> -------------------------------------------
>
>                 Key: SPARK-5363
>                 URL: https://issues.apache.org/jira/browse/SPARK-5363
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Tassilo Klein
>            Assignee: Davies Liu
>            Priority: Blocker
>             Fix For: 1.2.2, 1.3.0, 1.4.0
>
> After a number of calls to a map().collect() statement, Spark freezes without reporting any error. Within the map a large broadcast variable is used. The freeze can be avoided by setting 'spark.python.worker.reuse = false' (Spark 1.2) or by using an earlier version, however at the price of low speed.
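[Editor's note] The workaround cited in the quoted description can be applied at submit time without code changes. A sketch, assuming a PySpark application `app.py` (the application name is hypothetical):

```shell
# Disable Python worker reuse -- the Spark 1.2 workaround mentioned in the
# report. Expect slower jobs: a fresh Python worker is forked per task batch.
spark-submit --conf spark.python.worker.reuse=false app.py
```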
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341922#comment-14341922 ]

Idan Zalzberg commented on SPARK-3889:
--------------------------------------

Hi,
I am still getting the same error with Spark 1.2.1 (sporadically):
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7ff5ed042220, pid=3694, tid=140692916811520
#
# JRE version: Java(TM) SE Runtime Environment (7.0_55-b13) (build 1.7.0_55-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.55-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v  ~StubRoutines::jint_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
{noformat}
Should we re-open this one, or open a new ticket?

> JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
> ---------------------------------------------------------------
>
>                 Key: SPARK-3889
>                 URL: https://issues.apache.org/jira/browse/SPARK-3889
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>            Priority: Critical
>             Fix For: 1.2.0
>
> Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions.
> {code}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
> #
> # JRE version: 7.0_25-b30
> # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # v  ~StubRoutines::jbyte_disjoint_arraycopy
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please include
> # instructions on how to reproduce the bug and visit:
> #   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
> #
> --------------- T H R E A D ---------------
> Current thread (0x7fa4b0631000): JavaThread "Executor task launch worker-170" daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)]
> siginfo: si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000
> {code}
> Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664
> It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage.
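[Editor's note] Both crash reports above note "Failed to write core dump. Core dumps have been disabled." To capture a core on the next occurrence, the limit can be raised in the shell that launches the worker JVM, exactly as the JVM hint suggests:

```shell
# Allow unlimited-size core files in this shell before starting Java;
# a subsequent SIGBUS crash will then leave a core file for inspection
# (e.g. with gdb or jstack -F).
ulimit -c unlimited
```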
[jira] [Created] (SPARK-5319) Choosing partition size instead of count
Idan Zalzberg created SPARK-5319:
------------------------------------

             Summary: Choosing partition size instead of count
                 Key: SPARK-5319
                 URL: https://issues.apache.org/jira/browse/SPARK-5319
             Project: Spark
          Issue Type: Brainstorming
            Reporter: Idan Zalzberg

With the current API, there are multiple places where you can set the partition count when reading from sources. However, in my experience it is sometimes more useful to set the partition size (in MB) and infer the count from that.
In my experience, Spark is sensitive to the partition size: if partitions are too big, the amount of memory needed per core rises, and if they are too small the stage times increase significantly. So I'd like to stay in the sweet spot of partition size without tuning the partition count by trial and error.
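[Editor's note] The proposal can be sketched as a helper that derives a partition count from a target partition size. A minimal illustration, not a Spark API; the 128 MB default is an assumption:

```python
def partitions_for_size(total_size_bytes, target_mb=128):
    """Infer a partition count from a desired partition size in MB."""
    target_bytes = target_mb * 1024 * 1024
    # Ceiling division: enough partitions that each holds at most
    # target_mb of input, and never fewer than one partition.
    return max(1, -(-total_size_bytes // target_bytes))

# e.g. a 1 GiB input at the 128 MB target yields 8 partitions
print(partitions_for_size(1 << 30))
```

A caller could then feed the result into the existing count-based APIs, e.g. `sc.textFile(path, minPartitions=partitions_for_size(input_size))`, where `input_size` would come from the storage layer.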
[jira] [Created] (SPARK-5318) Add ability to control partition count in SparkSql
Idan Zalzberg created SPARK-5318:
------------------------------------

             Summary: Add ability to control partition count in SparkSql
                 Key: SPARK-5318
                 URL: https://issues.apache.org/jira/browse/SPARK-5318
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Idan Zalzberg

When using SparkSql, e.g. sqlContext.sql(...), Spark might need to read Hadoop files. However, unlike the hadoopFile API, there is no documented way to set the minimal partition count when reading.
There is an undocumented way, though: setting mapred.map.tasks in the Hive conf.
I suggest we add a documented way to do it that works in exactly the same way (possibly with a better name).
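[Editor's note] A hedged sketch of the undocumented workaround described above, passing the Hive conf at the command line; the table name `logs` is hypothetical and the exact effect depends on the input format:

```shell
# Raise the minimum map-task count in the Hive conf, which SparkSQL
# consults when splitting Hadoop inputs (undocumented behavior per
# the report above).
spark-sql --hiveconf mapred.map.tasks=400 -e "SELECT count(*) FROM logs"
```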
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984070#comment-13984070 ]

Idan Zalzberg commented on SPARK-1394:
--------------------------------------

If you have an __init__.py that you are sure to go through, you can add the following code to it:
{code}
# PySpark adds a SIGCHLD signal handler, but that breaks other packages
# (anything that waits on its own subprocesses), so restore the default.
try:
    import signal
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
except (ImportError, ValueError):
    # ValueError is raised when this runs outside the main thread,
    # where signal handlers cannot be installed.
    pass
{code}
It's a workaround; it would be better to have a smart signal handler that only reaps the processes that are direct descendants of the daemon. I might try to get something like that out.

> calling system.platform on worker raises IOError
> ------------------------------------------------
>
>                 Key: SPARK-1394
>                 URL: https://issues.apache.org/jira/browse/SPARK-1394
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0
>         Environment: Tested on Ubuntu and Linux, local and remote master, python 2.7.*
>            Reporter: Idan Zalzberg
>              Labels: pyspark
>
> A simple program that calls system.platform() on the worker fails most of the time (it works sometimes, but very rarely). This is critical since many libraries call that method (e.g. boto).
> Here is the trace of the attempt to call that method:
> $ /usr/local/spark/bin/pyspark
> Python 2.7.3 (default, Feb 27 2014, 20:00:17)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
> 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
> 14/04/02 18:18:38 INFO Remoting: Starting remoting
> 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.33.102.46:36640]
> 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
> 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140402181839-919f
> 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 MB.
> 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id = ConnectionManagerId(10.33.102.46,43357)
> 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
> 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 10.33.102.46:43357 with 294.6 MB RAM
> 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at http://10.33.102.46:51803
> 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
> 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
> 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
> 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at http://10.33.102.46:4040
> 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> [Spark ASCII-art welcome banner, version 0.9.0]
> Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
> Spark context available as sc.
> >>> import platform
> >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
> 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at <stdin>:1
> 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 1 output partitions (allowLocal=false)
> 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at <stdin>:1)
> 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at collect at <stdin>:1), which has no missing parents
> 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[1] at collect at <stdin>:1)
> 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
> 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 12 ms
> 14/04/02 18:19:17 INFO Executor: Running task ID 0
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File
[jira] [Created] (SPARK-1526) Running spark driver program from my local machine
Idan Zalzberg created SPARK-1526:
------------------------------------

             Summary: Running spark driver program from my local machine
                 Key: SPARK-1526
                 URL: https://issues.apache.org/jira/browse/SPARK-1526
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
            Reporter: Idan Zalzberg

Currently it seems that the design choice is that the driver program should be close network-wise to the workers, and that connections may be created from either side.
This makes using Spark somewhat harder, since when I develop locally I need to package not only my program but also all its local dependencies. Say I have a local DB with names of files in HADOOP that I want to process with Spark; now my local DB needs to be accessible from the cluster so it can fetch the file names at runtime.
The driver program is an awesome thing, but it loses some of its strength if you can't really run it anywhere.
It seems to me that the problem is with the DAGScheduler, which needs to be close to the workers; maybe it shouldn't be embedded in the driver then?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959735#comment-13959735 ]

Idan Zalzberg commented on SPARK-1394:
--------------------------------------

This seems to be related to the way the handle_sigchld method in daemon.py works. In order to reap zombie processes, the worker calls os.waitpid on SIGCHLD; however, since using Popen also eventually tries to do that, you get a closed handle.
Since platform.py is a native library, I would guess we should find a solution in pyspark (i.e. change the way handle_sigchld works, or maybe limit the processes it waits on).

> calling system.platform on worker raises IOError
> ------------------------------------------------
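[Editor's note] The comment above suggests limiting the processes the SIGCHLD handler waits on. A sketch of such a "smarter" handler; this is hypothetical illustration, not Spark's actual daemon.py, and the set name is invented:

```python
import errno
import os
import signal

# PIDs the daemon itself forked; only these are reaped here, so that
# subprocess.Popen children owned by user code keep their exit status
# available for Popen.wait().
DAEMON_CHILDREN = set()

def handle_sigchld(signum, frame):
    # Non-blocking reap of each known child; unknown children are
    # deliberately left alone.
    for pid in list(DAEMON_CHILDREN):
        try:
            reaped, _status = os.waitpid(pid, os.WNOHANG)
        except OSError as err:
            if err.errno == errno.ECHILD:  # already reaped elsewhere
                DAEMON_CHILDREN.discard(pid)
            continue
        if reaped == pid:  # child exited and has now been reaped
            DAEMON_CHILDREN.discard(pid)
```

The daemon would add each worker PID to DAEMON_CHILDREN right after os.fork() and install the handler with signal.signal(signal.SIGCHLD, handle_sigchld).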
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959661#comment-13959661 ]

Idan Zalzberg commented on SPARK-1394:
--------------------------------------

It seems that the problem originates from pyspark capturing SIGCHLD in daemon.py, as described here: http://stackoverflow.com/a/3837851

> calling system.platform on worker raises IOError
> ------------------------------------------------