Please provide your jstack info.
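For reference, a typical way to capture a full thread dump of a client-mode driver (the pid-discovery step via `jps` and the output filename are assumptions, not from the thread):

```shell
# Find the driver JVM's pid (the spark-submit / app main class process).
jps -lm

# Full thread dump, including locked ownable synchronizers.
jstack -l <driver-pid> > driver-jstack.txt

# If the JVM is unresponsive, force a dump (may be less detailed).
jstack -F <driver-pid> > driver-jstack.txt
```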
------------------ Original Message ------------------
From: "dhruve ashar" <dhruveas...@gmail.com>
Date: July 13, 2016 (Wed) 3:53 PM
To: "Anton Sviridov" <keyn...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Spark hangs at "Removed broadcast_*"

Looking at the jstack, it seems that it doesn't contain all the threads; I cannot find the main thread in it. I am not an expert at analyzing jstacks, but are you creating any threads in your code, and shutting them down correctly? This one is non-daemon and doesn't seem to be coming from Spark:

"Scheduler-2144644334" #110 prio=5 os_prio=0 tid=0x00007f8104001800 nid=0x715 waiting on condition [0x00007f812cf95000]

Also, does the shutdown hook get called?

On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov <keyn...@gmail.com> wrote:

Hi. Here are the last few lines before it starts removing broadcasts:

16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003209_20886' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003209
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003209_20886: Committed
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 20888) in 95 ms on localhost (3209/3214)
16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 20886) in 103 ms on localhost (3210/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003208_20885' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003208
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003208_20885: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 20885) in 109 ms on localhost (3211/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003212_20889' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003212
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003212_20889: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 20889) in 84 ms on localhost (3212/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003210_20887' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003210
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003210_20887: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 20887) in 100 ms on localhost (3213/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 'attempt_201607111123_0009_m_003213_20890' to file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003213
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: attempt_201607111123_0009_m_003213_20890: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 20890) in 82 ms on localhost (3214/3214)
16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at SfCountsDumper.scala:13) finished in 42.294 s
16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at SfCountsDumper.scala:13, took 9517.124624 s
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
16/07/11 14:28:46 INFO BlockManager: Removing RDD 14
16/07/11 14:28:46 INFO ContextCleaner: Cleaned RDD 14
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.101.230.154:35192 in memory (size: 25.5 KB, free: 37.1 GB)
...
In fact, the job is still running: Spark's UI shows an uptime of 20.6 hours, with the last job having finished at least 18 hours ago.

On Mon, 11 Jul 2016 at 23:23 dhruve ashar <dhruveas...@gmail.com> wrote:

Hi,

Can you check from the logs the time when the job actually finished? The logs provided are too short and do not reveal meaningful information.

On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime <keyn...@gmail.com> wrote:

Spark 2.0.0-preview

We've got an app that uses a fairly big broadcast variable. We run it on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]].

At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from the .crc files still being there), BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY

My last run lasted for 12 hours after doing saveAsTextFile - just sitting there. I did a jstack on the driver process; most threads are parked: http://pastebin.com/E29JKVT7

Full story: we used this code with Spark 1.5.0 and it worked, but then the data changed and something stopped fitting into Kryo's serialisation buffer. Increasing the buffer didn't help, so I had to disable the KryoSerializer. Tested it again - it hung. Switched to 2.0.0-preview - same issue, it seems.

I'm not quite sure what's even going on, given that there's almost no CPU activity and no output in the logs, yet the output is not finalised like it used to be.

Would appreciate any help, thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
-Dhruve Ashar
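For what it's worth, a minimal driver-side sketch of the pattern the replies suggest - releasing the broadcast explicitly and stopping the SparkContext so that no non-daemon threads keep the JVM alive after the job. Class and variable names here are hypothetical, not from the original code; the commented-out setting is Spark's standard `spark.kryoserializer.buffer.max` knob, an alternative to disabling Kryo outright:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SfCountsJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("RunWikistatsSFCounts")
      // If Kryo is re-enabled, raise the buffer ceiling instead of disabling it:
      // .set("spark.kryoserializer.buffer.max", "512m")
    val sc = new SparkContext(conf)
    try {
      // Placeholder for the real lookup table broadcast by the app.
      val bigMap: Map[String, Array[String]] = Map.empty
      val bcast = sc.broadcast(bigMap)

      // ... job body: transformations using bcast.value, then saveAsTextFile ...

      bcast.unpersist(blocking = true) // drop cached copies on executors eagerly
      bcast.destroy()                  // release all state for this broadcast
    } finally {
      sc.stop() // let Spark's shutdown path run so the JVM can actually exit
    }
  }
}
```

Explicitly unpersisting avoids leaving the cleanup to the asynchronous ContextCleaner, which is where the log output above stalls; `sc.stop()` in a `finally` also makes the shutdown-hook question from the thread moot.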