Do these stack traces come from the stuck node? It looks like it's waiting
on data in BlockFetcherIterator, i.e. waiting for data from another node.
But you say all the other nodes were already done? Very curious.

Maybe you could try turning on debug logging, and try to figure out what
happens in BlockFetcherIterator (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala).
I do not think it is supposed to get stuck indefinitely.
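
For example (assuming the stock log4j setup that Spark 0.9.x ships with), a
line like this in whichever log4j.properties your executors pick up should
surface the fetch activity for that class:

  log4j.logger.org.apache.spark.storage.BlockFetcherIterator=DEBUG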

On Tue, Jun 10, 2014 at 8:22 PM, Hurwitz, Daniel <dhurw...@ebay.com> wrote:

>  Hi,
>
>
>
> We are observing a recurring issue where our Spark jobs are hanging for
> several hours, even days, until we kill them.
>
>
>
> We are running Spark v0.9.1 over YARN.
>
>
>
> Our input is a list of edges of a graph on which we use Bagel to compute
> connected components using the following method:
>
>
>
> class CCMessage(var targetId: Long, var myComponentId: Long)
>   extends Message[Long] with Serializable
>
> def compute(self: CC, msgs: Option[Array[CCMessage]], superstep: Int):
>     (CC, Array[CCMessage]) = {
>   val smallestComponentId =
>     msgs.map(sq => sq.map(_.myComponentId).min).getOrElse(Long.MaxValue)
>   val newComponentId = math.min(self.clusterID, smallestComponentId)
>   val halt = (newComponentId == self.clusterID) || (superstep >= maxIters)
>   self.active = if (superstep == 0) true else !halt
>   val outGoingMessages =
>     if (halt && superstep > 0) Array[CCMessage]()
>     else self.edges.map(targetId => new CCMessage(targetId, newComponentId)).toArray
>   self.clusterID = newComponentId
>
>   (self, outGoingMessages)
> }
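>
> For reference, here is a simplified sketch of how a compute method like
> this plugs into Bagel and saveAsTextFile; the CC fields, maxIters value,
> partition count, and paths below are illustrative placeholders rather than
> our exact code:
>
>   import org.apache.spark.SparkContext
>   import org.apache.spark.SparkContext._
>   import org.apache.spark.bagel.{Bagel, Message, Vertex}
>
>   // Illustrative vertex class matching the fields compute() uses.
>   class CC(var clusterID: Long, val edges: Array[Long], var active: Boolean)
>     extends Vertex with Serializable
>
>   val maxIters = 30                                        // illustrative cap
>   val sc = new SparkContext("yarn-client", "ConnectedComponents")
>
>   // Build one CC vertex per node from a tab-separated edge list.
>   val vertices = sc.textFile("hdfs:///path/to/edges")      // illustrative path
>     .map(_.split("\t"))
>     .map(a => (a(0).toLong, a(1).toLong))
>     .groupByKey()
>     .map { case (id, nbrs) => (id, new CC(id, nbrs.toArray, true)) }
>
>   // Superstep 0 starts with no messages; each vertex then sends out its
>   // own component ID to its neighbours.
>   val initialMessages = sc.parallelize(Seq.empty[(Long, CCMessage)])
>
>   // Default combiner and hash partitioner; 200 partitions is illustrative.
>   val result = Bagel.run(sc, vertices, initialMessages, 200)(compute)
>
>   // One output line per component, listing its node IDs.
>   result.map { case (id, v) => (v.clusterID, id) }
>     .groupByKey()
>     .map { case (_, ids) => ids.mkString(" ") }
>     .saveAsTextFile("hdfs:///path/to/components")          // illustrative path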
>
>
>
> Our output is a text file in which each line is a list of the node IDs in
> each component. The size of the output may be up to 6 GB.
>
>
>
> We see in the job tracker that jobs most often get stuck on the
> “saveAsTextFile” command, the final line in our code. In some cases, the
> job hangs during one of the Bagel iterations while computing the connected
> components.
>
>
>
> Oftentimes, when we kill the job and re-execute it, it finishes
> successfully within an hour, which is the expected duration. We have
> noticed that if a Spark job has not finished after a few hours, it will
> never finish on its own, regardless of the load on our cluster, and has to
> be killed.
>
>
>
> We consulted our Hadoop support team about one particular Spark job that
> had been hanging for 38 hours. They found that the Spark processes on all
> nodes had completed except for one node, whose process had been running for
> more than 9 hours consuming very little CPU; it suddenly consumed about 14
> seconds of CPU time and then went quiet again. The other nodes did not
> relinquish their resources until our Hadoop admin killed the process on
> that problematic node, at which point the job suddenly finished and
> “success” was reported in the job tracker. The output seemed to be fine
> too. The Hadoop admin suggested this was a Spark issue and, in case it
> helps you understand the problem, sent us two stack dumps which I have
> attached to this email: one taken before killing that node’s Spark process
> (dump1.txt) and one after (dump2.txt).
>
>
>
> Any advice on how to resolve this issue? How can we debug this?
>
>  Thanks,
>
> ~Daniel
>
>
>
