Re: Issue running Spark 1.4 on Yarn
Hi Kevin,

I never did. I checked for free space in the root partition; I don't think that was the issue. Now that 1.4 is officially out, I'll probably give it another shot.

On Jun 22, 2015 4:28 PM, Kevin Markey kevin.mar...@oracle.com wrote:

Matt: Did you ever resolve this issue? When running on a cluster or pseudo-cluster with too little space for /tmp or /var files, we've seen this sort of behavior. There's enough memory and enough HDFS space, but there's insufficient space on one or more nodes for other temporary files as logs grow and don't get cleared or deleted. It depends on your configuration. Often restarting will temporarily fix things, but for shorter and shorter periods of time, until nothing works. The fix is to expand the space available for logs, prune them (for instance with a periodic cron job), and/or modify the limits on log size.

Kevin

On 06/09/2015 04:15 PM, Matt Kapilevich wrote:

I've tried running a Hadoop app pointing to the same queue. Same thing now: the job doesn't get accepted. I've cleared out the queue and killed all the pending jobs, but the queue is still unusable. It seems like an issue with YARN, but it's specifically Spark that leaves the queue in this state. I ran a Hadoop job in a for loop 10x, specifying the queue explicitly, just to double-check.
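Kevin's suggestion of pruning logs on a schedule, so /tmp and /var don't fill up, could be sketched as follows. The log path and the 7-day retention window are assumptions, not from the thread; the demo operates on a scratch directory so it can be run safely anywhere.

```shell
# Sketch: delete YARN logs older than a retention window.
# LOG_DIR defaults to a scratch directory for the demo; point it at your
# real log directory (e.g. /var/log/hadoop-yarn -- path is an assumption).
LOG_DIR="${LOG_DIR:-/tmp/yarn-log-prune-demo}"
RETENTION_DAYS=7
mkdir -p "$LOG_DIR"
touch -t 201501010000 "$LOG_DIR/yarn-old.log"   # stale log, should be removed
touch "$LOG_DIR/yarn-current.log"               # fresh log, must survive
# Remove anything older than the retention window.
find "$LOG_DIR" -type f -name '*.log' -mtime +"$RETENTION_DAYS" -delete
ls "$LOG_DIR"
# A crontab entry to do this nightly at 02:00 might look like:
# 0 2 * * * find /var/log/hadoop-yarn -type f -name '*.log*' -mtime +7 -delete
```

The same `find` invocation works from cron; only the directory and retention need adjusting for a real install.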
Re: Issue running Spark 1.4 on Yarn
No, this is just a random queue name I picked when submitting the job; there's no specific configuration for it. I'm not logged in, so I don't have the default fair scheduler configuration in front of me, but I don't think that's the problem. The cluster is completely idle, with no jobs being executed, so it can't be hitting any of the fair scheduler's limits.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23274.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Issue running Spark 1.4 on Yarn
Hello,

Since the other queues are fine, I reckon there may be a limit on the maximum applications or memory for this queue in particular. I don't suspect the fair scheduler's global limits either, but on this queue we may be hitting a maximum. Could you try to get the configs for the queue? That should provide more context.

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23285.html
Re: Issue running Spark 1.4 on Yarn
Hi nsalian,

For some reason the rest of this thread isn't showing up here. The NodeManager isn't busy. I'll copy/paste; the details are in there.

I've tried running a Hadoop app pointing to the same queue. Same thing now: the job doesn't get accepted. I've cleared out the queue and killed all the pending jobs, but the queue is still unusable. It seems like an issue with YARN, but it's specifically Spark that leaves the queue in this state. I ran a Hadoop job in a for loop 10x, specifying the queue explicitly, just to double-check.

On Tue, Jun 9, 2015 at 4:45 PM, Matt Kapilevich matve...@gmail.com wrote:

From the RM scheduler, I see 3 applications currently stuck in the root.thequeue queue.

Used Resources: memory:0, vCores:0
Num Active Applications: 0
Num Pending Applications: 3
Min Resources: memory:0, vCores:0
Max Resources: memory:6655, vCores:4
Steady Fair Share: memory:1664, vCores:0
Instantaneous Fair Share: memory:6655, vCores:0

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23258.html
Re: Issue running Spark 1.4 on Yarn
Hi,

Thanks for the added information; it helps add more context. Is that specific queue configured differently from the others? FairScheduler.xml should have the information needed, or a separate allocations.xml if you have one. Something of this format:

<allocations>
  <queue name="sample_queue">
    <minResources>10000 mb,0vcores</minResources>
    <maxResources>90000 mb,0vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <maxAMShare>0.1</maxAMShare>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <queue name="sample_sub_queue">
      <aclSubmitApps>charlie</aclSubmitApps>
      <minResources>5000 mb,0vcores</minResources>
    </queue>
  </queue>
</allocations>

Thank you.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23261.html
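Besides reading the allocations file off disk, the queue limits the scheduler is actually enforcing can be read from the ResourceManager's REST API, which exists in Hadoop 2.6. This is a sketch: the host and port are assumptions (8088 is the usual RM web port), and the demo only constructs and records the URL so it runs without a cluster.

```shell
# Build the scheduler-state URL for the RM REST API.
# RM_HOST is an assumption -- substitute your ResourceManager host:port.
RM_HOST="${RM_HOST:-localhost:8088}"
URL="http://$RM_HOST/ws/v1/cluster/scheduler"
echo "$URL" > /tmp/rm-scheduler-url.txt
cat /tmp/rm-scheduler-url.txt
# On a live cluster you would fetch and pretty-print it, e.g.:
#   curl -s "$URL" | python -m json.tool
```

The JSON response includes per-queue minResources, maxResources, and fair-share figures, which should match what the RM scheduler UI shows.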
Re: Issue running Spark 1.4 on Yarn
If your application is stuck in that state, it generally means your cluster doesn't have enough resources to start it. In the RM logs you can see how many vcores / memory the application is asking for, and then you can check your RM configuration to see if that's currently available on any single NM.

On Tue, Jun 9, 2015 at 7:56 AM, Matt Kapilevich matve...@gmail.com wrote:

Hi all,

I'm manually building Spark from source against the 1.4 branch and submitting the job against YARN. I am seeing very strange behavior. The first 2 or 3 times I submit the job, it runs fine, computes Pi, and exits. The next time I run it, it gets stuck in the ACCEPTED state.

I'm kicking off a job using yarn-client mode like this:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue thequeue examples/target/scala-2.10/spark-examples*.jar 10

Here's what ResourceManager shows: [image: Yarn ResourceManager UI]

In the YARN ResourceManager logs, all I'm seeing is this:

2015-06-08 14:49:57,166 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1433789077942_0004_01 to scheduler from user: root
2015-06-08 14:49:57,166 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1433789077942_0004_01 State change from SUBMITTED to SCHEDULED

There's nothing in the NodeManager logs (though it's up and running); the job isn't getting that far. It seems to me that there's an issue somewhere in the Spark 1.4 and YARN integration. Hadoop runs without any issues; I've run the below multiple times:

yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.4.2.jar pi 16 100

For reference, I'm compiling the source against the 1.4 branch and running it on a single-node cluster with CDH 5.4 and Hadoop 2.6, distributed mode. I am using the following to compile:

mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

Any help appreciated.

Thanks,
-Matt

--
Marcelo
Re: Issue running Spark 1.4 on Yarn
Hi Marcelo,

Thanks. I think something more subtle is happening. I'm running a single-node cluster, so there's only 1 NM. When I executed the exact same job the 4th time, the cluster was idle and nothing else was being executed. The RM currently reports that I have 6.5GB of memory and 4 CPUs available. However, the job is still stuck in the ACCEPTED state a day later. Like I mentioned earlier, I'm able to execute Hadoop jobs fine even now; this problem is specific to Spark.

Thanks,
-Matt

On Tue, Jun 9, 2015 at 12:32 PM, Marcelo Vanzin van...@cloudera.com wrote:

If your application is stuck in that state, it generally means your cluster doesn't have enough resources to start it. In the RM logs you can see how many vcores / memory the application is asking for, and then you can check your RM configuration to see if that's currently available on any single NM.

--
Marcelo
Re: Issue running Spark 1.4 on Yarn
Yes! If I either specify a different queue or don't specify a queue at all, it works.

On Tue, Jun 9, 2015 at 4:25 PM, Marcelo Vanzin van...@cloudera.com wrote:

Does it work if you don't specify a queue?

--
Marcelo
Re: Issue running Spark 1.4 on Yarn
From the RM scheduler, I see 3 applications currently stuck in the root.thequeue queue.

Used Resources: memory:0, vCores:0
Num Active Applications: 0
Num Pending Applications: 3
Min Resources: memory:0, vCores:0
Max Resources: memory:6655, vCores:4
Steady Fair Share: memory:1664, vCores:0
Instantaneous Fair Share: memory:6655, vCores:0

On Tue, Jun 9, 2015 at 4:30 PM, Matt Kapilevich matve...@gmail.com wrote:

Yes! If I either specify a different queue or don't specify a queue at all, it works.
Re: Issue running Spark 1.4 on Yarn
Apologies, I see you already posted everything from the RM logs that mentions your stuck app.

Have you tried restarting the YARN cluster to see if that changes anything? Does it go back to the "first few tries work" behaviour? I run 1.4 on top of CDH 5.4 pretty often and haven't seen anything like this.

On Tue, Jun 9, 2015 at 1:01 PM, Marcelo Vanzin van...@cloudera.com wrote:

That doesn't necessarily mean anything. Spark apps have different resource requirements than Hadoop apps. Check your RM logs for any line that mentions your Spark app id. That may give you some insight into what's happening or not.

--
Marcelo
Re: Issue running Spark 1.4 on Yarn
Does it work if you don't specify a queue?

On Tue, Jun 9, 2015 at 1:21 PM, Matt Kapilevich matve...@gmail.com wrote:

Hi Marcelo,

Yes, restarting YARN fixes this behavior and it again works the first few times. The only thing that's consistent is that once Spark job submissions stop working, it's broken for good.

--
Marcelo
Re: Issue running Spark 1.4 on Yarn
Hi Marcelo,

Yes, restarting YARN fixes this behavior and it again works the first few times. The only thing that's consistent is that once Spark job submissions stop working, it's broken for good.

On Tue, Jun 9, 2015 at 4:12 PM, Marcelo Vanzin van...@cloudera.com wrote:

Apologies, I see you already posted everything from the RM logs that mentions your stuck app. Have you tried restarting the YARN cluster to see if that changes anything? Does it go back to the "first few tries work" behaviour? I run 1.4 on top of CDH 5.4 pretty often and haven't seen anything like this.
Re: Issue running Spark 1.4 on Yarn
On Tue, Jun 9, 2015 at 11:31 AM, Matt Kapilevich matve...@gmail.com wrote:

Like I mentioned earlier, I'm able to execute Hadoop jobs fine even now - this problem is specific to Spark.

That doesn't necessarily mean anything. Spark apps have different resource requirements than Hadoop apps. Check your RM logs for any line that mentions your Spark app id. That may give you some insight into what's happening or not.

--
Marcelo
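Marcelo's check can be sketched as a grep over the RM logs. The attempt id below is the one quoted earlier in the thread; the real log path is an assumption, so the demo greps a small sample file (its lines mimic the quoted RM log format) and only comments the on-cluster form.

```shell
# Pull every RM log line that mentions the stuck application attempt.
APP_ID="appattempt_1433789077942_0004"
# Sample log file standing in for the real RM log (format mimics the thread).
cat > /tmp/rm-demo.log <<'EOF'
2015-06-08 14:49:57,166 INFO FairScheduler: Added Application Attempt appattempt_1433789077942_0004_01 to scheduler from user: root
2015-06-08 14:49:58,201 INFO FairScheduler: Added Application Attempt appattempt_1433789077942_0005_01 to scheduler from user: root
EOF
grep "$APP_ID" /tmp/rm-demo.log
# On a real cluster (log path is an assumption; adjust for your install):
#   grep -r "$APP_ID" /var/log/hadoop-yarn/
```

Lines showing the requested memory and vcores for the attempt, or the absence of any scheduling activity after SCHEDULED, are what to look for.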
Re: Issue running Spark 1.4 on Yarn
I see the other jobs SUCCEEDED without issues. Could you snapshot the FairScheduler activity as well?

My guess is that, with the single core, the job is reaching a NodeManager that is still busy with other jobs, and it ends up in a waiting state. Does the job eventually complete? Could you potentially add another node to the cluster to see if my guess is right? I just see one active NM.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-running-Spark-1-4-on-Yarn-tp23211p23236.html
Re: Issue running Spark 1.4 on Yarn
I've tried running a Hadoop app pointing to the same queue. Same thing now: the job doesn't get accepted. I've cleared out the queue and killed all the pending jobs, but the queue is still unusable. It seems like an issue with YARN, but it's specifically Spark that leaves the queue in this state. I ran a Hadoop job in a for loop 10x, specifying the queue explicitly, just to double-check.

On Tue, Jun 9, 2015 at 4:45 PM, Matt Kapilevich matve...@gmail.com wrote:

From the RM scheduler, I see 3 applications currently stuck in the root.thequeue queue.

Used Resources: memory:0, vCores:0
Num Active Applications: 0
Num Pending Applications: 3
Min Resources: memory:0, vCores:0
Max Resources: memory:6655, vCores:4
Steady Fair Share: memory:1664, vCores:0
Instantaneous Fair Share: memory:6655, vCores:0
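The "for loop 10x" check Matt describes could look like the sketch below. The jar path and the pi arguments come from his original post; the queue-name property and the dry-run echo are assumptions added here so the loop runs without a cluster (drop the echo redirection and run `yarn jar` directly on a real one).

```shell
# Submit the example MR job ten times against the suspect queue (dry run:
# each command is recorded rather than executed).
QUEUE="root.thequeue"
JAR="/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.4.2.jar"
: > /tmp/yarn-loop-demo.txt
for i in $(seq 1 10); do
  echo "run $i: yarn jar $JAR pi -Dmapreduce.job.queuename=$QUEUE 16 100" \
    >> /tmp/yarn-loop-demo.txt
done
cat /tmp/yarn-loop-demo.txt
```

`mapreduce.job.queuename` is the standard MapReduce property for routing a job to a specific queue, which matches "specifying the queue explicitly" above.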