Hi all, a correction:
I saw the correct output of nvidia-smi in the stdout file in the task’s work dir on the agent (that was the piece I didn’t get, reading helps!). So I have to see why the framework doesn’t receive any offers.

Thanks,
Ben

> On 7. Jun 2020, at 15:01, Benjamin Wulff <benjamin.wulff...@ieee.org> wrote:
>
> Hi all,
>
> I found the gpu-support site in the docs (1) and tried the following command:
>
> # mesos-execute --master=129.26.78.161:5050 --name=gpu-test --command="nvidia-smi" --framework_capabilities="GPU_RESOURCES" --resources="gpus:1"
>
> ..and that gave the following output:
>
> I0607 14:57:41.897706 56361 scheduler.cpp:189] Version: 1.9.0
> I0607 14:57:41.913520 56361 scheduler.cpp:342] Using default 'basic' HTTP authenticatee
> I0607 14:57:41.913813 56367 scheduler.cpp:525] New master detected at master@129.26.78.161:5050
> Subscribed with ID f2e21b9a-3bb6-4f40-bfe3-3ec4f8eda64a-0005
> Submitted task 'gpu-test' to agent 'f2e21b9a-3bb6-4f40-bfe3-3ec4f8eda64a-S0'
> Received status update TASK_STARTING for task 'gpu-test'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'gpu-test'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FINISHED for task 'gpu-test'
>   message: 'Command exited with status 0'
>   source: SOURCE_EXECUTOR
>
> I did not see the output of nvidia-smi as I should have according to the documentation.
>
> I have attached the logs of master and agent.
>
> Thanks,
> Ben
> <mesos-master.log.INFO.txt>
> <mesos-slave.node-01.root.log.INFO.2.log>
>
>> On 7. Jun 2020, at 02:43, Benjamin Mahler <bmah...@apache.org> wrote:
>>
>> Don't worry about that "Ignoring" message on the agent. When the framework information is updated, the master broadcasts it to the agents, and in this case the agent doesn't know about the framework since it has no tasks for it, and so it ignores the updated information.
>>
>> I can't quite tell from the log snippet you provided. Assuming this is the only scheduler registered, it should receive offers for all the agents for the scheduler's roles (in this case, should just be the '*' role).
>>
>> Some reasons offers might not be sent:
>>
>> - Framework doesn't have the capability required to be offered the agent (e.g. scheduler doesn't have GPU_RESOURCES when the agent has GPUs).
>> - Framework suppressed its role(s) (doesn't seem to be the case from the log snippet).
>> - The role has insufficient quota (e.g. if you have set a quota limit for that role, or if other roles have quota guarantees overcommitting the cluster).
>> - The agent's resources are reserved to a role.
>>
>> Can you show us the scheduler code? Can you give us complete logs, along with results of the agent and master /state endpoints?
>>
>>
>> On Sat, Jun 6, 2020 at 8:00 AM Benjamin Wulff <benjamin.wulff...@ieee.org> wrote:
>> So with logging_level set to INFO (and master and slave restarted) I noticed in /var/log/mesos.INFO on the agent the following line:
>>
>> I0606 13:46:41.393455 206117 slave.cpp:4222] Ignoring info update for framework 2777de92-bc91-4e48-9960-bbab05694665-0000 because it does not exist
>>
>> That is indeed the ID of the framework I’d like to run my task in. In the web UI the framework is listed. So why is the agent saying that it doesn’t exist? What are the semantics of this message?
>>
>> On the master in /var/log/mesos-master.INFO the last relevant log lines (after that come HTTP requests) are:
>>
>> I0606 13:50:11.710996 52025 http.cpp:1115] HTTP POST for /master/api/v1/scheduler from 172.30.0.8:41378 with User-Agent='python-requests/2.23.0'
>> I0606 13:50:11.720903 52025 master.cpp:2670] Received subscription request for HTTP framework 'Go-Docker'
>> I0606 13:50:11.721153 52025 master.cpp:2742] Subscribing framework 'Go-Docker' with checkpointing disabled and capabilities [ ]
>> I0606 13:50:11.721993 52025 master.cpp:10847] Adding framework 2777de92-bc91-4e48-9960-bbab05694665-0000 (Go-Docker) with roles { } suppressed
>> I0606 13:50:11.722084 52025 master.cpp:8300] Updating framework 2777de92-bc91-4e48-9960-bbab05694665-0000 (Go-Docker) with roles { } suppressed
>> I0606 13:50:11.722514 52030 hierarchical.cpp:605] Added framework 2777de92-bc91-4e48-9960-bbab05694665-0000
>> I0606 13:50:11.722573 52030 hierarchical.cpp:711] Deactivated framework 2777de92-bc91-4e48-9960-bbab05694665-0000
>> I0606 13:50:11.722625 52030 hierarchical.cpp:1552] Suppressed offers for roles { } of framework 2777de92-bc91-4e48-9960-bbab05694665-0000
>> I0606 13:50:11.722657 52030 hierarchical.cpp:1592] Unsuppressed offers and cleared filters for roles { } of framework 2777de92-bc91-4e48-9960-bbab05694665-0000
>> I0606 13:50:11.722703 52030 hierarchical.cpp:681] Activated framework 2777de92-bc91-4e48-9960-bbab05694665-0000
>>
>> From my novice perspective it seems the framework is registered..
>>
>> Thanks,
>> Ben
>>
>>
>> > On 6. Jun 2020, at 13:40, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>> >
>> >
>> > You already put these on debug?
>> >
>> > [@ ]# cat /etc/mesos-master/logging_level
>> > WARNING
>> > [@ ]# cat /etc/mesos-slave/logging_level
>> > WARNING
>> >
>> >
>> > -----Original Message-----
>> > From: Benjamin Wulff [mailto:benjamin.wulff...@ieee.org]
>> > Sent: Saturday, 6 June 2020 13:36
>> > To: user@mesos.apache.org
>> > Subject: No offers are being made -- how to debug Mesos?
>> >
>> > Hi all,
>> >
>> > I’m in the process of setting up my first Mesos cluster with 1x master and 3x slaves on CentOS 8.
>> >
>> > So far I have set up ZooKeeper and Mesos-master on the master and Mesos-slave on one of the compute nodes. Mesos-master communicates with ZK and becomes leader. Then I started mesos-slave on the compute node and can see in the log that it registers at the master with the correct resources reported. The agent and its resources are also displayed in the web UI of the master. So is the framework that I want to use.
>> >
>> > The crux is that no tasks I schedule in the framework are executed. And I suppose this is because the framework never receives an offer. I can see in the web UI that no offers are made and that all resources remain idle.
>> >
>> > Now, I’m new to Mesos and I don’t really have an idea how to debug my setup at this point.
>> >
>> > There is a page called ‘Debugging with the new CLI’ in the documentation but it only explains how to configure the CLI command.
>> >
>> > Any directions on how to debug in my situation in general or on how to use the CLI for debugging would be highly welcome! :)
>> >
>> > Thanks and best regards,
>> > Ben
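
For anyone finding this thread later: some of the checks listed above (missing GPU_RESOURCES capability, inactive framework, reserved agent resources) could be run against the master's /state output with a rough sketch like the one below. It is not an official Mesos tool. The helper names fetch_state and diagnose are made up for illustration, and the JSON field names used (slaves, frameworks, capabilities, active, resources.gpus, reserved_resources) are assumptions about the /state layout that should be checked against your Mesos version. Quota and suppressed roles are not covered.

```python
import json
from urllib.request import urlopen


def fetch_state(master):
    """Fetch the master's /state JSON; `master` is host:port (hypothetical helper)."""
    with urlopen(f"http://{master}/state", timeout=10) as resp:
        return json.load(resp)


def diagnose(state):
    """Return hints on why frameworks may receive no offers.

    All field names below are assumptions about the /state JSON layout.
    """
    hints = []
    agents = state.get("slaves", [])
    # Agents that advertise at least one GPU in their total resources.
    gpu_agents = [a for a in agents if a.get("resources", {}).get("gpus", 0) > 0]

    for fw in state.get("frameworks", []):
        name = fw.get("name", fw.get("id", "?"))
        caps = fw.get("capabilities", [])
        # Reason 1 from the thread: GPU agents, but no GPU_RESOURCES capability.
        if gpu_agents and "GPU_RESOURCES" not in caps:
            hints.append(
                f"{name}: lacks GPU_RESOURCES but {len(gpu_agents)} agent(s) have GPUs"
            )
        if not fw.get("active", True):
            hints.append(f"{name}: framework is not active")

    for agent in agents:
        # Reason 4 from the thread: resources reserved to a role.
        reserved = agent.get("reserved_resources", {})
        if reserved:
            host = agent.get("hostname", agent.get("id", "?"))
            hints.append(f"agent {host}: resources reserved for roles {sorted(reserved)}")
    return hints


if __name__ == "__main__":
    # Hypothetical sample data, shaped like the assumed /state layout.
    sample = {
        "slaves": [{"hostname": "node-01", "resources": {"gpus": 1, "cpus": 8}}],
        "frameworks": [{"name": "Go-Docker", "capabilities": [], "active": True}],
    }
    for hint in diagnose(sample):
        print(hint)
```

Against a live cluster one would call diagnose(fetch_state("129.26.78.161:5050")) instead of using the sample dict. The master log quoted above shows 'Go-Docker' subscribing with capabilities [ ], which is the kind of case the first check is meant to flag.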