Mesos slave ID change after reboot
I am trying to understand under what cases the Mesos slave ID changes in response to a reboot. I noticed this note at http://mesos.apache.org/documentation/latest/upgrades/#upgrading-from-1-3-x-to-1-4-x :

> Agent is now allowed to recover its agent ID post a host reboot. This
> prevents the unnecessary discarding of agent ID by prior Mesos versions.
>
> Notes about backwards compatibility:
>
> - In case the agent's recovery runs into an agent info mismatch, which may
>   happen due to a resource change associated with the reboot, it'll fall
>   back to recovering as a new agent (existing behavior).
>
> - In other cases, such as checkpointed resources (e.g. persistent volumes)
>   being incompatible with the agent's resources, the recovery will still
>   fail (existing behavior).

I was wondering if the behavior prior to 1.4 is also similarly well-defined. Is the answer "will always change after a reboot"?

Thanks,
Srikanth
Re: java driver/shutdown call
Thanks Vinod. Is there a V1SchedulerDriver.java file? I see https://github.com/apache/mesos/tree/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/v1/scheduler but it does not have a V1 driver.

On Fri, Jan 5, 2018 at 3:59 PM, Vinod Kone wrote:

> That's right. It is only available for v1 schedulers.
>
> On Fri, Jan 5, 2018 at 3:38 PM, Mohit Jaggi wrote:
>
>> Folks,
>> I am trying to change Apache Aurora's code to call SHUTDOWN instead of
>> KILL. However, it seems that the SchedulerDriver class in Mesos does not
>> have a shutdownExecutor() call.
>>
>> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/SchedulerDriver.java
>>
>> Mohit.
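For context: in the v1 world there is no driver class at all. A scheduler talks to the master through the org.apache.mesos.v1.scheduler.Mesos interface (implemented by, e.g., the V1Mesos JNI shim in the tree linked above) and sends a SHUTDOWN call explicitly. A minimal sketch, assuming a connected v1 client; the helper method, class name, and IDs below are illustrative, not code from the Mesos tree:

import org.apache.mesos.v1.Protos.AgentID;
import org.apache.mesos.v1.Protos.ExecutorID;
import org.apache.mesos.v1.Protos.FrameworkID;
import org.apache.mesos.v1.scheduler.Mesos;
import org.apache.mesos.v1.scheduler.Protos.Call;

public class ShutdownExample {
    // Builds and sends a SHUTDOWN call for one executor. 'mesos' is an
    // already-connected v1 client (e.g. a V1Mesos instance); the framework,
    // executor, and agent IDs identify the executor to shut down.
    static void shutdownExecutor(Mesos mesos, FrameworkID frameworkId,
                                 ExecutorID executorId, AgentID agentId) {
        Call call = Call.newBuilder()
            .setType(Call.Type.SHUTDOWN)
            .setFrameworkId(frameworkId)
            .setShutdown(Call.Shutdown.newBuilder()
                .setExecutorId(executorId)
                .setAgentId(agentId))
            .build();
        mesos.send(call); // fire-and-forget; outcome arrives via status updates
    }
}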
Re: Mesos replicated log fills disk with logging output
Thanks for the hint! The cluster is using ext4, and judging from the linked thread this could indeed have been caused by a stalling hypervisor.

From: Jie Yu
Reply-To: "user@mesos.apache.org"
Date: Monday, 8. January 2018 at 23:36
To: user
Subject: Re: Mesos replicated log fills disk with logging output

Stephan,

I haven't seen that before. A quick Google search suggests that it might be related to leveldb. The following thread might be related:
https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie

On Mon, Jan 8, 2018 at 2:28 PM, Stephan Erb wrote:

Hi everyone,

a few days ago, we bumped into an interesting issue that we had not seen before. Essentially, one of our toy clusters dissolved itself:

- 3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
- Master 1 and master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
- The replicated log of both master 1 and 2 was corrupted. A process restart did not fix it.
- The ZooKeeper on master 2 was corrupted as well. Logs indicated this was caused by the full disk.
- Master 3 was the leading Mesos master and healthy. Its disk usage was normal.

The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos         4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos         4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos          724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos         7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos        22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos        14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos         7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos        15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos           16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos            0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos    463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos        65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly coordinated configuration deployment change:

- Master 1 was leading and restarted after a few hours of uptime
- Master 2 was now leading. After a few seconds (30s-60s or so) it got restarted as well
- Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to the issues described above. Has anyone seen similar issues as well?

Thanks and best regards,
Stephan
Re: Mesos rare TASK_LOST scenario v 0.21.0
The command executor was probably fixed somewhere between 0.21 and 1.3. The only reason I mentioned 1.3+ is because any releases before that are out of the support period. If you can repro the issue with 1.3+ and paste the logs here or in a JIRA, we can help debug it for you.

On Wed, Jan 10, 2018 at 9:47 AM, Ajay V wrote:

> Thanks for getting back, Vinod. So, does that mean that even for v1.2,
> these race conditions (where the command executor doesn't stay up long
> enough) existed, and that the 1.3 versions fix them? Reason for asking is
> that I did try an upgrade to v1.2 and still found very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent; so races like the above were
>> possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST that is generated on the agent on rare
>>> occasions.
>>>
>>> Following is a log that I'm trying to understand. This is happening
>>> after driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a Java executor. It looks to me like the container
>>> has already exited before the TASK_FINISHED is processed. Is there a
>>> timing issue in this version of Mesos that is causing this? The effect
>>> of this problem is that, even though the work of the executor is
>>> complete and the executor calls sendStatusUpdate with a TASK_FINISHED,
>>> the task is marked as LOST and the actual update of TASK_FINISHED is
>>> ignored.
>>>
>>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because: Failed to get usage: No process found at 28952
>>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470- exited with status 0
>>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- from @0.0.0.0:0
>>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task ff631ad1-cfab-493e-be18-961581abcf3d
>>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to the slave
>>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to master@17.179.96.8:5050
>>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager successfully handled status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager successfully handled status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>>> I0108 10:16:52.958693 37268
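To make the race concrete for custom Java executors: the usual executor-side mitigation was to keep the process alive briefly after sending the terminal update, so the agent can receive TASK_FINISHED before the container exits and the agent synthesizes TASK_LOST. A minimal sketch against the v0 org.apache.mesos executor API; the helper method and the 5-second grace period are illustrative assumptions, not an official recipe:

import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

public class FinishThenLinger {
    // Sends the terminal TASK_FINISHED update, then lingers for a short grace
    // period before stopping the driver, reducing the window in which the
    // executor exits before the agent has processed the update.
    static void finishTask(ExecutorDriver driver, TaskID taskId) {
        TaskStatus finished = TaskStatus.newBuilder()
            .setTaskId(taskId)
            .setState(TaskState.TASK_FINISHED)
            .build();
        driver.sendStatusUpdate(finished);

        try {
            Thread.sleep(5000); // exiting immediately risks the race above
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        driver.stop(); // now let the driver and the process exit cleanly
    }
}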
Re: Mesos rare TASK_LOST scenario v 0.21.0
Thanks for getting back, Vinod. So, does that mean that even for v1.2, these race conditions (where the command executor doesn't stay up long enough) existed, and that the 1.3 versions fix them? Reason for asking is that I did try an upgrade to v1.2 and still found very similar issues.

Regards,
Ajay

On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone wrote:

> 0.21 is really old and not supported. I highly recommend you upgrade to
> 1.3+.
>
> Regarding what you are seeing, we definitely had issues in the past where
> the command executor didn't stay up long enough to guarantee that
> TASK_FINISHED was delivered to the agent; so races like the above were
> possible.
>
> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V wrote:
>
>> Hello,
>>
>> I'm trying to debug a TASK_LOST that is generated on the agent on rare
>> occasions.
>>
>> Following is a log that I'm trying to understand. This is happening after
>> driver.sendStatusUpdate() has been called with a task state of
>> TASK_FINISHED from a Java executor. It looks to me like the container has
>> already exited before the TASK_FINISHED is processed. Is there a timing
>> issue in this version of Mesos that is causing this? The effect of this
>> problem is that, even though the work of the executor is complete and the
>> executor calls sendStatusUpdate with a TASK_FINISHED, the task is marked
>> as LOST and the actual update of TASK_FINISHED is ignored.
>>
>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because: Failed to get usage: No process found at 28952
>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470- exited with status 0
>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- from @0.0.0.0:0
>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task ff631ad1-cfab-493e-be18-961581abcf3d
>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to the slave
>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470- to master@17.179.96.8:5050
>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager successfully handled status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager successfully handled status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d of framework 20171208-050805-140555025-5050-3470-
>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task ff631ad1-cfab-493e-be18-961581abcf3d
>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework 20171208-050805-140555025-5050-3470-
>>
>> Regards,
>> Ajay
Re: New 1.5 Marathon deb package - no documentation
Uh? It's not even provided in the package:

root@dev:~# dpkg -L marathon | grep ini
root@dev:~#

On 01/10/2018 10:01 AM, haosdent wrote:

> Marathon 1.5 uses /usr/share/marathon/conf/application.ini as its
> configuration file.
>
> On Wed, Jan 10, 2018 at 4:59 PM, Adam Cecile wrote:
>
>> Hello,
>>
>> I'm testing the Mesos 1.4 + Marathon 1.5 update, but I cannot understand
>> how the Marathon 1.5 deb package works.
>>
>> The Marathon binary seems to completely ignore my /etc/marathon/* config
>> files used by the previous version, and when looking at the systemd file,
>> I do not understand how to pass startup command-line switches in this
>> version.
>>
>> Is there any documentation I missed?
>>
>> Thanks.
>
> --
> Best Regards,
> Haosdent Huang
Re: New 1.5 Marathon deb package - no documentation
Marathon 1.5 uses /usr/share/marathon/conf/application.ini as its configuration file.

On Wed, Jan 10, 2018 at 4:59 PM, Adam Cecile wrote:

> Hello,
>
> I'm testing the Mesos 1.4 + Marathon 1.5 update, but I cannot understand
> how the Marathon 1.5 deb package works.
>
> The Marathon binary seems to completely ignore my /etc/marathon/* config
> files used by the previous version, and when looking at the systemd file,
> I do not understand how to pass startup command-line switches in this
> version.
>
> Is there any documentation I missed?
>
> Thanks.

--
Best Regards,
Haosdent Huang
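For readers hitting the same wall: assuming the 1.5 package follows the usual sbt-native-packager launcher convention, application.ini holds one start-up token per line, with -J-prefixed entries passed to the JVM and everything else passed to Marathon as command-line flags. A hypothetical sketch of such a file; the flag values and ZooKeeper URLs below are placeholders, not defaults shipped by the package:

# /usr/share/marathon/conf/application.ini -- hypothetical contents
# One token per line; -J entries are JVM options, the rest are Marathon flags.
-J-Xmx2g
--master
zk://zk1:2181,zk2:2181,zk3:2181/mesos
--zk
zk://zk1:2181,zk2:2181,zk3:2181/marathon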
New 1.5 Marathon deb package - no documentation
Hello,

I'm testing the Mesos 1.4 + Marathon 1.5 update, but I cannot understand how the Marathon 1.5 deb package works.

The Marathon binary seems to completely ignore my /etc/marathon/* config files used by the previous version, and when looking at the systemd file, I do not understand how to pass startup command-line switches in this version.

Is there any documentation I missed?

Thanks.