Mesos slave ID change after reboot

2018-01-10 Thread Srikanth Viswanathan
I am trying to understand under what cases the mesos slave ID changes in
response to reboot.  I noticed this note at
http://mesos.apache.org/documentation/latest/upgrades/#upgrading-from-1-3-x-to-1-4-x
:

> Agent is now allowed to recover its agent ID post a host reboot. This
> prevents the unnecessary discarding of agent ID by prior Mesos versions.
> Notes about backwards compatibility:
>
> - In case the agent’s recovery runs into agent info mismatch which may
>   happen due to resource change associated with reboot, it’ll fall back to
>   recovering as a new agent (existing behavior).
>
> - In other cases such as checkpointed resources (e.g. persistent
>   volumes) being incompatible with the agent’s resources the recovery will
>   still fail (existing behavior).

I was wondering if the behavior prior to 1.3 is also similarly
well-defined. Is the answer "Will always change after a reboot"?
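
For reference, a minimal sketch of how one can check which agent ID is
currently checkpointed on a host (assuming the default work_dir of
/var/lib/mesos and the usual meta/slaves/latest symlink layout), so the ID
can be compared before and after a reboot:

import os

# Minimal sketch: print the agent (slave) ID checkpointed on this host.
# Assumes the default work_dir (/var/lib/mesos); the agent keeps a "latest"
# symlink under <work_dir>/meta/slaves/ pointing at the directory named
# after the current agent ID.
WORK_DIR = "/var/lib/mesos"

latest = os.path.join(WORK_DIR, "meta", "slaves", "latest")
if os.path.islink(latest):
    # The basename of the symlink target is the agent ID.
    print(os.path.basename(os.path.realpath(latest)))
else:
    print("no checkpointed agent ID found under %s" % latest)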

Thanks,
Srikanth


Re: java driver/shutdown call

2018-01-10 Thread Mohit Jaggi
Thanks, Vinod. Is there a V1SchedulerDriver.java file? I see
https://github.com/apache/mesos/tree/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/v1/scheduler
but it does not have a V1 driver.
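
In the meantime, it looks like the SHUTDOWN call can at least be issued
directly against the v1 scheduler HTTP API. A rough Python sketch of what I
have in mind (the master address, the stream ID from the SUBSCRIBE response,
and all of the IDs below are placeholders; requests is just the HTTP client
I happen to use):

import requests

# Rough sketch of a v1 scheduler SHUTDOWN call over HTTP. The framework must
# already be subscribed; Mesos-Stream-Id is the header value returned on the
# SUBSCRIBE response. All IDs and the master address are placeholders.
MASTER = "http://master.example.com:5050"
MESOS_STREAM_ID = "<stream-id-from-subscribe>"

call = {
    "framework_id": {"value": "<framework-id>"},
    "type": "SHUTDOWN",
    "shutdown": {
        "executor_id": {"value": "<executor-id>"},
        "agent_id": {"value": "<agent-id>"},
    },
}

response = requests.post(
    MASTER + "/api/v1/scheduler",
    json=call,
    headers={"Mesos-Stream-Id": MESOS_STREAM_ID},
)
response.raise_for_status()  # the master answers 202 Accepted on success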

On Fri, Jan 5, 2018 at 3:59 PM, Vinod Kone  wrote:

> That's right. It is only available for v1 schedulers.
>
> On Fri, Jan 5, 2018 at 3:38 PM, Mohit Jaggi  wrote:
>
>> Folks,
>> I am trying to change Apache Aurora's code to call SHUTDOWN instead of
>> KILL. However, it seems that the SchedulerDriver class in Mesos does not
>> have a shutdownExecutor() call.
>>
>> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/SchedulerDriver.java
>>
>> Mohit.
>>
>
>


Re: Mesos replicated log fills disk with logging output

2018-01-10 Thread Stephan Erb
Thanks for the hint! The cluster is using ext4, and judging from the linked
thread this could indeed have been caused by a stalling hypervisor.

From: Jie Yu 
Reply-To: "user@mesos.apache.org" 
Date: Monday, 8. January 2018 at 23:36
To: user 
Subject: Re: Mesos replicated log fills disk with logging output

Stephan,

I haven't seen that before. A quick Google search suggests that it might be 
related to leveldb. The following thread might be related.
https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie

On Mon, Jan 8, 2018 at 2:28 PM, Stephan Erb wrote:
Hi everyone,

A few days ago we bumped into an interesting issue that we had not seen
before. Essentially, one of our toy clusters dissolved itself:

·  3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
·  Master 1 and Master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
·  The replicated log of both Master 1 and Master 2 was corrupted. A process restart did not fix it.
·  The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was caused by the full disk.
·  Master 3 was the leading Mesos master and healthy. Its disk usage was normal.


The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos  4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos  4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos   724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos  7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos 22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos 14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos  7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos 15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos    16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos 0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos 463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos 65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly 
coordinated configuration deployment change:

·  Master 1 was leading and restarted after a few hours of uptime
·  Master 2 was now leading. After a few seconds (30s-60s or so) it got restarted as well
·  Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to 
the issues described above. Has anyone seen similar issues as well?
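
For what it's worth, a small watchdog along the lines of the sketch below
would at least have caught the runaway growth early (the path and the 1 GiB
threshold are just assumptions for our setup):

import os
import sys

# Rough sketch of a watchdog for runaway leveldb logging in the replicated
# log. Path and threshold are assumptions; wire the warning into whatever
# monitoring or alerting is already in place.
LOG_FILE = "/var/lib/mesos/replicated_log/LOG"
THRESHOLD_BYTES = 1 * 1024 ** 3  # warn once LOG exceeds 1 GiB

try:
    size = os.path.getsize(LOG_FILE)
except OSError:
    sys.exit(0)  # no replicated log on this host (e.g. not a master)

if size > THRESHOLD_BYTES:
    print("WARNING: %s is %d bytes (threshold %d)" % (LOG_FILE, size, THRESHOLD_BYTES))
    sys.exit(1)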

Thanks and best regards,
Stephan



Re: Mesos rare TASK_LOST scenario v 0.21.0

2018-01-10 Thread Vinod Kone
The command executor issue was probably fixed somewhere between 0.21 and 1.3.
The only reason I mentioned 1.3+ is that any releases before that are out of
the support period. If you can repro the issue with 1.3+ and paste the logs
here or in a JIRA, we can help debug it for you.
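
If you control the executor, one workaround in the meantime is to keep the
executor alive briefly after it sends the terminal update, so the agent gets
a chance to handle the update before the container goes away. A rough sketch
against the old-style Python executor bindings (the 5-second grace period is
an arbitrary choice; the same idea applies to a Java executor):

import threading
import time

# Rough sketch using the old-style Python bindings (mesos.interface /
# mesos.native): send the terminal update, then keep the executor alive for
# a short grace period so the agent can process the update before the
# executor (and its container) exits.
from mesos.interface import Executor, mesos_pb2
import mesos.native


class GracefulExecutor(Executor):
    def launchTask(self, driver, task):
        def run():
            # ... do the actual work of the task here ...

            update = mesos_pb2.TaskStatus()
            update.task_id.value = task.task_id.value
            update.state = mesos_pb2.TASK_FINISHED
            driver.sendStatusUpdate(update)

            # Don't exit immediately: give the agent a moment to handle the
            # TASK_FINISHED update so the executor exit doesn't race with it.
            time.sleep(5)
            driver.stop()

        threading.Thread(target=run).start()


if __name__ == "__main__":
    driver = mesos.native.MesosExecutorDriver(GracefulExecutor())
    driver.run()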

On Wed, Jan 10, 2018 at 9:47 AM, Ajay V  wrote:

> Thanks for getting back, Vinod. So, does that mean that even for v1.2,
> these race conditions (where the command executor doesn't stay up long
> enough) existed and that the 1.3 versions fix them? The reason I ask is
> that I did try an upgrade to v1.2 and still found very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone  wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent; so races like above were possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V  wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST that's generated on the agent on rare
>>> occasions.
>>>
>>> Following is a log that I'm trying to understand. This is happening
>>> after driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a Java executor. It looks to me like the container has
>>> already exited before the TASK_FINISHED is processed. Is there a timing
>>> issue in this version of Mesos that is causing this? The effect of this
>>> problem is that, even though the work of the executor is complete and the
>>> executor calls sendStatusUpdate with TASK_FINISHED, the task is marked as
>>> LOST and the actual update of TASK_FINISHED is ignored.
>>>
>>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>>
>>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
>>> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
>>> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
>>> Failed to get usage: No process found at 28952
>>>
>>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
>>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>>> 20171208-050805-140555025-5050-3470- exited with status 0
>>>
>>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470- from @0.0.0.0:0
>>>
>>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
>>> ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
>>> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>>
>>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received
>>> status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-
>>>
>>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
>>> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470- to the slave
>>>
>>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470- to master@17.179.96.8:5050
>>>
>>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
>>> successfully handled status update TASK_LOST (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-
>>>
>>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received
>>> status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-
>>>
>>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
>>> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
>>> framework 20171208-050805-140555025-5050-3470-
>>>
>>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
>>> successfully handled status update acknowledgement (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-
>>>
>>> I0108 10:16:52.958693 37268 

Re: Mesos rare TASK_LOST scenario v 0.21.0

2018-01-10 Thread Ajay V
Thanks for getting back, Vinod. So, does that mean that even for v1.2, these
race conditions (where the command executor doesn't stay up long enough)
existed and that the 1.3 versions fix them? The reason I ask is that I did
try an upgrade to v1.2 and still found very similar issues.

Regards,
Ajay

On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone  wrote:

> 0.21 is really old and not supported. I highly recommend you upgrade to
> 1.3+.
>
> Regarding what you are seeing, we definitely had issues in the past where
> the command executor didn't stay up long enough to guarantee that
> TASK_FINISHED was delivered to the agent; so races like above were possible.
>
> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V  wrote:
>
>> Hello,
>>
>> I'm trying to debug a TASK_LOST that's generated on the agent on rare
>> occasions.
>>
>> Following is a log that I'm trying to understand. This is happening after
>> driver.sendStatusUpdate() has been called with a task state of
>> TASK_FINISHED from a Java executor. It looks to me like the container has
>> already exited before the TASK_FINISHED is processed. Is there a timing
>> issue in this version of Mesos that is causing this? The effect of this
>> problem is that, even though the work of the executor is complete and the
>> executor calls sendStatusUpdate with TASK_FINISHED, the task is marked as
>> LOST and the actual update of TASK_FINISHED is ignored.
>>
>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for
>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>
>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
>> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>
>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>
>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
>> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
>> Failed to get usage: No process found at 28952
>>
>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>> 20171208-050805-140555025-5050-3470- exited with status 0
>>
>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470- from @0.0.0.0:0
>>
>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
>> ff631ad1-cfab-493e-be18-961581abcf3d
>>
>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
>> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>
>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received
>> status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for
>> task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-
>>
>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
>> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470- to the slave
>>
>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470- to master@17.179.96.8:5050
>>
>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
>> successfully handled status update TASK_LOST (UUID:
>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-
>>
>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received
>> status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-
>>
>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
>> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
>> framework 20171208-050805-140555025-5050-3470-
>>
>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
>> successfully handled status update acknowledgement (UUID:
>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-
>>
>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
>> ff631ad1-cfab-493e-be18-961581abcf3d
>>
>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>> 20171208-050805-140555025-5050-3470-
>>
>> Regards,
>> Ajay
>>
>
>


Re: New 1.5 Marathon deb package - no documentation

2018-01-10 Thread Adam Cecile

Uh?

It's not even provided in the package:

root@dev:~# dpkg -L marathon | grep ini
root@dev:~#

On 01/10/2018 10:01 AM, haosdent wrote:
Marathon 1.5 uses /usr/share/marathon/conf/application.ini as its
configuration file.


On Wed, Jan 10, 2018 at 4:59 PM, Adam Cecile wrote:


Hello,


I'm testing the Mesos 1.4 + Marathon 1.5 update but I cannot
understand how the Marathon 1.5 deb package works.

The Marathon binary seems to completely ignore my /etc/marathon/*
config files used by the previous version, and when looking at the
systemd unit file, I do not understand how to pass startup
command-line switches in this version.


Is there any documentation I missed?


Thanks.




--
Best Regards,
Haosdent Huang





Re: New 1.5 Marathon deb package - no documentation

2018-01-10 Thread haosdent
Marathon 1.5 uses /usr/share/marathon/conf/application.ini as its configuration file.
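
If I remember the packaging right, the launcher script under
/usr/share/marathon/bin reads that file when it exists and treats each line
as one command-line argument, so you could create it with content along
these lines (the ZooKeeper addresses and the JVM memory flag are just
placeholders for your setup):

--master
zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos
--zk
zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/marathon
-J-Xmx2g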

On Wed, Jan 10, 2018 at 4:59 PM, Adam Cecile  wrote:

> Hello,
>
>
> I'm testing the Mesos 1.4 + Marathon 1.5 update but I cannot understand how
> the Marathon 1.5 deb package works.
>
> The Marathon binary seems to completely ignore my /etc/marathon/* config
> files used by the previous version, and when looking at the systemd unit
> file, I do not understand how to pass startup command-line switches in this
> version.
>
>
> Is there any documentation I missed?
>
>
> Thanks.
>
>


-- 
Best Regards,
Haosdent Huang


New 1.5 Marathon deb package - no documentation

2018-01-10 Thread Adam Cecile

Hello,


I'm testing the Mesos 1.4 + Marathon 1.5 update but I cannot understand how
the Marathon 1.5 deb package works.


The Marathon binary seems to completely ignore my /etc/marathon/* config
files used by the previous version, and when looking at the systemd unit
file, I do not understand how to pass startup command-line switches in this
version.



Is there any documentation I missed?


Thanks.