[jira] [Comment Edited] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.

2019-04-01 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807429#comment-16807429
 ] 

Chun-Hung Hsiao edited comment on MESOS-9624 at 4/2/19 6:21 AM:


Reviews:
[https://reviews.apache.org/r/70302/]
[https://reviews.apache.org/r/70303/]
[https://reviews.apache.org/r/70360/]
[https://reviews.apache.org/r/70361/]


was (Author: chhsia0):
Reviews:
[https://reviews.apache.org/r/70302/]
[https://reviews.apache.org/r/70303/]
[https://reviews.apache.org/r/70360/]

> Bundle CSI spec v1.0 in Mesos.
> --
>
> Key: MESOS-9624
> URL: https://issues.apache.org/jira/browse/MESOS-9624
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of 
> the source code filesystem layout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9623) Implement CSI volume manager with CSI v1.

2019-04-01 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9623:
--

Assignee: Chun-Hung Hsiao

> Implement CSI volume manager with CSI v1.
> -
>
> Key: MESOS-9623
> URL: https://issues.apache.org/jira/browse/MESOS-9623
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>






[jira] [Comment Edited] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.

2019-04-01 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807429#comment-16807429
 ] 

Chun-Hung Hsiao edited comment on MESOS-9624 at 4/2/19 6:18 AM:


Reviews:
[https://reviews.apache.org/r/70302/]
[https://reviews.apache.org/r/70303/]
[https://reviews.apache.org/r/70360/]


was (Author: chhsia0):
Reviews:
[https://reviews.apache.org/r/70302/]
[https://reviews.apache.org/r/70303/
] [https://reviews.apache.org/r/70360/]

> Bundle CSI spec v1.0 in Mesos.
> --
>
> Key: MESOS-9624
> URL: https://issues.apache.org/jira/browse/MESOS-9624
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of 
> the source code filesystem layout.





[jira] [Assigned] (MESOS-9624) Bundle CSI spec v1.0 in Mesos.

2019-04-01 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-9624:
--

Assignee: Chun-Hung Hsiao

> Bundle CSI spec v1.0 in Mesos.
> --
>
> Key: MESOS-9624
> URL: https://issues.apache.org/jira/browse/MESOS-9624
> Project: Mesos
>  Issue Type: Task
>  Components: storage
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
>
> We need to bundle both CSI v0 and v1 in Mesos. This requires some redesign of 
> the source code filesystem layout.





[jira] [Commented] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-04-01 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807338#comment-16807338
 ] 

Jie Yu commented on MESOS-9529:
---

https://reviews.apache.org/r/70355/
https://reviews.apache.org/r/70356/

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., using the host 
> rootfs), the `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container inherits the host's mount 
> namespace and thus the `/proc` mounted there. As a result, the entries under 
> `/proc/` still reflect the host pid namespace, while the pid namespace of the 
> parent container may differ from that of the host.
> Consequently, `ps aux` in the nested container will show process information 
> for the host pid namespace, even though the pid namespace of the nested 
> container is different from that of the host.
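
The `ps aux` behavior described above follows from the fact that `ps` simply enumerates the numeric entries of whatever `/proc` is mounted, so a stale host `/proc` yields host pids. A minimal Python sketch of that enumeration (illustrative only, not Mesos code):

```python
import os

def visible_pids(proc="/proc"):
    """List the pids visible through a given /proc mount. Tools like `ps`
    do essentially this, which is why a nested container that inherits the
    host's /proc mount "sees" host processes regardless of which pid
    namespace the container itself is in."""
    return sorted(int(d) for d in os.listdir(proc) if d.isdigit())

# The calling process is always visible through its own /proc mount.
pids = visible_pids()
```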





[jira] [Assigned] (MESOS-9529) `/proc` should be remounted even if a nested container set `share_pid_namespace` to true

2019-04-01 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-9529:
-

Assignee: Jie Yu

> `/proc` should be remounted even if a nested container set 
> `share_pid_namespace` to true
> 
>
> Key: MESOS-9529
> URL: https://issues.apache.org/jira/browse/MESOS-9529
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.4.2, 1.5.2, 1.6.2, 1.7.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Critical
>
> Currently, if a nested container wants to share the pid namespace of its 
> parent container, we allow the framework to set 
> `LinuxInfo.share_pid_namespace`.
> If the nested container does not have its own rootfs (i.e., using the host 
> rootfs), the `/proc` is not re-mounted:
> https://github.com/apache/mesos/blob/1.7.x/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L120-L126
> This is problematic because the nested container inherits the host's mount 
> namespace and thus the `/proc` mounted there. As a result, the entries under 
> `/proc/` still reflect the host pid namespace, while the pid namespace of the 
> parent container may differ from that of the host.
> Consequently, `ps aux` in the nested container will show process information 
> for the host pid namespace, even though the pid namespace of the nested 
> container is different from that of the host.





[jira] [Created] (MESOS-9694) Refactor UCR docker store to construct 'Image' protobuf at Puller.

2019-04-01 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9694:
---

 Summary: Refactor UCR docker store to construct 'Image' protobuf 
at Puller.
 Key: MESOS-9694
 URL: https://issues.apache.org/jira/browse/MESOS-9694
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Gilbert Song
Assignee: Gilbert Song








[jira] [Commented] (MESOS-9504) Use ResourceQuantities in the allocator and sorter to improve performance.

2019-04-01 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16807117#comment-16807117
 ] 

Meng Zhu commented on MESOS-9504:
-

{noformat}
commit a5af0162dbe77ba3d76d94b2a2b4beb92d63f83a
Author: Meng Zhu m...@mesosphere.io
Date:   Fri Mar 29 15:39:17 2019 -0700

Used `ResourceQuantities` in `__allocate()` when appropriate.

Replaced `Resources` with `ResourceQuantities` when
appropriate in `__allocate()`. This simplifies the
code and improves performance.

Review: https://reviews.apache.org/r/70320
{noformat}

{noformat}
commit ef26c536f89c76ec2752299096bc5f5538753ed2
Author: Meng Zhu m...@mesosphere.io
Date:   Wed Mar 27 17:17:32 2019 -0700


Used `ResourceQuantities` in the sorter when appropriate.

Replaced `Resources` with `ResourceQuantities` when
appropriate in the sorter. This simplifies the code
and improves performance.

Review: https://reviews.apache.org/r/70330
{noformat}

> Use ResourceQuantities in the allocator and sorter to improve performance.
> --
>
> Key: MESOS-9504
> URL: https://issues.apache.org/jira/browse/MESOS-9504
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Major
>  Labels: performance
>
> In allocator and sorter, we need to do a lot of quantity calculations. 
> Currently, we use the full {{Resources}} type with utilities like 
> {{createScalarResourceQuantities()}}, even though we only care about 
> quantities. Replace {{Resources}} with {{ResourceQuantities}}.
> See:
> https://github.com/apache/mesos/blob/386b1fe99bb9d10af2abaca4832bf584b6181799/src/master/allocator/sorter/drf/sorter.hpp#L444-L445
> https://reviews.apache.org/r/70061/
> With the addition of ResourceQuantities, callers can now just do 
> {{ResourceQuantities.fromScalarResources(r.scalars())}} instead of using 
> {{Resources::createStrippedScalarQuantity()}}, which should actually be a bit 
> more efficient since we only copy the shared pointers rather than construct 
> new `Resource` objects.
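
The quantity-only representation can be illustrated with a small sketch (Python for illustration; the real Mesos code is C++ and the names below are invented):

```python
from collections import Counter

def scalar_quantities(resources):
    """Collapse resource objects into bare name -> quantity sums,
    dropping reservation/role metadata -- the idea behind using a
    quantities-only type where the allocator and sorter care only
    about how much of each resource there is."""
    totals = Counter()
    for r in resources:
        totals[r["name"]] += r["value"]
    return totals

offered = [
    {"name": "cpus", "value": 2.0, "role": "roleA"},
    {"name": "cpus", "value": 1.5, "role": "*"},
    {"name": "mem", "value": 1024.0, "role": "*"},
]
totals = scalar_quantities(offered)
```

Because `Counter` supports cheap addition and subtraction, aggregating quantities this way avoids the overhead of constructing and merging full resource objects.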





[jira] [Assigned] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume

2019-04-01 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9352:


Assignee: Joseph Wu

> Data in persistent volume deleted accidentally when using Docker container 
> and Persistent volume
> 
>
> Key: MESOS-9352
> URL: https://issues.apache.org/jira/browse/MESOS-9352
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: David Ko
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: dcos, dcos-1.11.6, mesosphere, persistent-volumes
> Attachments: image-2018-10-24-22-20-51-059.png, 
> image-2018-10-24-22-21-13-399.png
>
>
> Starting a service from a Docker image with a persistent volume can cause the 
> data in the persistent volume to be deleted accidentally when the task is 
> killed and restarted; old mount points are also left mounted, even after the 
> service has been deleted.
> *The expected result is that data in the persistent volume is kept until the 
> task is deleted completely, and dangling mount points are unmounted correctly.*
>  
> *Step 1:* Use the JSON config below to create a MySQL server using a Docker 
> image and a persistent volume
> {code:javascript}
> {
>   "env": {
> "MYSQL_USER": "wordpress",
> "MYSQL_PASSWORD": "secret",
> "MYSQL_ROOT_PASSWORD": "supersecret",
> "MYSQL_DATABASE": "wordpress"
>   },
>   "id": "/mysqlgc",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "constraints": [
> [
>   "hostname",
>   "IS",
>   "172.27.12.216"
> ]
>   ],
>   "container": {
> "portMappings": [
>   {
> "containerPort": 3306,
> "hostPort": 0,
> "protocol": "tcp",
> "servicePort": 1
>   }
> ],
> "type": "DOCKER",
> "volumes": [
>   {
> "persistent": {
>   "type": "root",
>   "size": 1000,
>   "constraints": []
> },
> "mode": "RW",
> "containerPath": "mysqldata"
>   },
>   {
> "containerPath": "/var/lib/mysql",
> "hostPath": "mysqldata",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "mysql",
>   "forcePullImage": false,
>   "privileged": false,
>   "parameters": []
> }
>   },
>   "cpus": 1,
>   "disk": 0,
>   "instances": 1,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 512,
>   "gpus": 0,
>   "networks": [
> {
>   "mode": "container/bridge"
> }
>   ],
>   "residency": {
> "relaunchEscalationTimeoutSeconds": 3600,
> "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
> "maximumOverCapacity": 0,
> "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": [],
>   "fetch": []
> }
> {code}
> *Step 2:* Kill the mysqld process to force rescheduling of a new MySQL task. 
> Afterwards, 2 mount points to the same persistent volume are found, meaning 
> the old mount point was not unmounted immediately.
> !image-2018-10-24-22-20-51-059.png!
> *Step 3:* After GC, the data in the persistent volume was deleted 
> accidentally, while mysqld (the Mesos task) was still running
> !image-2018-10-24-22-21-13-399.png!
> *Step 4:* After deleting the MySQL service from Marathon, none of the mount 
> points could be unmounted, even though the service was already deleted.
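
One way to spot the symptom in step 2 (the same volume mounted at two places) is to group `/proc/self/mountinfo` entries by device and root. A sketch (the helper name is invented, not a Mesos API):

```python
from collections import defaultdict

def duplicate_mounts(mountinfo_text):
    """Group mountinfo lines by (major:minor device, root of the mount)
    and return any source mounted at more than one mount point."""
    seen = defaultdict(list)
    for line in mountinfo_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue
        # mountinfo fields: mount id, parent id, major:minor, root, mount point, ...
        seen[(parts[2], parts[3])].append(parts[4])
    return {src: mps for src, mps in seen.items() if len(mps) > 1}

# Synthetic example resembling the double-mounted persistent volume:
sample = (
    "36 25 8:3 /volumes/mysqldata /run/a rw - ext4 /dev/sda3 rw\n"
    "37 25 8:3 /volumes/mysqldata /run/b rw - ext4 /dev/sda3 rw\n"
    "38 25 8:1 / /boot rw - ext4 /dev/sda1 rw\n"
)
```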





[jira] [Assigned] (MESOS-9635) OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky again (3x) due to orphan operations

2019-04-01 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9635:


Assignee: Gastón Kleiman  (was: Greg Mann)

> OperationReconciliationTest.AgentPendingOperationAfterMasterFailover is flaky 
> again (3x) due to orphan operations
> -
>
> Key: MESOS-9635
> URL: https://issues.apache.org/jira/browse/MESOS-9635
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Benno Evers
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: foundations, mesosphere
> Attachments: failure
>
>
> This test fails consistently when run while the system is stressed:
> {code}
> [ RUN  ] 
> ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0
> F0305 08:10:07.670622  3982 hierarchical.cpp:1259] Check failed: 
> slave.getAllocated().contains(resources) {} does not contain disk(allocated: 
> default-role)[RAW(,,profile)]:200
> *** Check failure stack trace: ***
> @ 0x7f1120b0ce5e  google::LogMessage::Fail()
> @ 0x7f1120b0cdbb  google::LogMessage::SendToLog()
> @ 0x7f1120b0c7b5  google::LogMessage::Flush()
> @ 0x7f1120b0f578  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f111e536f2a  
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::recoverResources()
> @ 0x5580c2651c26  
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDERKNS1_7SlaveIDERKNS1_9ResourcesERK6OptionINS1_7FiltersEES8_SB_SE_SJ_EEvRKNS_3PIDIT_EEMSL_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_ENKUlOS6_OS9_OSC_OSH_PNS_11ProcessBaseEE_clES13_S14_S15_S16_S18_
> @ 0x5580c26c7e02  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDERKNS3_7SlaveIDERKNS3_9ResourcesERK6OptionINS3_7FiltersEESA_SD_SG_SL_EEvRKNS1_3PIDIT_EEMSN_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS8_OSB_OSE_OSJ_PNS1_11ProcessBaseEE_JS8_SB_SE_SJ_S1A_EEEDTclcl7forwardISN_Efp_Espcl7forwardIT0_Efp0_EEEOSN_DpOS1C_
> @ 0x5580c26c5b1e  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDERKNS4_7SlaveIDERKNS4_9ResourcesERK6OptionINS4_7FiltersEESB_SE_SH_SM_EEvRKNS2_3PIDIT_EEMSO_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS9_OSC_OSF_OSK_PNS2_11ProcessBaseEE_JS9_SC_SF_SK_St12_PlaceholderILi113invoke_expandIS1C_St5tupleIJS9_SC_SF_SK_S1E_EES1H_IJOS1B_EEJLm0ELm1ELm2ELm3ELm4DTcl6invokecl7forwardISO_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISS_Efp0_EEcl7forwardIST_Efp2_OSO_OSS_N5cpp1416integer_sequenceImJXspT2_OST_
> @ 0x5580c26c47ac  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDERKNS4_7SlaveIDERKNS4_9ResourcesERK6OptionINS4_7FiltersEESB_SE_SH_SM_EEvRKNS2_3PIDIT_EEMSO_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOS9_OSC_OSF_OSK_PNS2_11ProcessBaseEE_JS9_SC_SF_SK_St12_PlaceholderILi1clIJS1B_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS1K_
> @ 0x5580c26c3ad7  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDERKNS6_7SlaveIDERKNS6_9ResourcesERK6OptionINS6_7FiltersEESD_SG_SJ_SO_EEvRKNS4_3PIDIT_EEMSQ_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSB_OSE_OSH_OSM_PNS4_11ProcessBaseEE_JSB_SE_SH_SM_St12_PlaceholderILi1EJS1D_EEEDTclcl7forwardISQ_Efp_Espcl7forwardIT0_Efp0_EEEOSQ_DpOS1I_
> @ 0x5580c26c32ad  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDERKNS7_7SlaveIDERKNS7_9ResourcesERK6OptionINS7_7FiltersEESE_SH_SK_SP_EEvRKNS5_3PIDIT_EEMSR_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSC_OSF_OSI_OSN_PNS5_11ProcessBaseEE_JSC_SF_SI_SN_St12_PlaceholderILi1EJS1E_EEEvOSR_DpOT0_
> @ 0x5580c26c0a5e  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDERKNSA_7SlaveIDERKNSA_9ResourcesERK6OptionINSA_7FiltersEESH_SK_SN_SS_EEvRKNS1_3PIDIT_EEMSU_FvT0_T1_T2_T3_EOT4_OT5_OT6_OT7_EUlOSF_OSI_OSL_OSQ_S3_E_JSF_SI_SL_SQ_St12_PlaceholderILi1EEclEOS3_
> @ 0x7f1120a51c60  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> @ 0x7f1120a16a4e  process::ProcessBase::consume()
> @ 0x7f1120a3d9d8  
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> @ 0x5580c2284afa  process::ProcessBase::serve()
> @ 0x7f11

[jira] [Comment Edited] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.

2019-04-01 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707
 ] 

Qian Zhang edited comment on MESOS-9501 at 4/1/19 12:55 PM:


This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpoint enabled.
 # Stop agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta dir 
to an existing process's pid.
 # Start agent process.

Then we will see that `mesos-execute` receives a TASK_RUNNING status update 
(see below) for the command task, which has actually finished already. This is 
clearly a bug, since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
  message: 'Unreachable agent re-reregistered'
  source: SOURCE_MASTER
  reason: REASON_AGENT_REREGISTERED{code}
Moreover, TASK_RUNNING is the last status update for that task; no other task 
status updates will be generated for it unless the process that reuses the pid 
terminates. The root cause of this issue is that when the agent is started in 
step 4, it tries to destroy the executor container since the executor cannot 
reregister within `--executor_reregistration_timeout` (2 seconds by default), 
but the destroy operation cannot complete because it hangs 
[here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/mesos/containerizer.cpp#L2684:L2685]
 since the Mesos containerizer reaps an irrelevant process.


was (Author: qianzhang):
This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpoint enabled.
 # Stop agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta dir 
to an existing process's pid.
 # Start agent process.

Then we will see `mesos-execute` receives a TASK_RUNNING status update (see 
below) for the command task which has actually finished already. This is 
obviously a bug since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
  message: 'Unreachable agent re-reregistered'
  source: SOURCE_MASTER
  reason: REASON_AGENT_REREGISTERED{code}

> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will 
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and 
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to 
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid. 
> If, after the agent host reboots, a new process gets spawned with the same 
> pid, then the parent will wait for the wrong child process. This can stay 
> stuck until the wrongly waited-for process exits; see `ReaperProcess::wait()`: 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This blocks executor termination as well as future task status updates 
> (e.g., the master might still think the task is running).
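
A bare pid cannot identify a process across a reboot, since pids are recycled. One common mitigation on Linux (shown only to illustrate the hazard, not what Mesos does here) is to pair the checkpointed pid with the process start time from `/proc/<pid>/stat`:

```python
import os

def start_ticks(pid):
    """Return the process start time (field 22 of /proc/<pid>/stat, in
    clock ticks since boot). Parsing after the last ')' sidesteps spaces
    or parentheses in the comm field. A (pid, start_ticks) pair uniquely
    identifies a process instance; a bare pid does not, because the
    kernel reuses pids after a process exits or the host reboots."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    rest = stat[stat.rindex(")") + 2:].split()
    return int(rest[19])  # rest[0] is field 3, so index 19 is field 22

# A checkpoint would store both values; on recovery, a mismatch in
# start_ticks means the pid now belongs to a different process.
me = os.getpid()
checkpoint = (me, start_ticks(me))
```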





[jira] [Comment Edited] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.

2019-04-01 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707
 ] 

Qian Zhang edited comment on MESOS-9501 at 4/1/19 12:31 PM:


This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpoint enabled.
 # Stop agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta dir 
to an existing process's pid.
 # Start agent process.

Then we will see that `mesos-execute` receives a TASK_RUNNING status update 
(see below) for the command task, which has actually finished already. This is 
clearly a bug, since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
  message: 'Unreachable agent re-reregistered'
  source: SOURCE_MASTER
  reason: REASON_AGENT_REREGISTERED{code}


was (Author: qianzhang):
This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpoint enabled.
 # Stop agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta dir 
to an existing process's pid.
 # Start agent process.

Then we will see `mesos-execute` receives a TASK_RUNNING status update (see 
below) for the command task which has actually finished already. This is 
obviously a bug since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
message: 'Unreachable agent re-reregistered'
source: SOURCE_MASTER
reason: REASON_AGENT_REREGISTERED{code}

> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will 
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and 
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to 
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid. 
> If, after the agent host reboots, a new process gets spawned with the same 
> pid, then the parent will wait for the wrong child process. This can stay 
> stuck until the wrongly waited-for process exits; see `ReaperProcess::wait()`: 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This blocks executor termination as well as future task status updates 
> (e.g., the master might still think the task is running).





[jira] [Commented] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.

2019-04-01 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707
 ] 

Qian Zhang commented on MESOS-9501:
---

This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpoint enabled.
 # Stop agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta dir 
to an existing process's pid.
 # Start agent process.

Then we will see that `mesos-execute` receives a TASK_RUNNING status update 
(see below) for the command task, which has actually finished already. This is 
clearly a bug, since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
message: 'Unreachable agent re-reregistered'
source: SOURCE_MASTER
reason: REASON_AGENT_REREGISTERED{code}

> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Meng Zhu
>Assignee: Qian Zhang
>Priority: Critical
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will 
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and 
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to 
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid. 
> If, after the agent host reboots, a new process gets spawned with the same 
> pid, then the parent will wait for the wrong child process. This can stay 
> stuck until the wrongly waited-for process exits; see `ReaperProcess::wait()`: 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This blocks executor termination as well as future task status updates 
> (e.g., the master might still think the task is running).


