[jira] [Commented] (MESOS-2171) Compilation error on Ubuntu 12.04

2015-03-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370833#comment-14370833
 ] 

haosdent commented on MESOS-2171:
-

Sorry, I just found that the status of this issue is Unresolved.

 Compilation error on Ubuntu 12.04 
 --

 Key: MESOS-2171
 URL: https://issues.apache.org/jira/browse/MESOS-2171
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 0.22.0
 Environment: Ubuntu 12.04 
 java version 1.6.0_33
 gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Reporter: Mark Luntzel
Priority: Blocker

 Following http://mesos.apache.org/gettingstarted/ we get a compilation error:
 ../../../3rdparty/libprocess/src/clock.cpp:167:36:   instantiated from here
 /usr/include/c++/4.6/tr1/functional:2040:46: error: invalid initialization of 
 reference of type 'std::list<process::Timer>&' from expression of type 
 'std::list<process::Timer>'
 /usr/include/c++/4.6/tr1/functional:2040:46: error: return-statement with a 
 value, in function returning 'void' [-fpermissive]
 make[4]: *** [libprocess_la-clock.lo] Error 1
 More output here: 
 https://gist.github.com/luntzel/f5a3c62297aae812c986
 Please advise. Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370808#comment-14370808
 ] 

Ian Downes commented on MESOS-2367:
---

My opinion is that we should not change the contract between the launcher and 
the isolators: Isolator::cleanup will only be called when all processes in the 
container have been terminated.

Why?
1. The single launcher is responsible for container process lifetime; multiple 
isolators are responsible for isolating those processes.
2. Many isolators cannot complete cleanup until all processes are destroyed, in 
which case they would all be trying to do the same thing the launcher is doing; 
what else could the isolators do differently?
3. Isolators are ordered arbitrarily and called concurrently, so there's no way 
to ensure, for example, that the cpu isolator is called first.

My suggestion is that we do the following:
1. Make orphan cleanup failures non-fatal so the slave will start and we gain 
control over running tasks.
2. Add a counter for the number of containers that failed to be destroyed, 
separately counting those that fail on normal destroy and those orphans that 
fail to be destroyed. Operators can monitor these counters and act 
appropriately.
3. Extend the launcher destroy code to handle the case described here where 
processes are terminating (unmapping pages) but not making progress 
because of the very low cpu quota (the minimum is 0.01). If cgroup::destroy() 
timed out, it would examine the process's cgroup (/proc/\[pid\]/cgroup), 
increase the cpu quota to something like 0.5 or 1.0 cpu, and try again. This is 
a workaround and it goes around the cpu isolator, but I don't see a cleaner 
way to do it (a rough sketch follows below).

The case that I triaged had a JVM process with 16 GB of anonymous pages to 
unmap and it took around 16 seconds once the cpu quota was increased. I expect 
one or two additional attempts at terminating the processes and cgroup::destroy 
to be successful in all but the most extreme cases of this scenario.

Regardless of success (there are other potential failure modes), (1) and (2) 
would enable the slave to come back up and to alert operators.
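
A minimal C++ sketch of item (3), assuming a cgroup v1 "cpu" hierarchy mounted 
at /sys/fs/cgroup/cpu; the destroy callback and helper names here are 
placeholders for illustration, not the actual launcher or cgroups code:

{noformat}
// Illustrative only: raise the CFS cpu quota of a stuck container's cgroup and
// retry destruction. The destroy callback stands in for the real (timed)
// cgroups destroy call and is assumed to return false on timeout.
#include <sys/types.h>

#include <fstream>
#include <functional>
#include <sstream>
#include <string>

// Parse /proc/<pid>/cgroup and return the path within the "cpu" hierarchy.
std::string cpuCgroupOf(pid_t pid)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/cgroup");
  std::string line;
  while (std::getline(in, line)) {        // format: "<id>:<controllers>:<path>"
    auto first = line.find(':');
    auto second = line.find(':', first + 1);
    std::stringstream controllers(line.substr(first + 1, second - first - 1));
    std::string token;
    while (std::getline(controllers, token, ',')) {
      if (token == "cpu") {
        return line.substr(second + 1);
      }
    }
  }
  return "";
}

// Raise the quota to roughly `cpus` CPUs so throttled processes can finish
// unmapping pages and exit.
void bumpCpuQuota(const std::string& cgroup, double cpus)
{
  const std::string base = "/sys/fs/cgroup/cpu" + cgroup;
  long long period = 100000;              // default CFS period in microseconds
  std::ifstream(base + "/cpu.cfs_period_us") >> period;
  std::ofstream(base + "/cpu.cfs_quota_us")
      << static_cast<long long>(cpus * period);
}

// If the first destroy attempt times out, bump the quota and retry.
bool destroyWithQuotaBump(const std::function<bool()>& destroy, pid_t anyPid)
{
  if (destroy()) {
    return true;
  }

  bumpCpuQuota(cpuCgroupOf(anyPid), 1.0);  // e.g. from the 0.01 minimum

  for (int attempt = 0; attempt < 2; ++attempt) {
    if (destroy()) {
      return true;
    }
  }
  return false;  // surface via the failure counters from (2) instead of aborting
}
{noformat}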

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2438) Improve support for streaming HTTP Responses in libprocess.

2015-03-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370826#comment-14370826
 ] 

Benjamin Mahler commented on MESOS-2438:


Server-side support for sending streaming responses:

{noformat}
commit e76954abb37a30da5bb211829d7033e53d830a7f
Author: Benjamin Mahler benjamin.mah...@gmail.com
Date:   Thu Mar 5 18:33:28 2015 -0800

Introduced an http::Pipe abstraction to simplify streaming HTTP Responses.

Review: https://reviews.apache.org/r/31930
{noformat}

Patches for making streaming http::get / http::post requests are coming soon.
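
A minimal sketch of how a route handler might use the new abstraction to stream 
a chunked response; the reader()/writer()/write()/close() names are assumed 
rather than verified, see https://reviews.apache.org/r/31930 for the actual 
interface:

{noformat}
// Sketch only: stream a chunked response by splicing from a Pipe.
#include <process/future.hpp>
#include <process/http.hpp>

using process::Future;
using process::http::OK;
using process::http::Pipe;
using process::http::Request;
using process::http::Response;

Future<Response> streamingHandler(const Request& request)
{
  Pipe pipe;
  Pipe::Writer writer = pipe.writer();

  OK response;
  response.type = Response::PIPE;      // splice from the pipe, chunked encoding
  response.reader = pipe.reader();     // libprocess reads from this end
  response.headers["Content-Type"] = "text/plain";

  // Anything written to the writer end is streamed to the client; closing the
  // writer terminates the chunked response. A real handler would typically
  // keep the writer around and feed it asynchronously.
  writer.write("hello ");
  writer.write("world\n");
  writer.close();

  return response;
}
{noformat}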

 Improve support for streaming HTTP Responses in libprocess.
 ---

 Key: MESOS-2438
 URL: https://issues.apache.org/jira/browse/MESOS-2438
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
  Labels: twitter

 Currently libprocess' HTTP::Response supports a PIPE construct for doing 
 streaming responses:
 {code}
 struct Response
 {
   ...
   // Either provide a body, an absolute path to a file, or a
   // pipe for streaming a response. Distinguish between the cases
   // using 'type' below.
   //
   // BODY: Uses 'body' as the body of the response. These may be
   // encoded using gzip for efficiency, if 'Content-Encoding' is not
   // already specified.
   //
   // PATH: Attempts to perform a 'sendfile' operation on the file
   // found at 'path'.
   //
   // PIPE: Splices data from 'pipe' using 'Transfer-Encoding=chunked'.
   // Note that the read end of the pipe will be closed by libprocess
   // either after the write end has been closed or if the socket the
   // data is being spliced to has been closed (i.e., nobody is
   // listening any longer). This can cause writes to the pipe to
   // generate a SIGPIPE (which will terminate your program unless you
   // explicitly ignore them or handle them).
   //
   // In all cases (BODY, PATH, PIPE), you are expected to properly
   // specify the 'Content-Type' header, but the 'Content-Length' and/or
   // 'Transfer-Encoding' headers will be filled in for you.
   enum {
     NONE,
     BODY,
     PATH,
     PIPE
   } type;
   ...
 };
 {code}
 This interface is too low level and difficult to program against:
 * Connection closure is signaled with SIGPIPE, which is difficult for callers 
 to deal with (must suppress SIGPIPE locally or globally in order to get EPIPE 
 instead).
 * Pipes are generally for inter-process communication, and the pipe has 
 finite size. With a blocking pipe the caller must deal with blocking when the 
 pipe's buffer limit is exceeded. With a non-blocking pipe, the caller must 
 deal with retrying the write.
 We'll want to consider a few use cases:
 # Sending an HTTP::Response with streaming data.
 # Making a request with http::get and http::post in which the data is 
 returned in a streaming manner.
 # Making a request in which the request content is streaming.
 This ticket will focus on 1 as it is required for the HTTP API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2522) Add reason field for framework errors

2015-03-20 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-2522:
--
Description: 
Currently, the only insight into framework errors is a message string.  
Framework schedulers could probably be smarter about how to handle errors if 
the cause is known.  Since there are only a handful of distinct cases that 
could trigger an error, they could be captured by an enumeration.

One specific use case for this feature follows. Frameworks that intend to 
survive failover typically persist the FrameworkID somewhere.  When a framework 
has been marked completed by the master for exceeding its configured failover 
timeout, then re-registration triggers a framework error.  Probably, the 
scheduler wants to disambiguate this kind of framework error from others in 
order to invalidate the stashed FrameworkID for the next attempt at 
(re)registration.

  was:
Currently, the only insight into framework errors is a message string.  
Framework schedulers could probably be smarter about how to handle errors if 
the cause is known.  Since there are only a handful of distinct cases that 
could trigger an error, they could be captured by an enumeration.

One specific use case for this feature follows. Frameworks that intend to 
survive failover typicaly persist the FrameworkID somewhere.  When a framework 
has been marked completed by the master for exceeding its configured failover 
timeout, then re-registration triggers a framework error.  Probably, the 
scheduler wants to disambiguate this kind of framework error from others in 
order to invalidate the stashed FrameworkID for the next attempt at 
(re)gistration.


 Add reason field for framework errors
 -

 Key: MESOS-2522
 URL: https://issues.apache.org/jira/browse/MESOS-2522
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 0.22.0
Reporter: Connor Doyle
Priority: Minor
  Labels: mesosphere

 Currently, the only insight into framework errors is a message string.  
 Framework schedulers could probably be smarter about how to handle errors if 
 the cause is known.  Since there are only a handful of distinct cases that 
 could trigger an error, they could be captured by an enumeration.
 One specific use case for this feature follows. Frameworks that intend to 
 survive failover typically persist the FrameworkID somewhere.  When a 
 framework has been marked completed by the master for exceeding its 
 configured failover timeout, then re-registration triggers a framework error. 
  Probably, the scheduler wants to disambiguate this kind of framework error 
 from others in order to invalidate the stashed FrameworkID for the next 
 attempt at (re)registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2368) Provide a backchannel for information to the framework

2015-03-20 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen reassigned MESOS-2368:
---

Assignee: Timothy Chen  (was: Henning Schmiedehausen)

 Provide a backchannel for information to the framework
 --

 Key: MESOS-2368
 URL: https://issues.apache.org/jira/browse/MESOS-2368
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker
Reporter: Henning Schmiedehausen
Assignee: Timothy Chen

 So that description is not very verbose. Here is my use case:
 In our usage of Mesos and Docker, we assign IPs when the container starts up. 
 We can not allocate the IP ahead of time, but we must rely on docker to give 
 our containers their IP. This IP can be examined through docker inspect. 
 We added code to the docker containerizer that will pick up this information 
 and add it to an optional protobuf struct in the TaskStatus message. 
 Therefore, when the executor and slave report a task as running, the 
 corresponding message will contain information about the IP address that the 
 container was assigned by docker, and we can pick up this information in our 
 orchestration framework, e.g. to drive our load balancers.
 There was no good way to do that in stock Mesos, so we built that back 
 channel. However, having a generic channel (not one for four pieces of 
 arbitrary information) from the executor to a framework may be a good thing 
 in general. Clearly, this information could be transferred out of band but 
 having it in the standard Mesos communication protocol turned out to be very 
 elegant.
 To turn our current, hacked, prototype into something useful, this is what I 
 am thinking:
 - TaskStatus gains a new, optional field:
   - optional TaskContext task_context = 11; (better name suggestions very 
 welcome)
 - TaskContext has optional fields:
   - optional ContainerizerContext containerizer_context = 1;
   - optional ExecutorContext executor_context = 2;
 Each executor and containerizer can add information to the TaskContext, which 
 in turn is exposed in TaskStatus. To avoid crowding of the various fields, I 
 want to experiment with the nested extensions as described here: 
 http://www.indelible.org/ink/protobuf-polymorphism/
 At the end of the day, the goal is that any piece that is involved in 
 executing code on the slave side can send information back to the framework 
 along with TaskStatus messages. Any of these fields should be optional, to be 
 backwards compatible, and they should (like any other messages sent back) be 
 considered best effort, but this will allow an effective way to communicate 
 execution environment state back to the framework and allow the framework to 
 react to it.
 I am planning to work on this and present a cleaned-up version of our 
 prototype in a bit.
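 A purely illustrative C++ sketch of how a scheduler could consume such a 
 field; TaskContext, ContainerizerContext, and the ip() accessor below are 
 hypothetical and follow the proposal rather than anything in mesos.proto today:
 {code}
 // Hypothetical only: task_context(), containerizer_context() and ip() do not
 // exist in mesos.proto; this shows how the proposed back channel would be read.
 #include <iostream>
 
 #include <mesos/scheduler.hpp>
 
 class MyScheduler : public mesos::Scheduler
 {
 public:
   virtual void statusUpdate(
       mesos::SchedulerDriver* driver,
       const mesos::TaskStatus& status)
   {
     if (status.state() == mesos::TASK_RUNNING &&
         status.has_task_context() &&                        // hypothetical
         status.task_context().has_containerizer_context()) {
       // Hypothetical: the docker containerizer filled in the assigned IP.
       const std::string& ip =
         status.task_context().containerizer_context().ip();
 
       std::cout << "Task " << status.task_id().value()
                 << " is reachable at " << ip << std::endl;
       // ... feed this to load balancers / service discovery.
     }
   }
 
   // ... remaining Scheduler callbacks omitted for brevity.
 };
 {code}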



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2508) Slave recovering a docker container results in Unknow container error

2015-03-20 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370870#comment-14370870
 ] 

Timothy Chen commented on MESOS-2508:
-

Hi there, I think this is a known issue. Basically, each containerizer attempts 
to recover all tasks, even those it did not launch itself, so in this case the 
Docker containerizer is trying to recover Mesos containerizer tasks.



 Slave recovering a docker container results in Unknow container error
 ---

 Key: MESOS-2508
 URL: https://issues.apache.org/jira/browse/MESOS-2508
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker, slave
Affects Versions: 0.21.1
 Environment: Ubuntu 14.04.2 LTS
 Docker 1.5.0 (same error with 1.4.1)
 Mesos 0.21.1 installed from mesosphere ubuntu repo
 Marathon 0.8.0 installed from mesosphere ubuntu repo
Reporter: Geoffroy Jabouley

 I'm seeing some error logs occurring during slave recovery of a Mesos task 
 running in a docker container.
 It does not impede the slave recovery process, as the mesos task is still 
 active and running on the slave after the recovery.
 But something is not working properly when the slave recovers my docker 
 container: the slave detects my container as an Unknown container.
 Cluster status:
 - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host.
 - checkpointing is activated on both slave and framework
 - use native docker containerizer
 - 1 mesos task, started using marathon, is running inside a docker container 
 and is monitored by the mesos-slave
 Action:
 - restart the mesos-slave process (sudo restart mesos-slave)
 Expected:
 - docker container still running
 - mesos task still running
 - no error in the mesos slave log regarding recovery process
 Seen:
 - docker container still running
 - mesos task still running
 - {color:red}Several errors *Unknown container* in the mesos slave log during 
 recovery process{color}
 ---
 For what it's worth, here are my investigations:
 1) The mesos task starts fine in the docker container 
 *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker 
 container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by the 
 Mesos docker containerizer, _I guess_...
 {code}
 I0317 09:56:14.300439  2784 slave.cpp:1083] Got assigned task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.380702  2784 slave.cpp:1193] Launching task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.384466  2784 slave.cpp:3997] Launching executor 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
 20150311-150951-3982541578-5050-50860- in work directory 
 '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
 I0317 09:56:14.390207  2784 slave.cpp:1316] Queuing task 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
 '20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.421787  2782 docker.cpp:927] Starting container 
 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework 
 '20150311-150951-3982541578-5050-50860-'
 I0317 09:56:15.784143  2781 docker.cpp:633] Checkpointing pid 27080 to 
 '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
 I0317 09:56:15.789443  2784 slave.cpp:2840] Monitoring executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 '20150311-150951-3982541578-5050-50860-' in container 
 'adb71dc4-c07d-42a9-8fed-264c241668ad'
 I0317 09:56:15.862642  2784 slave.cpp:1860] Got registration for executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 20150311-150951-3982541578-5050-50860- from 
 executor(1)@10.195.96.237:36021
 I0317 09:56:15.865319  2784 slave.cpp:1979] Flushing queued task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:15.885414  2787 slave.cpp:2215] Handling status update 
 TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
 

[jira] [Commented] (MESOS-2522) Add reason field for framework errors

2015-03-20 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370894#comment-14370894
 ] 

Adam B commented on MESOS-2522:
---

This request is very much in line with the TaskStatus Reason field recently 
added in https://issues.apache.org/jira/browse/MESOS-343

 Add reason field for framework errors
 -

 Key: MESOS-2522
 URL: https://issues.apache.org/jira/browse/MESOS-2522
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 0.22.0
Reporter: Connor Doyle
Priority: Minor
  Labels: mesosphere

 Currently, the only insight into framework errors is a message string.  
 Framework schedulers could probably be smarter about how to handle errors if 
 the cause is known.  Since there are only a handful of distinct cases that 
 could trigger an error, they could be captured by an enumeration.
 One specific use case for this feature follows. Frameworks that intend to 
 survive failover typically persist the FrameworkID somewhere.  When a 
 framework has been marked completed by the master for exceeding its 
 configured failover timeout, then re-registration triggers a framework error. 
  Probably, the scheduler wants to disambiguate this kind of framework error 
 from others in order to invalidate the stashed FrameworkID for the next 
 attempt at (re)registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2491) Persist the reservation state on the slave

2015-03-20 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2491:

Sprint: Mesosphere Q1 Sprint 6 - 4/3

 Persist the reservation state on the slave
 --

 Key: MESOS-2491
 URL: https://issues.apache.org/jira/browse/MESOS-2491
 Project: Mesos
  Issue Type: Task
  Components: master, slave
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 h3. Goal
 The goal for this task is to persist the reservation state stored on the 
 master on the corresponding slave. The {{needCheckpointing}} predicate is 
 used to capture the condition for which a resource needs to be checkpointed. 
 Currently the only condition is {{isPersistentVolume}}. We'll update this to 
 include dynamically reserved resources.
 h3. Expected Outcome
 * The dynamically reserved resources will be persisted on the slave.
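 A minimal sketch of the described predicate change, assuming a helper along 
 the lines of the existing Resources::isPersistentVolume check; the 
 isDynamicallyReserved name reflects the description's intent and is an 
 assumption, not verified against the code:
 {code}
 #include <mesos/resources.hpp>
 
 // Sketch: checkpoint persistent volumes and, after this change, dynamically
 // reserved resources as well. Helper names are assumptions for illustration.
 static bool needCheckpointing(const mesos::Resource& resource)
 {
   return mesos::Resources::isPersistentVolume(resource) ||
          mesos::Resources::isDynamicallyReserved(resource);
 }
 {code}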



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2438) Improve support for streaming HTTP Responses in libprocess.

2015-03-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372339#comment-14372339
 ] 

Benjamin Mahler commented on MESOS-2438:


Did some cleanups along the way; these are the key reviews in the chain:

https://reviews.apache.org/r/32346/
https://reviews.apache.org/r/32347/
https://reviews.apache.org/r/32351/

 Improve support for streaming HTTP Responses in libprocess.
 ---

 Key: MESOS-2438
 URL: https://issues.apache.org/jira/browse/MESOS-2438
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
  Labels: twitter

 Currently libprocess' HTTP::Response supports a PIPE construct for doing 
 streaming responses:
 {code}
 struct Response
 {
   ...
   // Either provide a body, an absolute path to a file, or a
   // pipe for streaming a response. Distinguish between the cases
   // using 'type' below.
   //
   // BODY: Uses 'body' as the body of the response. These may be
   // encoded using gzip for efficiency, if 'Content-Encoding' is not
   // already specified.
   //
   // PATH: Attempts to perform a 'sendfile' operation on the file
   // found at 'path'.
   //
   // PIPE: Splices data from 'pipe' using 'Transfer-Encoding=chunked'.
   // Note that the read end of the pipe will be closed by libprocess
   // either after the write end has been closed or if the socket the
   // data is being spliced to has been closed (i.e., nobody is
   // listening any longer). This can cause writes to the pipe to
   // generate a SIGPIPE (which will terminate your program unless you
   // explicitly ignore them or handle them).
   //
   // In all cases (BODY, PATH, PIPE), you are expected to properly
   // specify the 'Content-Type' header, but the 'Content-Length' and/or
   // 'Transfer-Encoding' headers will be filled in for you.
   enum {
     NONE,
     BODY,
     PATH,
     PIPE
   } type;
   ...
 };
 {code}
 This interface is too low level and difficult to program against:
 * Connection closure is signaled with SIGPIPE, which is difficult for callers 
 to deal with (must suppress SIGPIPE locally or globally in order to get EPIPE 
 instead).
 * Pipes are generally for inter-process communication, and the pipe has 
 finite size. With a blocking pipe the caller must deal with blocking when the 
 pipe's buffer limit is exceeded. With a non-blocking pipe, the caller must 
 deal with retrying the write.
 We'll want to consider a few use cases:
 # Sending an HTTP::Response with streaming data.
 # Making a request with http::get and http::post in which the data is 
 returned in a streaming manner.
 # Making a request in which the request content is streaming.
 This ticket will focus on 1 as it is required for the HTTP API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2528) Symlink the namespace handle with ContainerID for the port mapping isolator.

2015-03-20 Thread Jie Yu (JIRA)
Jie Yu created MESOS-2528:
-

 Summary: Symlink the namespace handle with ContainerID for the 
port mapping isolator.
 Key: MESOS-2528
 URL: https://issues.apache.org/jira/browse/MESOS-2528
 Project: Mesos
  Issue Type: Improvement
Reporter: Jie Yu


This serves two purposes:
1) Allows us to enter the network namespace using container ID (instead of 
pid): ip netns exec ContainerID [commands] [args].
2) Allows us to get container ID for orphan containers during recovery. This 
will be helpful for solving MESOS-2367.

The challenge here is to solve it in a backward compatible way. I propose to 
create symlinks under /var/run/netns. For example:
/var/run/netns/containerid -> /var/run/netns/12345
(12345 is the pid)

The old code will only remove the bind mounts and leave the symlinks, which I 
think is fine since containerid is globally unique (uuid).
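
A minimal sketch of the proposed layout using plain POSIX calls (not the actual 
isolator code): create the symlink at launch and resolve it back to the 
pid-named handle during recovery:

{noformat}
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>

// At launch: /var/run/netns/<containerid> -> /var/run/netns/<pid>
bool linkNamespaceHandle(const std::string& containerId, pid_t pid)
{
  const std::string target = "/var/run/netns/" + std::to_string(pid);
  const std::string link = "/var/run/netns/" + containerId;
  return ::symlink(target.c_str(), link.c_str()) == 0;
}

// During recovery: map an orphan's container ID back to its pid-named handle.
std::string resolveNamespaceHandle(const std::string& containerId)
{
  const std::string link = "/var/run/netns/" + containerId;
  char buffer[PATH_MAX];
  ssize_t length = ::readlink(link.c_str(), buffer, sizeof(buffer) - 1);
  if (length < 0) {
    return "";  // not a symlink we created (e.g. an old-style handle)
  }
  buffer[length] = '\0';
  return std::string(buffer);
}
{noformat}

Since ip netns simply opens names under /var/run/netns, the symlinked container 
ID should also work directly with ip netns exec.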



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2525) Missing information in Python interface launchTasks scheduler method

2015-03-20 Thread Itamar Ostricher (JIRA)
Itamar Ostricher created MESOS-2525:
---

 Summary: Missing information in Python interface launchTasks 
scheduler method
 Key: MESOS-2525
 URL: https://issues.apache.org/jira/browse/MESOS-2525
 Project: Mesos
  Issue Type: Documentation
  Components: python api
Affects Versions: 0.21.0
Reporter: Itamar Ostricher


The docstring of the launchTasks scheduler method in the Python API should 
explicitly state that launching multiple tasks onto multiple offers is 
supported only as long as all offers are from the same slave.

See mailing list thread: 
http://www.mail-archive.com/user@mesos.apache.org/msg02861.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2524) Mesos-containerizer not linked from main documentation page.

2015-03-20 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-2524:
--

 Summary: Mesos-containerizer not linked from main documentation 
page.
 Key: MESOS-2524
 URL: https://issues.apache.org/jira/browse/MESOS-2524
 Project: Mesos
  Issue Type: Documentation
Reporter: Joerg Schad
Assignee: Joerg Schad
Priority: Minor


Is there any reason that the mesos-containerizer 
(http://mesos.apache.org/documentation/latest/mesos-containerizer/) 
documentation is not linked from the main documentation page 
(http://mesos.apache.org/documentation/latest/)? Both the docker and external 
containerizer pages are linked from there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2018) Dynamic Reservation

2015-03-20 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2018:

Description: 
This is a feature to provide better support for running stateful services on 
Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), 
or MySQL (Local Database).
Current resource reservations (henceforth called static reservations) are 
statically determined by the slave operator at slave start time, and individual 
frameworks have no authority to reserve resources themselves.
Dynamic reservations allow a framework to dynamically reserve offered 
resources, such that those resources will only be re-offered to the same 
framework (or other frameworks with the same role).
This is especially useful if the framework's task stored some state on the 
slave, and needs a guaranteed set of resources reserved so that it can 
re-launch a task on the same slave to recover that state.

h3. Planned Stages

[MESOS-2489|

  was:
This is a feature to provide better support for running stateful services on 
Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), 
or MySQL (Local Database).
Current resource reservations (henceforth called static reservations) are 
statically determined by the slave operator at slave start time, and individual 
frameworks have no authority to reserve resources themselves.
Dynamic reservations allow a framework to dynamically reserve offered 
resources, such that those resources will only be re-offered to the same 
framework (or other frameworks with the same role).
This is especially useful if the framework's task stored some state on the 
slave, and needs a guaranteed set of resources reserved so that it can 
re-launch a task on the same slave to recover that state.


 Dynamic Reservation
 ---

 Key: MESOS-2018
 URL: https://issues.apache.org/jira/browse/MESOS-2018
 Project: Mesos
  Issue Type: Epic
  Components: allocation, framework, master, slave
Reporter: Adam B
Assignee: Michael Park
  Labels: mesosphere, offer, persistence, reservations, resource, 
 stateful, storage

 This is a feature to provide better support for running stateful services on 
 Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed 
 Database), or MySQL (Local Database).
 Current resource reservations (henceforth called static reservations) are 
 statically determined by the slave operator at slave start time, and 
 individual frameworks have no authority to reserve resources themselves.
 Dynamic reservations allow a framework to dynamically reserve offered 
 resources, such that those resources will only be re-offered to the same 
 framework (or other frameworks with the same role).
 This is especially useful if the framework's task stored some state on the 
 slave, and needs a guaranteed set of resources reserved so that it can 
 re-launch a task on the same slave to recover that state.
 h3. Planned Stages
 [MESOS-2489|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2525) Missing information in Python interface launchTasks scheduler method

2015-03-20 Thread Itamar Ostricher (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371434#comment-14371434
 ] 

Itamar Ostricher commented on MESOS-2525:
-

I submitted a [review request|https://reviews.apache.org/r/32306/]

 Missing information in Python interface launchTasks scheduler method
 

 Key: MESOS-2525
 URL: https://issues.apache.org/jira/browse/MESOS-2525
 Project: Mesos
  Issue Type: Documentation
  Components: python api
Affects Versions: 0.21.0
Reporter: Itamar Ostricher
  Labels: documentation, newbie, patch
   Original Estimate: 1m
  Remaining Estimate: 1m

 The docstring of the launchTasks scheduler method in the Python API should 
 explicitly state that launching multiple tasks onto multiple offers is 
 supported only as long as all offers are from the same slave.
 See mailing list thread: 
 http://www.mail-archive.com/user@mesos.apache.org/msg02861.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2070) Implement simple slave recovery behavior for fetcher cache

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2070:
--
Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, 
Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 1 - 
1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere 
Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Implement simple slave recovery behavior for fetcher cache
 --

 Key: MESOS-2070
 URL: https://issues.apache.org/jira/browse/MESOS-2070
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
  Labels: newbie
   Original Estimate: 6h
  Remaining Estimate: 6h

 Clean the fetcher cache completely upon slave restart/recovery. This 
 implements correct, albeit not ideal behavior. More efficient schemes that 
 restore knowledge about cached files or even resume downloads can be added 
 later. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2425) TODO comment in mesos.proto is already implemented

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2425:
--
Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: 
Mesosphere Q1 Sprint 5 - 3/20)

 TODO comment in mesos.proto is already implemented
 --

 Key: MESOS-2425
 URL: https://issues.apache.org/jira/browse/MESOS-2425
 Project: Mesos
  Issue Type: Bug
  Components: general
Affects Versions: 0.20.1
Reporter: Aaron Bell
Assignee: Aaron Bell
Priority: Minor
  Labels: mesosphere
 Attachments: mesos-2425-1.diff


 These lines are redundant in mesos.proto, since CommandInfo is now 
 implemented:
 https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L169-L174
 I'm creating a patch with edits on comment lines only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2226:
--
Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, 
Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 
Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 HookTest.VerifySlaveLaunchExecutorHook is flaky
 ---

 Key: MESOS-2226
 URL: https://issues.apache.org/jira/browse/MESOS-2226
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Vinod Kone
Assignee: Kapil Arya
  Labels: flaky-test

 Observed this on internal CI
 {code}
 [ RUN  ] HookTest.VerifySlaveLaunchExecutorHook
 Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME'
 I0114 18:51:34.659353  4720 leveldb.cpp:176] Opened db in 1.255951ms
 I0114 18:51:34.662112  4720 leveldb.cpp:183] Compacted db in 596090ns
 I0114 18:51:34.662364  4720 leveldb.cpp:198] Created db iterator in 177877ns
 I0114 18:51:34.662719  4720 leveldb.cpp:204] Seeked to beginning of db in 
 19709ns
 I0114 18:51:34.663010  4720 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 18208ns
 I0114 18:51:34.663312  4720 replica.cpp:744] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0114 18:51:34.664266  4735 recover.cpp:449] Starting replica recovery
 I0114 18:51:34.664908  4735 recover.cpp:475] Replica is in EMPTY status
 I0114 18:51:34.667842  4734 replica.cpp:641] Replica in EMPTY status received 
 a broadcasted recover request
 I0114 18:51:34.669117  4735 recover.cpp:195] Received a recover response from 
 a replica in EMPTY status
 I0114 18:51:34.677913  4735 recover.cpp:566] Updating replica status to 
 STARTING
 I0114 18:51:34.683157  4735 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 137939ns
 I0114 18:51:34.683507  4735 replica.cpp:323] Persisted replica status to 
 STARTING
 I0114 18:51:34.684013  4735 recover.cpp:475] Replica is in STARTING status
 I0114 18:51:34.685554  4738 replica.cpp:641] Replica in STARTING status 
 received a broadcasted recover request
 I0114 18:51:34.696512  4736 recover.cpp:195] Received a recover response from 
 a replica in STARTING status
 I0114 18:51:34.700552  4735 recover.cpp:566] Updating replica status to VOTING
 I0114 18:51:34.701128  4735 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 115624ns
 I0114 18:51:34.701478  4735 replica.cpp:323] Persisted replica status to 
 VOTING
 I0114 18:51:34.701817  4735 recover.cpp:580] Successfully joined the Paxos 
 group
 I0114 18:51:34.702569  4735 recover.cpp:464] Recover process terminated
 I0114 18:51:34.716439  4736 master.cpp:262] Master 
 20150114-185134-2272962752-57018-4720 (fedora-19) started on 
 192.168.122.135:57018
 I0114 18:51:34.716913  4736 master.cpp:308] Master only allowing 
 authenticated frameworks to register
 I0114 18:51:34.717136  4736 master.cpp:313] Master only allowing 
 authenticated slaves to register
 I0114 18:51:34.717488  4736 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials'
 I0114 18:51:34.718077  4736 master.cpp:357] Authorization enabled
 I0114 18:51:34.719238  4738 whitelist_watcher.cpp:65] No whitelist given
 I0114 18:51:34.719755  4737 hierarchical_allocator_process.hpp:285] 
 Initialized hierarchical allocator process
 I0114 18:51:34.722584  4736 master.cpp:1219] The newly elected leader is 
 master@192.168.122.135:57018 with id 20150114-185134-2272962752-57018-4720
 I0114 18:51:34.722865  4736 master.cpp:1232] Elected as the leading master!
 I0114 18:51:34.723310  4736 master.cpp:1050] Recovering from registrar
 I0114 18:51:34.723760  4734 registrar.cpp:313] Recovering registrar
 I0114 18:51:34.725229  4740 log.cpp:660] Attempting to start the writer
 I0114 18:51:34.727893  4739 replica.cpp:477] Replica received implicit 
 promise request with proposal 1
 I0114 18:51:34.728425  4739 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 114781ns
 I0114 18:51:34.728662  4739 replica.cpp:345] Persisted promised to 1
 I0114 18:51:34.731271  4741 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0114 18:51:34.733223  4734 replica.cpp:378] Replica received explicit 
 promise request for position 0 with proposal 2
 I0114 18:51:34.734076  4734 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 87441ns
 I0114 18:51:34.734441  4734 replica.cpp:679] Persisted action at 0
 I0114 18:51:34.740272  4739 replica.cpp:511] Replica received write request 
 for position 0
 I0114 18:51:34.740910  4739 leveldb.cpp:438] Reading position 

[jira] [Updated] (MESOS-2373) DRFSorter needs to distinguish resources from different slaves.

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2373:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 DRFSorter needs to distinguish resources from different slaves.
 ---

 Key: MESOS-2373
 URL: https://issues.apache.org/jira/browse/MESOS-2373
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 Currently the {{DRFSorter}} aggregates total and allocated resources across 
 multiple slaves, which only works for scalar resources. We need to 
 distinguish resources from different slaves.
 Suppose we have 2 slaves and 1 framework. The framework is allocated all 
 resources from both slaves.
 {code}
 Resources slaveResources =
   Resources::parse("cpus:2;mem:512;ports:[31000-32000]").get();
 DRFSorter sorter;
 sorter.add(slaveResources);  // Add slave1 resources
 sorter.add(slaveResources);  // Add slave2 resources
 // Total resources in sorter at this point is
 // cpus(*):4; mem(*):1024; ports(*):[31000-32000].
 // The scalar resources get aggregated correctly but ports do not.
 sorter.add(F);
 // The 2 calls to allocated() only work because we simply do:
 //   allocation[name] += resources;
 // without checking that the 'resources' is available in the total.
 sorter.allocated(F, slaveResources);
 sorter.allocated(F, slaveResources);
 // At this point, sorter.allocation(F) is:
 // cpus(*):4; mem(*):1024; ports(*):[31000-32000].
 {code}
 To provide some context, this issue came up while trying to reserve all 
 unreserved resources from every offer.
 {code}
 for (const Offer& offer : offers) { 
   Resources unreserved = offer.resources().unreserved();
   Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK); 
   Offer::Operation reserve;
   reserve.set_type(Offer::Operation::RESERVE); 
   reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved); 
  
   driver->acceptOffers({offer.id()}, {reserve}); 
 } 
 {code}
 Suppose the slave resources are the same as above:
 {quote}
 Slave1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}}
 Slave2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}}
 {quote}
 Initial (incorrect) total resources in the DRFSorter is:
 {quote}
 {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}}
 {quote}
 We receive 2 offers, 1 from each slave:
 {quote}
 Offer1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}}
 Offer2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}}
 {quote}
 At this point, the resources allocated for the framework is:
 {quote}
 {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}}
 {quote}
 After first {{RESERVE}} operation with Offer1:
 The allocated resources for the framework becomes:
 {quote}
 {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; 
 ports(role):\[31000-32000\]}}
 {quote}
 During second {{RESERVE}} operation with Offer2:
 {code:title=HierarchicalAllocatorProcess::updateAllocation}
   // ...
   FrameworkSorter* frameworkSorter =
     frameworkSorters[frameworks[frameworkId].role];
   Resources allocation = frameworkSorter->allocation(frameworkId.value());
   // Update the allocated resources.
   Try<Resources> updatedAllocation = allocation.apply(operations);
   CHECK_SOME(updatedAllocation);
   // ...
 {code}
 {{allocation}} in the above code is:
 {quote}
 {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; 
 ports(role):\[31000-32000\]}}
 {quote}
 We try to {{apply}} a {{RESERVE}} operation and we fail to find 
 {{ports(\*):\[31000-32000\]}} which leads to the {{CHECK}} fail at 
 {{CHECK_SOME(updatedAllocation);}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2119) Add Socket tests

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2119:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Add Socket tests
 

 Key: MESOS-2119
 URL: https://issues.apache.org/jira/browse/MESOS-2119
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen
Assignee: Joris Van Remoortere

 Add more Socket-specific tests to get coverage while doing the libev to 
 libevent (with and without SSL) move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2069) Basic fetcher cache functionality

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2069:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Basic fetcher cache functionality
 -

 Key: MESOS-2069
 URL: https://issues.apache.org/jira/browse/MESOS-2069
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
  Labels: fetcher, slave
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add a flag to CommandInfo URI protobufs that indicates that files downloaded 
 by the fetcher shall be cached in a repository. To be followed by MESOS-2057 
 for concurrency control.
 Also see MESOS-336 for the overall goals for the fetcher cache.
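 A hedged sketch of framework-side use once such a flag exists; the set_cache 
 field name is an assumption about the eventual protobuf, not current API:
 {code}
 #include <mesos/mesos.pb.h>
 
 mesos::CommandInfo makeCommand()
 {
   mesos::CommandInfo command;
   command.set_value("./my-task");
 
   mesos::CommandInfo::URI* uri = command.add_uris();
   uri->set_value("hdfs://namenode/artifacts/my-task.tar.gz");  // example URI
   uri->set_extract(true);  // existing behavior: unpack the archive
   uri->set_cache(true);    // proposed flag: let the fetcher reuse a cached copy
   return command;
 }
 {code}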



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1806:
--
Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, 
Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 1 - 
1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere 
Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Substituting etcd or ReplicatedLog for Zookeeper
 

 Key: MESOS-1806
 URL: https://issues.apache.org/jira/browse/MESOS-1806
 Project: Mesos
  Issue Type: Task
Reporter: Ed Ropple
Assignee: Cody Maloney
Priority: Minor

 adam_mesos   eropple: Could you also file a new JIRA for Mesos to drop ZK 
 in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
 that one.
 --
 Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2248) 0.22.0 release

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2248:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 0.22.0 release
 --

 Key: MESOS-2248
 URL: https://issues.apache.org/jira/browse/MESOS-2248
 Project: Mesos
  Issue Type: Epic
Reporter: Niklas Quarfot Nielsen
Assignee: Niklas Quarfot Nielsen

 Mesos release 0.22.0 will include the following major feature(s):
  - Module Hooks (MESOS-2060)
  - Disk quota isolation in Mesos containerizer (MESOS-1587 and MESOS-1588)
 Minor features and fixes:
  - Task labels (MESOS-2120)
  - Service discovery info for tasks and executors (MESOS-2208)
 - Docker containerizer able to recover when running in a container 
 (MESOS-2115)
  - Containerizer fixes (...)
  - Various bug fixes (...)
 Possible major features:
  - Container level network isolation (MESOS-1585)
  - Dynamic Reservations (MESOS-2018)
 This ticket will be used to track blockers to this release.
 For reference (per Jan 22nd) this has gone into Mesos since 0.21.1: 
 https://gist.github.com/nqn/76aeb41a555625659ed8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2215:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20  (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere 
Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3)

 The Docker containerizer attempts to recover any task when checkpointing is 
 enabled, not just docker tasks.
 ---

 Key: MESOS-2215
 URL: https://issues.apache.org/jira/browse/MESOS-2215
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 0.21.0
Reporter: Steve Niemitz
Assignee: Timothy Chen

 Once the slave restarts and recovers the task, I see this error in the log 
 for all tasks that were recovered every second or so.  Note, these were NOT 
 docker tasks:
 W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
 for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
 thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
  of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
 inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
 with status 1 stderr = Error: No such image or container: 
 mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
 However the tasks themselves are still healthy and running.
 The slave was launched with --containerizers=mesos,docker
 -
 More info: it looks like the docker containerizer is a little too ambitious 
 about recovering containers, again this was not a docker task:
 I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
 '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
  of framework 20150109-161713-715350282-5050-290797-
 Looking into the source, it looks like the problem is that the 
 ComposingContainerizer runs recover in parallel, but neither the docker 
 containerizer nor the mesos containerizer checks whether it should recover the 
 task (i.e., whether it was the one that launched it).  Perhaps this needs to 
 be written into the checkpoint somewhere?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2018) Dynamic Reservation

2015-03-20 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2018:

Description: 
This is a feature to provide better support for running stateful services on 
Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), 
or MySQL (Local Database).
Current resource reservations (henceforth called static reservations) are 
statically determined by the slave operator at slave start time, and individual 
frameworks have no authority to reserve resources themselves.
Dynamic reservations allow a framework to dynamically reserve offered 
resources, such that those resources will only be re-offered to the same 
framework (or other frameworks with the same role).
This is especially useful if the framework's task stored some state on the 
slave, and needs a guaranteed set of resources reserved so that it can 
re-launch a task on the same slave to recover that state.

h3. Planned Stages

1. MESOS-2489: Enable a framework to perform reservation operations.

The goal of this stage is to allow the framework to send back a 
Reserve/Unreserve operation which gets validated by the master and updates the 
allocator resources. The allocator's {{allocate}} logic is unchanged and 
therefore the resources get offered back to the framework's role rather than 
the specific framework. In the next stage, we'll teach the allocator to 
distinguish between role and framework reservations which will result in the 
resources being re-offered to the specific framework.

2. MESOS-2490: Enable the allocator to distinguish between role and framework 
reservations.

The goal of this stage is to teach the allocator to offer resources reserved 
for a framework to only be sent to the particular framework rather than the 
framework's role. This will involve updating the {{allocate}} function to 
select only the resources that are unreserved, role-reserved for the 
framework's role and framework-reserved for the framework.

3. MESOS-2491: Persist the reservation state on the slave.

The goal of this stage is to persist the reservation state on the slave. 
Currently the master knows to store the persistent volumes in the 
{{checkpointedResources}} data structure which gets sent to individual slaves 
to be checkpointed. We will update the master such that dynamically reserved 
resources are stored in the {{checkpointedResources}} as well. This stage also 
involves subtasks such as updating the slave (re)register logic to support 
slave restarts.

  was:
This is a feature to provide better support for running stateful services on 
Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), 
or MySQL (Local Database).
Current resource reservations (henceforth called static reservations) are 
statically determined by the slave operator at slave start time, and individual 
frameworks have no authority to reserve resources themselves.
Dynamic reservations allow a framework to dynamically reserve offered 
resources, such that those resources will only be re-offered to the same 
framework (or other frameworks with the same role).
This is especially useful if the framework's task stored some state on the 
slave, and needs a guaranteed set of resources reserved so that it can 
re-launch a task on the same slave to recover that state.

h3. Planned Stages

1. MESOS-2489: Enable a framework to perform reservation operations.

The goal of this stage is to allow the framework to send back a 
Reserve/Unreserve operation which gets validated by the master and updates the 
allocator resources. The allocator's {{allocate}} logic is unchanged and 
therefore the resources get offered back to the framework's role rather than 
the specific framework. In the next stage, we'll teach the allocator to 
distinguish between role and framework reservations which will result in the 
resources being re-offered to the specific framework.

2. MESOS-2490: Enable the allocator to distinguish between role and framework 
reservations.

The goal of this stage of is to teach the allocator to offer resources reserved 
for a framework to only be sent to the particular framework rather than the 
framework's role. This will involve updating the {{allocate}} function to 
select only the resources that are unreserved, role-reserved for the 
framework's role and framework-reserved for the framework.

3. 


 Dynamic Reservation
 ---

 Key: MESOS-2018
 URL: https://issues.apache.org/jira/browse/MESOS-2018
 Project: Mesos
  Issue Type: Epic
  Components: allocation, framework, master, slave
Reporter: Adam B
Assignee: Michael Park
  Labels: mesosphere, offer, persistence, reservations, resource, 
 stateful, storage

 This is a feature to provide better support for running stateful services on 
 Mesos such as HDFS 

[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1806:
--
Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, 
Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20  (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 
2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3)

 Substituting etcd or ReplicatedLog for Zookeeper
 

 Key: MESOS-1806
 URL: https://issues.apache.org/jira/browse/MESOS-1806
 Project: Mesos
  Issue Type: Task
Reporter: Ed Ropple
Assignee: Cody Maloney
Priority: Minor

 adam_mesos   eropple: Could you also file a new JIRA for Mesos to drop ZK 
 in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
 that one.
 --
 Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2155) Make docker containerizer killing orphan containers optional

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2155:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20  (was: Mesosphere Q4 Sprint 3 - 
12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 
3/20, Mesosphere Q1 Sprint 6 - 4/3)

 Make docker containerizer killing orphan containers optional
 

 Key: MESOS-2155
 URL: https://issues.apache.org/jira/browse/MESOS-2155
 Project: Mesos
  Issue Type: Improvement
  Components: docker
Reporter: Timothy Chen
Assignee: Timothy Chen

 Currently, on recovery, the docker containerizer will kill containers that are 
 not recognized by the containerizer.
 We want to make this behavior optional, as there are certain situations in 
 which we want to let the docker containers continue to run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1980) Benchmark RPC/s of linked Libprocess

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1980:
--
Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, 
Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 
1/23)

 Benchmark RPC/s of linked Libprocess
 

 Key: MESOS-1980
 URL: https://issues.apache.org/jira/browse/MESOS-1980
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere

 Libprocess has some performance bottlenecks. Implement a benchmark where we 
 can see regressions / improvements in RPCs performed per second.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2016:
--
Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, 
Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 
Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20  
(was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere 
Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere 
Q1 Sprint 6 - 4/3)

 docker_name_prefix is too generic
 -

 Key: MESOS-2016
 URL: https://issues.apache.org/jira/browse/MESOS-2016
 Project: Mesos
  Issue Type: Bug
Reporter: Jay Buffington
Assignee: Timothy Chen

 From docker.hpp and docker.cpp:
 {code}
 // Prefix used to name Docker containers in order to distinguish those
 // created by Mesos from those created manually.
 extern std::string DOCKER_NAME_PREFIX;
 // TODO(benh): At some point to run multiple slaves we'll need to make
 // the Docker container name creation include the slave ID.
 string DOCKER_NAME_PREFIX = "mesos-";
 {code}
 This name is too generic.  A common pattern in docker land is to run 
 everything in a container and use volume mounts to share sockets and do RPC 
 between containers.  CoreOS has popularized this technique. 
 Inevitably, what people do is start a container named "mesos-slave", which 
 runs the docker containerizer recovery code, which removes all containers that 
 start with "mesos-". And then they ask "huh, why did my mesos-slave docker 
 container die? I don't see any error messages..."
 Ideally, we should do what Ben suggested and add the slave id to the name 
 prefix.
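 One possible shape of a slave-ID-qualified name, purely as an illustration 
 (the actual naming scheme is up to whoever takes the patch):
 {code}
 // Illustrative only: qualify the container name with the slave ID so a
 // recovering containerizer only matches names it could have created itself.
 #include <string>

 std::string containerName(const std::string& slaveId,
                           const std::string& containerId)
 {
   return "mesos-" + slaveId + "." + containerId;
 }
 {code}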



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-905) Remove Framework.id in favor of FrameworkInfo.id

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-905:
-
Sprint: Mesosphere Q1 Sprint 6 - 4/3

 Remove Framework.id in favor of FrameworkInfo.id
 

 Key: MESOS-905
 URL: https://issues.apache.org/jira/browse/MESOS-905
 Project: Mesos
  Issue Type: Improvement
  Components: framework
Reporter: Adam B
Assignee: Adam B
Priority: Minor
  Labels: easyfix

 Framework.id currently holds the correct FrameworkId, but Framework also 
 contains a FrameworkInfo, and the FrameworkInfo.id is not necessarily set.
 I propose that we eliminate the Framework.id member variable and replace it 
 with a Framework.id() accessor that references Framework.FrameworkInfo.id and 
 ensure that it is correctly set.
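 A rough sketch of the proposed accessor, assuming the master's Framework 
 struct keeps its FrameworkInfo in a member named {{info}} (names here are 
 illustrative, not the actual code):
 {code}
 #include <mesos/mesos.hpp>

 struct Framework {
   mesos::FrameworkInfo info;  // info.id() must always be set on registration.

   // Replaces the old Framework.id member variable.
   const mesos::FrameworkID& id() const { return info.id(); }
 };
 {code}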



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-540) Executor health checking.

2015-03-20 Thread Rohit Upadhyaya (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372532#comment-14372532
 ] 

Rohit Upadhyaya commented on MESOS-540:
---

Is this being worked on by someone else? I see a 'health-check' folder in src/ 
having some sample code added about a month back.

 Executor health checking.
 -

 Key: MESOS-540
 URL: https://issues.apache.org/jira/browse/MESOS-540
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler
  Labels: newbie, twitter

 We currently do not health check running executors.
 At Twitter, this has led to out-of-band health checking of executors for an 
 internal framework.
 For the Storm framework, this has led to out-of-band health checking via 
 ZooKeeper. Health checking would allow Storm to use finer grained executors 
 for better isolation.
 This also helps the Hadoop and Jenkins frameworks as well should health 
 checking be desired.
 As for implementation, I would propose adding a call on the Executor 
 interface:
 /**
  * Invoked by the ExecutorDriver to determine the health of the executor.
  * When this function returns, the Executor is considered healthy.
  */
 virtual void heartbeat(ExecutorDriver* driver) = 0;
 The driver can then heartbeat periodically and kill when the Executor is not 
 responding to heartbeats. The driver should also detect the executor 
 deadlocking on any of the other callbacks.
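 Purely as a sketch of the driver-side watchdog described above (plain 
 std::chrono instead of the real libprocess machinery):
 {code}
 #include <chrono>

 class HeartbeatWatchdog {
 public:
   using Clock = std::chrono::steady_clock;

   explicit HeartbeatWatchdog(std::chrono::seconds timeout)
     : timeout_(timeout), lastBeat_(Clock::now()) {}

   // Call whenever the executor's heartbeat(driver) callback returns.
   void beat() { lastBeat_ = Clock::now(); }

   // True once the executor has missed its deadline and should be treated
   // as hung (and, e.g., killed by the driver).
   bool expired() const { return Clock::now() - lastBeat_ > timeout_; }

 private:
   std::chrono::seconds timeout_;
   Clock::time_point lastBeat_;
 };
 {code}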



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation

2015-03-20 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-2422:
--
Sprint: Twitter Mesos Q1 Sprint 4  (was: Twitter Mesos Q1 Sprint 4, Twitter 
Mesos Q1 Sprint 5)

 Use fq_codel qdisc for egress network traffic isolation
 ---

 Key: MESOS-2422
 URL: https://issues.apache.org/jira/browse/MESOS-2422
 Project: Mesos
  Issue Type: Task
Reporter: Cong Wang
Assignee: Cong Wang
  Labels: twitter





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2057) Concurrency control for fetcher cache

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2057:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Concurrency control for fetcher cache
 -

 Key: MESOS-2057
 URL: https://issues.apache.org/jira/browse/MESOS-2057
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
   Original Estimate: 96h
  Remaining Estimate: 96h

 Having added a URI flag to CommandInfo messages (in MESOS-2069) that 
 indicates caching, and having added caching of files downloaded by the fetcher 
 in a repository, now ensure that when a URI is cached it is only ever 
 downloaded once for the same user on the same slave, as long as the slave 
 keeps running. 
 This even holds if multiple tasks request the same URI concurrently. If 
 multiple requests for the same URI occur, perform only one of them and reuse 
 the result. Make concurrent requests for the same URI wait for the one 
 download. 
 Different URIs from different CommandInfos can be downloaded concurrently.
 No cache eviction, cleanup or failover will be handled for now. Additional 
 tickets will be filed for these enhancements. (So don't use this feature in 
 production until the whole epic is complete.)
 Note that implementing this does not suffice for production use. This ticket 
 contains the main part of the fetcher logic, though. See the epic MESOS-336 
 for the rest of the features that lead to a fully functional fetcher cache.
 The proposed general approach is to keep all bookkeeping about what is in 
 which stage of being fetched and where it resides in the slave's 
 MesosContainerizerProcess, so that all concurrent access is disambiguated and 
 controlled by an actor (aka libprocess process).
 Depends on MESOS-2056 and MESOS-2069.
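 An illustrative sketch of the de-duplication idea, using std::mutex/std::async 
 as a stand-in for the actor-based serialization inside 
 MesosContainerizerProcess:
 {code}
 #include <functional>
 #include <future>
 #include <map>
 #include <mutex>
 #include <string>
 #include <utility>

 class DownloadCache {
 public:
   // Returns a shared future for the cached file path. Only the first caller
   // for a given (user, URI) key actually triggers `download`; concurrent
   // callers wait on the same in-flight result.
   std::shared_future<std::string> fetch(
       const std::string& key,
       std::function<std::string()> download)
   {
     std::lock_guard<std::mutex> lock(mutex_);
     auto it = inflight_.find(key);
     if (it != inflight_.end()) {
       return it->second;
     }
     std::shared_future<std::string> result =
       std::async(std::launch::async, std::move(download)).share();
     inflight_.emplace(key, result);
     return result;
   }

 private:
   std::mutex mutex_;
   std::map<std::string, std::shared_future<std::string>> inflight_;
 };
 {code}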



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2050) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2050:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 InMemoryAuxProp plugin used by Authenticators results in SEGFAULT
 -

 Key: MESOS-2050
 URL: https://issues.apache.org/jira/browse/MESOS-2050
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Till Toenshoff

 Observed this on ASF CI:
 Basically, as part of the recent Auth refactor for modules, the loading of 
 secrets is being done once per Authenticator Process instead of once in the 
 Master.  Since, InMemoryAuxProp plugin manipulates static variables (e.g, 
 'properties') it results in SEGFAULT when one Authenticator (e.g., for slave) 
 does load() while another Authenticator (e.g., for framework) does lookup(), 
 as both these methods manipulate static 'properties'.
 {code}
 [ RUN  ] MasterTest.LaunchDuplicateOfferTest
 Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp'
 I1104 03:37:55.523553 28363 leveldb.cpp:176] Opened db in 2.270387ms
 I1104 03:37:55.524250 28363 leveldb.cpp:183] Compacted db in 662527ns
 I1104 03:37:55.524276 28363 leveldb.cpp:198] Created db iterator in 4964ns
 I1104 03:37:55.524284 28363 leveldb.cpp:204] Seeked to beginning of db in 
 702ns
 I1104 03:37:55.524291 28363 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 450ns
 I1104 03:37:55.524333 28363 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I1104 03:37:55.524852 28384 recover.cpp:437] Starting replica recovery
 I1104 03:37:55.525188 28384 recover.cpp:463] Replica is in EMPTY status
 I1104 03:37:55.526577 28378 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I1104 03:37:55.527135 28378 master.cpp:318] Master 
 20141104-033755-3176252227-49988-28363 (proserpina.apache.org) started on 
 67.195.81.189:49988
 I1104 03:37:55.527180 28378 master.cpp:364] Master only allowing 
 authenticated frameworks to register
 I1104 03:37:55.527191 28378 master.cpp:369] Master only allowing 
 authenticated slaves to register
 I1104 03:37:55.527217 28378 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp/credentials'
 I1104 03:37:55.527451 28378 master.cpp:408] Authorization enabled
 I1104 03:37:55.528081 28384 master.cpp:126] No whitelist given. Advertising 
 offers for all slaves
 I1104 03:37:55.528548 28383 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I1104 03:37:55.528645 28388 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@67.195.81.189:49988
 I1104 03:37:55.529233 28388 master.cpp:1258] The newly elected leader is 
 master@67.195.81.189:49988 with id 20141104-033755-3176252227-49988-28363
 I1104 03:37:55.529266 28388 master.cpp:1271] Elected as the leading master!
 I1104 03:37:55.529289 28388 master.cpp:1089] Recovering from registrar
 I1104 03:37:55.529311 28385 recover.cpp:554] Updating replica status to 
 STARTING
 I1104 03:37:55.529500 28384 registrar.cpp:313] Recovering registrar
 I1104 03:37:55.530037 28383 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 497965ns
 I1104 03:37:55.530083 28383 replica.cpp:320] Persisted replica status to 
 STARTING
 I1104 03:37:55.530335 28387 recover.cpp:463] Replica is in STARTING status
 I1104 03:37:55.531343 28381 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I1104 03:37:55.531739 28384 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I1104 03:37:55.532168 28379 recover.cpp:554] Updating replica status to VOTING
 I1104 03:37:55.532572 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 293974ns
 I1104 03:37:55.532594 28381 replica.cpp:320] Persisted replica status to 
 VOTING
 I1104 03:37:55.532790 28390 recover.cpp:568] Successfully joined the Paxos 
 group
 I1104 03:37:55.533107 28390 recover.cpp:452] Recover process terminated
 I1104 03:37:55.533604 28382 log.cpp:656] Attempting to start the writer
 I1104 03:37:55.534840 28381 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I1104 03:37:55.535188 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 321021ns
 I1104 03:37:55.535212 28381 replica.cpp:342] Persisted promised to 1
 I1104 03:37:55.535893 28378 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I1104 

[jira] [Updated] (MESOS-2110) Configurable Ping Timeouts

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2110:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Configurable Ping Timeouts
 --

 Key: MESOS-2110
 URL: https://issues.apache.org/jira/browse/MESOS-2110
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Adam B
Assignee: Adam B
  Labels: master, network, slave, timeout

 After a series of ping failures, the master considers the slave lost and 
 calls shutdownSlave, requiring any such slave that reconnects to kill its 
 tasks and re-register with a new slaveId. On the other side, after a similar 
 timeout, the slave will consider the master lost and try to detect a new 
 master. These timeouts are currently hardcoded constants (5 * 15s), which may 
 not be well-suited for all scenarios.
 - Some clusters may tolerate a longer slave process restart period, and 
 wouldn't want tasks to be killed upon reconnect.
 - Some clusters may have higher-latency networks (e.g. cross-datacenter, or 
 for volunteer computing efforts), and would like to tolerate longer periods 
 without communication.
 We should provide flags/mechanisms on the master to control its tolerance for 
 non-communicative slaves, and (less importantly?) on the slave to tolerate 
 missing masters.
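 For illustration only, the kind of knobs being proposed (flag names and 
 defaults here are placeholders, not the final ones):
 {code}
 #include <chrono>

 struct MasterFlags {
   std::chrono::seconds slave_ping_timeout{15};  // currently hardcoded
   unsigned int max_slave_ping_timeouts{5};      // currently hardcoded
 };

 // A slave is considered lost after this much silence.
 std::chrono::seconds slaveRemovalTimeout(const MasterFlags& flags)
 {
   return flags.slave_ping_timeout * flags.max_slave_ping_timeouts;
 }
 {code}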



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2160) Add support for allocator modules

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2160:
--
Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, 
Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 
Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Add support for allocator modules
 -

 Key: MESOS-2160
 URL: https://issues.apache.org/jira/browse/MESOS-2160
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen
Assignee: Alexander Rukletsov
  Labels: mesosphere

 Currently Mesos supports only the DRF allocator; changing it requires hacking 
 the Mesos source code, which, in turn, sets a high entry barrier. Allocator 
 modules make it easy to tweak the resource allocation policy. This will enable 
 swapping allocation policies without having to edit the Mesos source code. 
 Custom allocators may be written by anybody and do not need to be distributed 
 together with Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2074) Fetcher cache test fixture

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2074:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Fetcher cache test fixture
 --

 Key: MESOS-2074
 URL: https://issues.apache.org/jira/browse/MESOS-2074
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
   Original Estimate: 72h
  Remaining Estimate: 72h

 To accelerate providing good test coverage for the fetcher cache (MESOS-336), 
 we can provide a framework that canonicalizes creating and running a number 
 of tasks and allows easy parametrization with combinations of the following:
 - whether to cache or not
 - whether to make what has been downloaded executable or not
 - whether to extract from an archive or not
 - whether to download from a file system, http, or...
 We can create a simple HTTP server in the test fixture to support the latter.
 Furthermore, the tests need to be robust wrt. varying numbers of StatusUpdate 
 messages. An accumulating update message sink that reports the final state is 
 needed.
 All this has already been programmed in this patch, just needs to be rebased:
 https://reviews.apache.org/r/21316/
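 A sketch of the parameter space described by the list above (names purely 
 illustrative):
 {code}
 // Each test case is one combination of these parameters.
 struct FetchParams {
   bool cache;       // whether to cache or not
   bool executable;  // whether to make the downloaded file executable
   bool extract;     // whether to treat the download as an archive and extract
   enum Source { FILESYSTEM, HTTP } source;  // where the URI points
 };
 {code}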



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2165) When cyrus sasl MD5 isn't installed configure passes, tests fail without any output

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2165:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 When cyrus sasl MD5 isn't installed configure passes, tests fail without any 
 output
 ---

 Key: MESOS-2165
 URL: https://issues.apache.org/jira/browse/MESOS-2165
 Project: Mesos
  Issue Type: Bug
Reporter: Cody Maloney
Assignee: Till Toenshoff
  Labels: mesosphere

 Sample Dockerfile to make such a host:
 {code}
 FROM centos:centos7
 RUN yum install -y epel-release gcc python-devel
 RUN yum install -y python-pip
 RUN yum install -y rpm-build redhat-rpm-config autoconf make gcc gcc-c++ 
 patch libtool git python-devel ruby-devel java-1.7.0-openjdk-devel zlib-devel 
 libcurl-devel openssl-devel cyrus-sasl-devel rubygems apr-devel 
 apr-util-devel subversion-devel maven libselinux-python
 {code}
 Use: 'docker run -i -t imagename /bin/bash' to run the image, get a shell 
 inside where you can 'git clone' mesos and build/run the tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2072) Fetcher cache eviction

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2072:
--
Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, 
Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 
Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Fetcher cache eviction
 --

 Key: MESOS-2072
 URL: https://issues.apache.org/jira/browse/MESOS-2072
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
   Original Estimate: 336h
  Remaining Estimate: 336h

 Delete files from the fetcher cache so that a given cache size is never 
 exceeded. Succeed in doing so while concurrent downloads are on their way and 
 new requests are pouring in.
 Idea: measure the size of each download before it begins, make enough room 
 before the download. This means that only download mechanisms that divulge 
 the size before the main download will be supported. AFAWK, those in use so 
 far have this property. 
 The calculation of how much space to free needs to be under concurrency 
 control, accumulating all space needed for competing, incomplete download 
 requests. (The Python script that performs fetcher caching for Aurora does 
 not seem to implement this. See 
 https://gist.github.com/zmanji/f41df77510ef9d00265a, imagine several of these 
 programs running concurrently, each one's _cache_eviction() call succeeding, 
 each perceiving the SAME free space being available.)
 Ultimately, a conflict resolution strategy is needed if just the downloads 
 underway already exceed the cache capacity. Then, as a fallback, direct 
 download into the work directory will be used for some tasks. TBD how to pick 
 which task gets treated how. 
 At first, only support copying of any downloaded files to the work directory 
 for task execution. This isolates the task life cycle after starting a task 
 from cache eviction considerations. 
 (Later, we can add symbolic links that avoid copying. But then eviction of 
 fetched files used by ongoing tasks must be blocked, which adds complexity. 
 Another future extension is MESOS-1667, "Extract from URI while downloading 
 into work dir".)
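 A sketch of the accounting point made above (placeholder names, not the 
 actual fetcher code): reservations from all incomplete downloads are summed, 
 so concurrent fetches cannot both count the same free bytes.
 {code}
 #include <cstdint>
 #include <stdexcept>

 class CacheSpace {
 public:
   explicit CacheSpace(uint64_t capacityBytes) : capacity_(capacityBytes) {}

   // Reserve room for a download whose size is known up front. Intended to be
   // called from a single actor, so no locking is shown here.
   void reserve(uint64_t bytes)
   {
     if (reserved_ + bytes > capacity_) {
       // The caller would evict cached files first, or fall back to a direct
       // download into the work directory.
       throw std::runtime_error("would exceed cache capacity");
     }
     reserved_ += bytes;
   }

   void release(uint64_t bytes) { reserved_ -= bytes; }

 private:
   const uint64_t capacity_;
   uint64_t reserved_ = 0;  // sum over all incomplete downloads
 };
 {code}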



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2351) Enable label and environment decorators (hooks) to remove label and environment entries

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2351:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Enable label and environment decorators (hooks) to remove label and 
 environment entries
 ---

 Key: MESOS-2351
 URL: https://issues.apache.org/jira/browse/MESOS-2351
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen
Assignee: Niklas Quarfot Nielsen

 We need to change the semantics of decorators to be able to not only add 
 labels and environment variables, but also remove them.
 The change is fairly small. The hook manager (and call site) use CopyFrom 
 instead of MergeFrom and hook implementors pass on the labels and environment 
 from task and executor commands respectively.
 In the future, we can tag labels such that only labels belonging to a hook 
 type (across master and slave) can be inspected and changed. For now, the 
 active hooks are selected by the operator and can therefore be trusted.
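 To illustrate why the CopyFrom/MergeFrom distinction matters (standard 
 protobuf semantics; the function below is only a sketch): MergeFrom appends 
 to repeated fields such as labels, so a hook could only ever add entries, 
 whereas CopyFrom replaces the whole message, so entries the hook leaves out 
 are removed.
 {code}
 #include <mesos/mesos.hpp>

 void applyHookResult(mesos::Labels* taskLabels, const mesos::Labels& fromHook)
 {
   // Old semantics: taskLabels->MergeFrom(fromHook);  // could only add.
   taskLabels->CopyFrom(fromHook);  // The hook now controls the full set.
 }
 {code}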



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2500) Doxygen setup for libprocess

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2500:
--
Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: 
Mesosphere Q1 Sprint 5 - 3/20)

 Doxygen setup for libprocess
 

 Key: MESOS-2500
 URL: https://issues.apache.org/jira/browse/MESOS-2500
 Project: Mesos
  Issue Type: Documentation
  Components: libprocess
Reporter: Bernd Mathiske
Assignee: Joerg Schad
   Original Estimate: 48h
  Remaining Estimate: 48h

 Goals: 
 - Initial doxygen setup. 
 - Enable interested developers to generate already available doxygen content 
 locally in their workspace and view it.
 - Form the basis for future contributions of more doxygen content.
 1. Devise a way to use Doxygen with Mesos source code. (For example, solve 
 this by adding optional brew/apt-get installation to the Getting Started 
 doc.)
 2. Create a make target for libprocess documentation that can be manually 
 triggered.
 3. Create initial library top level documentation.
 4. Enhance one header file with Doxygen. Make sure the generated output has 
 all necessary links to navigate from the lib to the file and back, etc.
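 An example of the kind of Doxygen markup goal 4 refers to (an illustrative 
 declaration, not an existing libprocess one):
 {code}
 /**
  * @brief Decides whether another retry should be attempted.
  *
  * @param attempts Number of retries already performed.
  * @return True if the caller should retry, false otherwise.
  */
 bool shouldRetry(int attempts);
 {code}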



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2333) Securing Sandboxes via Filebrowser Access Control

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2333:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Securing Sandboxes via Filebrowser Access Control
 -

 Key: MESOS-2333
 URL: https://issues.apache.org/jira/browse/MESOS-2333
 Project: Mesos
  Issue Type: Improvement
  Components: security
Reporter: Adam B
Assignee: Alexander Rojas
  Labels: authorization, filebrowser, mesosphere, security

 As it stands now, anybody with access to the master or slave web UI can use 
 the filebrowser to view the contents of any attached/mounted paths on the 
 master or slave. Currently, the attached paths include master and slave logs 
 as well as executor/task sandboxes. While there's a chance that the master 
 and slave logs could contain sensitive information, it's much more likely 
 that sandboxes could contain customer data or other files that should not be 
 globally accessible. Securing the sandboxes is the primary goal of this 
 ticket.
 There are four filebrowser endpoints: browse, read, download, and debug. Here 
 are some potential solutions.
 1) We could easily provide flags that globally enable/disable each endpoint, 
 allowing coarse-grained access control. This might be a reasonable 
 short-term plan. We would also want to update the web UIs to display an 
 Access Denied error, rather than showing links that open up blank pailers.
 2) Each master and slave handles its own authn/authz. Slaves will need to have 
 an authenticator, and there must be a way to provide each node with 
 credentials and ACLs, and keep these in sync across the cluster.
 3) Filter all slave communications through the master(s), which already has 
 credentials and ACLs. We'll have to restrict access to the filebrowser (and 
 other?) endpoints to the (leading?) master. Then the master can perform the 
 authentication and authorization, only passing the request on to the slave if 
 auth succeeds.
 3a) The slave returns the browse/read/download response back through the 
 master. This could be a network bottleneck.
 3b) Upon authn/z success, the master redirects the request to the appropriate 
 slave, which will send the response directly back to the requester.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2372) Test suite for verifying compatibility between Mesos components

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2372:
--
Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: 
Mesosphere Q1 Sprint 5 - 3/20)

 Test suite for verifying compatibility between Mesos components
 ---

 Key: MESOS-2372
 URL: https://issues.apache.org/jira/browse/MESOS-2372
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Niklas Quarfot Nielsen

 While our current unit/integration test suite catches functional bugs, it 
 doesn't catch compatibility bugs (e.g, MESOS-2371). This is really crucial to 
 provide operators the ability to do seamless upgrades on live clusters.
 We should have a test suite / framework (ideally running on CI vetting each 
 review on RB) that tests upgrade paths between master, slave, scheduler and 
 executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2215:
--
Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere 
Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 The Docker containerizer attempts to recover any task when checkpointing is 
 enabled, not just docker tasks.
 ---

 Key: MESOS-2215
 URL: https://issues.apache.org/jira/browse/MESOS-2215
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 0.21.0
Reporter: Steve Niemitz
Assignee: Timothy Chen

 Once the slave restarts and recovers the task, I see this error in the log 
 for all tasks that were recovered every second or so.  Note, these were NOT 
 docker tasks:
 W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
 for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
 thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
  of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
 inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
 with status 1 stderr = Error: No such image or container: 
 mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
 However the tasks themselves are still healthy and running.
 The slave was launched with --containerizers=mesos,docker
 -
 More info: it looks like the docker containerizer is a little too ambitious 
 about recovering containers, again this was not a docker task:
 I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
 '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
  of framework 20150109-161713-715350282-5050-290797-
 Looking into the source, it looks like the problem is that the 
 ComposingContainerizer runs recover in parallel, but neither the docker 
 containerizer nor the mesos containerizer checks whether it should recover the 
 task or not (i.e., whether it was the one that launched it).  Perhaps this 
 needs to be written into the checkpoint somewhere?
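 One possible direction for the last point, sketched with placeholder names: 
 checkpoint which containerizer launched each container, and have each 
 containerizer skip containers it does not own during recovery.
 {code}
 #include <string>

 struct CheckpointedContainer {
   std::string containerId;
   std::string containerizer;  // e.g. "docker" or "mesos"
 };

 // The docker containerizer would then skip containers launched by the mesos
 // containerizer (and vice versa) instead of trying to `docker inspect` them.
 bool shouldRecover(const CheckpointedContainer& container,
                    const std::string& self)
 {
   return container.containerizer == self;
 }
 {code}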



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2161) AbstractState JNI check fails for Marathon framework

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2161:
--
Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: 
Mesosphere Q1 Sprint 5 - 3/20)

 AbstractState JNI check fails for Marathon framework
 

 Key: MESOS-2161
 URL: https://issues.apache.org/jira/browse/MESOS-2161
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.21.0
 Environment: Mesos 0.21.0
 Marathon 0.7.5
 Fedora 20
Reporter: Matthew Sanders
Assignee: Joris Van Remoortere
 Attachments: mesos_core_dump_gdb.txt.bz2


 I've recently upgraded to mesos 0.21.0 and now it seems that every few 
 minutes or so I see the following error, which kills marathon. 
 Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
 18:12:42,064] INFO 10.133.128.26 -  -  [26/Nov/2014:00:12:41 +] GET 
 /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (X11; Linux 
 x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 
 (mesosphere.chaos.http.ChaosRequestLog:15)
 Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
 18:12:42,238] INFO 10.133.128.26 -  -  [26/Nov/2014:00:12:42 +] GET 
 /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (X11; 
 Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 
 (mesosphere.chaos.http.ChaosRequestLog:15)
 Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
 18:12:42,961] INFO 10.192.221.95 -  -  [26/Nov/2014:00:12:42 +] GET 
 /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (Macintosh; 
 Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) 
 Chrome/39.0.2171.65 Safari/537...
 Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 
 18:12:43,032] INFO 10.192.221.95 -  -  [26/Nov/2014:00:12:42 +] GET 
 /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 
 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) 
 Chrome/39.0.2171.65 Safari...
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 
 18:12:44.146260  5897 check.hpp:79] Check failed: f.isReady()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check 
 failure stack trace: ***
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a2b17c  google::LogMessage::Fail()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a2b0d5  google::LogMessage::SendToLog()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a2aab3  google::LogMessage::Flush()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a2da3b  google::LogMessageFatal::~LogMessageFatal()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a1ea64  _checkReady()
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f8176a1d43b  Java_org_apache_mesos_state_AbstractState__1_1names_1get
 Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 
 0x7f81f644ca70  (unknown)
 Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: 
 main process exited, code=killed, status=6/ABRT
 Here's the command that mesos-master is being run with
 /usr/local/sbin/mesos-master 
 --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
  --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos
 Here's the command that the slave is running with:
 /usr/local/sbin/mesos-slave 
 --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos
  --log_dir=/var/log/mesos --containerizers=docker,mesos 
 --executor_registration_timeout=5mins 
 --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago
 I realize this could also be filed to marathon, but it sort of looks like a 
 c++ issue to me, which is why I came here to post this. Any help would be 
 greatly appreciated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2337) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2337:
--
Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, 
Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20)

 __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos
 --

 Key: MESOS-2337
 URL: https://issues.apache.org/jira/browse/MESOS-2337
 Project: Mesos
  Issue Type: Bug
  Components: build, python api
Reporter: Kapil Arya
Assignee: Kapil Arya
Priority: Blocker

 When doing a make install, the src/python/native/src/mesos/__init__.py file 
 is not getting installed in ${PREFIX}/lib/pythonX.Y/site-packages/mesos/.  
 This makes it impossible to do the following import when PYTHONPATH is set to 
 the site-packages directory.
 {code}
 import mesos.interface.mesos_pb2
 {code}
 The directories 
 `${PREFIX}/lib/pythonX.Y/site-packages/mesos/{interface,native}/` do have 
 their corresponding `__init__.py` files.
 Reproducing the bug:
 ../configure --prefix=$HOME/test-install && make install



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2016:
--
Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, 
Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 
Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, 
Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere 
Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 
2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20)

 docker_name_prefix is too generic
 -

 Key: MESOS-2016
 URL: https://issues.apache.org/jira/browse/MESOS-2016
 Project: Mesos
  Issue Type: Bug
Reporter: Jay Buffington
Assignee: Timothy Chen

 From docker.hpp and docker.cpp:
 {code}
 // Prefix used to name Docker containers in order to distinguish those
 // created by Mesos from those created manually.
 extern std::string DOCKER_NAME_PREFIX;
 // TODO(benh): At some point to run multiple slaves we'll need to make
 // the Docker container name creation include the slave ID.
 string DOCKER_NAME_PREFIX = "mesos-";
 {code}
 This name is too generic.  A common pattern in docker land is to run 
 everything in a container and use volume mounts to share sockets and do RPC 
 between containers.  CoreOS has popularized this technique. 
 Inevitably, what people do is start a container named "mesos-slave", which 
 runs the docker containerizer recovery code, which removes all containers that 
 start with "mesos-". And then they ask "huh, why did my mesos-slave docker 
 container die? I don't see any error messages..."
 Ideally, we should do what Ben suggested and add the slave id to the name 
 prefix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2157) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2157:
--
Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, 
Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 
Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints
 

 Key: MESOS-2157
 URL: https://issues.apache.org/jira/browse/MESOS-2157
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Niklas Quarfot Nielsen
Assignee: Alexander Rojas
Priority: Trivial
  Labels: mesosphere, newbie

 master/state.json exports the entire state of the cluster and can, for large 
 clusters, become massive (tens of megabytes of JSON).
 Often, a client only needs information about subsets of the entire state, for 
 example all connected slaves, or information (registration info, tasks, etc) 
 belonging to a particular framework.
 We can partition state.json into many smaller endpoints, but for starters, 
 being able to get slave information and tasks information per framework would 
 be useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2205) Add user documentation for reservations

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2205:
--
Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  (was: 
Mesosphere Q1 Sprint 5 - 3/20)

 Add user documentation for reservations
 ---

 Key: MESOS-2205
 URL: https://issues.apache.org/jira/browse/MESOS-2205
 Project: Mesos
  Issue Type: Documentation
  Components: documentation, framework
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 Add a user guide for reservations which describes basic usage of them, how 
 ACLs are used to specify who can unreserve whose resources, and few advanced 
 usage cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2115) Improve recovering Docker containers when slave is contained

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2115:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Improve recovering Docker containers when slave is contained
 

 Key: MESOS-2115
 URL: https://issues.apache.org/jira/browse/MESOS-2115
 Project: Mesos
  Issue Type: Epic
  Components: docker
Reporter: Timothy Chen
Assignee: Timothy Chen
  Labels: docker

 Currently, when the docker containerizer is recovering, it checks the 
 checkpointed executor pids to determine which containers are still running, 
 and removes the rest of the containers in docker ps that aren't recognized.
 This is problematic when the slave itself was in a docker container, as when 
 the slave container dies all the forked processes are removed as well, so the 
 checkpointed executor pids are no longer valid.
 We have to assume the docker containers might be still running even though 
 the checkpointed executor pids are not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1831) Master should send PingSlaveMessage instead of PING

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1831:
--
Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, 
Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 
Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 
2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20)

 Master should send PingSlaveMessage instead of PING
 -

 Key: MESOS-1831
 URL: https://issues.apache.org/jira/browse/MESOS-1831
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Adam B
  Labels: mesosphere

 In 0.21.0 master sends PING message with an embedded PingSlaveMessage for 
 backwards compatibility (https://reviews.apache.org/r/25867/).
 In 0.22.0, master should send PingSlaveMessage directly instead of PING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2375) Remove the checkpoint variable entirely from slave/flags.hpp

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2375:
--
Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, 
Mesosphere Q1 Sprint 6 - 4/3  (was: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 
Sprint 5 - 3/20)

 Remove the checkpoint variable entirely from slave/flags.hpp
 

 Key: MESOS-2375
 URL: https://issues.apache.org/jira/browse/MESOS-2375
 Project: Mesos
  Issue Type: Task
Reporter: Joerg Schad
Assignee: Joerg Schad
  Labels: checkpoint, mesosphere





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2108) Add configure flag or environment variable to enable SSL/libevent Socket

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2108:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Add configure flag or environment variable to enable SSL/libevent Socket
 

 Key: MESOS-2108
 URL: https://issues.apache.org/jira/browse/MESOS-2108
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen
Assignee: Joris Van Remoortere





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2085) Add support encrypted and non-encrypted communication in parallel for cluster upgrade

2015-03-20 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2085:
--
Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, 
Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 
Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3  
(was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere 
Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, 
Mesosphere Q1 Sprint 5 - 3/20)

 Add support encrypted and non-encrypted communication in parallel for cluster 
 upgrade
 -

 Key: MESOS-2085
 URL: https://issues.apache.org/jira/browse/MESOS-2085
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen
Assignee: Joris Van Remoortere

 During a cluster upgrade from non-encrypted to encrypted communication, we 
 need to support an interim period where:
 1) A master can have connections to both encrypted and non-encrypted slaves
 2) A slave that supports encrypted communication connects to a master that 
 has not yet been upgraded.
 3) Frameworks are encrypted but the master has not been upgraded yet.
 4) Master has been upgraded but frameworks haven't.
 5) A slave process has upgraded but running executor processes haven't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2524) Mesos-containerizer not linked from main documentation page.

2015-03-20 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371797#comment-14371797
 ] 

James Peach commented on MESOS-2524:


FWIW, I'm happy to make this change in order to learn the contribution process 
;)

 Mesos-containerizer not linked from main documentation page.
 

 Key: MESOS-2524
 URL: https://issues.apache.org/jira/browse/MESOS-2524
 Project: Mesos
  Issue Type: Documentation
Reporter: Joerg Schad
Assignee: Joerg Schad
Priority: Minor
   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 Is there any reason that the mesos-containerizer 
 (http://mesos.apache.org/documentation/latest/mesos-containerizer/) 
 documentation is not linked from the main documentation page 
 (http://mesos.apache.org/documentation/latest/)? Both docker and external 
 containerizer pages are linked from here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371800#comment-14371800
 ] 

Jie Yu commented on MESOS-2367:
---

[~idownes] Vinod and I chatted about this, and he proposed a solution I think 
is pretty clean:

1) We modify the launcher recover interface to return a list of container IDs 
that it believes are orphans.
2) Isolators do not clean up orphans during recovery; they simply recover them.
3) The Mesos containerizer will take the list of orphans from launcher recovery 
and destroy them explicitly.

That matches the logic in the steady state. It's still a non-trivial task 
because we need to modify the recovery path for each isolator. [~idownes], what 
do you think?
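A rough sketch of points 1-3, with simplified types standing in for the real 
Future/hashset/ContainerID types:
{code}
#include <list>
#include <set>
#include <string>

typedef std::string ContainerID;

class Launcher {
public:
  virtual ~Launcher() {}

  // Recovers known containers and returns the IDs of containers found on the
  // host that have no known executor (the orphans).
  virtual std::set<ContainerID> recover(
      const std::list<ContainerID>& known) = 0;

  virtual void destroy(const ContainerID& containerId) = 0;
};

// The containerizer, not the isolators, destroys the orphans explicitly,
// using the same destroy path as in the steady state.
void recoverContainers(Launcher& launcher, const std::list<ContainerID>& known)
{
  for (const ContainerID& orphan : launcher.recover(known)) {
    launcher.destroy(orphan);
  }
}
{code}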

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371855#comment-14371855
 ] 

Ian Downes commented on MESOS-2367:
---

Perhaps you're suggesting the containerizer has the logic (in this particular 
case) of increasing the cpu and then trying to destroy the container?

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371846#comment-14371846
 ] 

Ian Downes edited comment on MESOS-2367 at 3/20/15 6:46 PM:


This is similar to what I'm proposing but avoids the real issue of how to 
handle orphans that cannot be destroyed? i.e., what does the containerizer do 
with the orphans: (3) says it destroys them but this ultimately calls the same 
code that's failing to destroy a container now?


was (Author: idownes):
This is similar to what I'm proposing but skirts the real issue of how to 
handle orphans that cannot be destroyed? i.e., what does the containerizer do 
with the orphans: (3) says it destroys them but this ultimately calls the same 
code that's failing to destroy a container now?

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371846#comment-14371846
 ] 

Ian Downes commented on MESOS-2367:
---

This is similar to what I'm proposing but skirts the real issue of how to 
handle orphans that cannot be destroyed? i.e., what does the containerizer do 
with the orphans: (3) says it destroys them but this ultimately calls the same 
code that's failing to destroy a container now?

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371864#comment-14371864
 ] 

Jie Yu commented on MESOS-2367:
---

[~idownes] I think having orphans that we cannot clean up is better than a 
flapping slave. We can address the issue of how to clean up those orphans 
reliably later.

Yeah, I am more comfortable letting the containerizer increase the cpu because 
it knows which isolators and launcher are used.
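
For illustration, a minimal sketch (not Mesos source; the function name, values, and path layout are assumptions, and it presumes cgroups v1 with the cpu subsystem mounted under /sys/fs/cgroup/cpu) of the kind of quota bump the containerizer could apply to a stuck container before retrying the destroy:

{code:title=Hypothetical sketch: raising a stuck cgroup's cpu quota|borderStyle=solid}
#include <fstream>
#include <string>

// Raise cpu.cfs_quota_us for the given cgroup (e.g. "/mesos/<containerId>")
// so throttled processes can finish exiting before destroy is retried.
// With the default 100000us period, a quota of 100000 is one full CPU.
bool raiseCpuQuota(const std::string& cgroup, long quotaUs) {
  std::ofstream file("/sys/fs/cgroup/cpu" + cgroup + "/cpu.cfs_quota_us");
  if (!file.is_open()) {
    return false;
  }
  file << quotaUs << std::endl;
  return static_cast<bool>(file);
}
{code}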

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2526) MESOS_master in mesos-slave-env.sh is not work well by mesos-slave

2015-03-20 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371693#comment-14371693
 ] 

Littlestar edited comment on MESOS-2526 at 3/20/15 5:32 PM:


I changed /usr/sbin/mesos-daemon.sh; now it shows the detailed message in the console.
{noformat}

#nohup ${exec_prefix}/sbin/${PROGRAM} ${@} </dev/null >/dev/null 2>&1 &
${exec_prefix}/sbin/${PROGRAM} ${@}

Failed to create a containerizer: Could not create MesosContainerizer: Could 
not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: 
Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create 
directory '/sys/fs/cgroup/cpu': No such file or directory

{noformat}



was (Author: cnstar9988):
I changed /usr/sbin/mesos-daemon.sh; now it shows the detailed message in the console.
#nohup ${exec_prefix}/sbin/${PROGRAM} ${@} </dev/null >/dev/null 2>&1 &
${exec_prefix}/sbin/${PROGRAM} ${@}

Failed to create a containerizer: Could not create MesosContainerizer: Could 
not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: 
Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create 
directory '/sys/fs/cgroup/cpu': No such file or directory

 MESOS_master in  mesos-slave-env.sh is not work well by mesos-slave
 ---

 Key: MESOS-2526
 URL: https://issues.apache.org/jira/browse/MESOS-2526
 Project: Mesos
  Issue Type: Bug
  Components: framework
Affects Versions: 0.21.1
Reporter: Littlestar
Priority: Minor

 mesos start-cluster.sh
 The master node starts OK, but the slave node fails to start.
 The slave node has log messages, but nothing seems to fail.
 No mesos process is running on the slave node.
 ==
 The masters and slaves files are OK.
 The following works OK:
 export MESOS_master=mymaster:5050
 mesos-slave
 Why does it fail when set in mesos-slave-env.sh?
 Thanks
 -bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh
 {noformat}
 # This file contains environment variables that are passed to mesos-slave.
 # To get a description of all options run mesos-slave --help; any option
 # supported as a command-line option is also supported as an environment
 # variable.
 # You must at least set MESOS_master.
 # The mesos master URL to contact. Should be host:port for
 # non-ZooKeeper based masters, otherwise a zk:// or file:// URL.
 export MESOS_master=mymaster:5050
 # Other options you're likely to want to set:
 export MESOS_log_dir=/var/log/mesos
 export MESOS_work_dir=/var/run/mesos
 export MESOS_isolation=cgroups
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371702#comment-14371702
 ] 

Jie Yu commented on MESOS-2367:
---

I agree that we should not change the contract between launcher and isolators. 
In other words, isolators should not perform cleanup if there are still 
processes running. Otherwise, it'll create more complex issues that will be 
very hard to triage. For instance, if the network isolator cleans up the 
virtual links before processes running inside the container are killed, we 
could run into issues later when a new task tries to connect to the slave using 
the same local port (we've seen that before and it's painful to debug).

[~idownes], I don't quite understand your proposal. Correct me if I'm wrong:

{quote} Make orphan clean up failures non-fatal so the slave will start and we 
gain control over running tasks. {quote}

Are you saying that we don't fail the slave when the launcher fails to clean up 
orphans? Well, that means we still need to check the implementation (i.e., the 
recovery path) of each isolator to make sure they don't clean up orphans if the 
launcher recovery does not report full success. The isolators need to maintain 
data structures for those orphans so that the resources used by those orphans 
are not allocated to new containers. For instance, the network isolator should 
not allocate the ephemeral port ranges used by those orphans to new containers. 
Well, I think that's doable and we just need to be careful.

Once we've done the above, I don't think we need 3 anymore, right?
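
As a concrete illustration, here is a minimal sketch (not Mesos source; the class name, {{ContainerRunState}}, and the field names are invented) of how an isolator's recovery path could record orphans whose cleanup failed and keep their resources, such as ephemeral port ranges, out of the allocatable pool:

{code:title=Hypothetical sketch: tracking orphan resources during isolator recovery|borderStyle=solid}
#include <map>
#include <set>
#include <string>
#include <utility>

struct ContainerRunState {
  std::string containerId;
  bool cleanupFailed;      // launcher (or isolator) could not destroy it
  unsigned int portBegin;  // ephemeral port range still held by the orphan
  unsigned int portEnd;
};

class HypotheticalNetworkIsolator {
public:
  // Called during slave recovery with the container states discovered on disk.
  void recover(const std::map<std::string, ContainerRunState>& states) {
    for (const auto& entry : states) {
      const ContainerRunState& state = entry.second;
      if (state.cleanupFailed) {
        // Keep the orphan's resources reserved so new containers are never
        // offered the same ephemeral port range.
        orphans.insert(state.containerId);
        reservedRanges[state.containerId] =
            std::make_pair(state.portBegin, state.portEnd);
      }
    }
  }

private:
  std::set<std::string> orphans;
  std::map<std::string, std::pair<unsigned int, unsigned int>> reservedRanges;
};
{code}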

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2527) Add default bind to socket

2015-03-20 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371711#comment-14371711
 ] 

Joris Van Remoortere commented on MESOS-2527:
-

https://reviews.apache.org/r/28485/

 Add default bind to socket
 --

 Key: MESOS-2527
 URL: https://issues.apache.org/jira/browse/MESOS-2527
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
Priority: Minor
  Labels: libprocess
 Fix For: 0.23.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2526) MESOS_master in mesos-slave-env.sh is not work well by mesos-slave

2015-03-20 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371693#comment-14371693
 ] 

Littlestar commented on MESOS-2526:
---

I changed /usr/sbin/mesos-daemon.sh; now it shows the detailed message in the console.
#nohup ${exec_prefix}/sbin/${PROGRAM} ${@} </dev/null >/dev/null 2>&1 &
${exec_prefix}/sbin/${PROGRAM} ${@}

Failed to create a containerizer: Could not create MesosContainerizer: Could 
not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: 
Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create 
directory '/sys/fs/cgroup/cpu': No such file or directory

 MESOS_master in  mesos-slave-env.sh is not work well by mesos-slave
 ---

 Key: MESOS-2526
 URL: https://issues.apache.org/jira/browse/MESOS-2526
 Project: Mesos
  Issue Type: Bug
  Components: framework
Affects Versions: 0.21.1
Reporter: Littlestar
Priority: Minor

 mesos start-cluster.sh
 The master node starts OK, but the slave node fails to start.
 The slave node has log messages, but nothing seems to fail.
 No mesos process is running on the slave node.
 ==
 The masters and slaves files are OK.
 The following works OK:
 export MESOS_master=mymaster:5050
 mesos-slave
 Why does it fail when set in mesos-slave-env.sh?
 Thanks
 -bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh
 {noformat}
 # This file contains environment variables that are passed to mesos-slave.
 # To get a description of all options run mesos-slave --help; any option
 # supported as a command-line option is also supported as an environment
 # variable.
 # You must at least set MESOS_master.
 # The mesos master URL to contact. Should be host:port for
 # non-ZooKeeper based masters, otherwise a zk:// or file:// URL.
 export MESOS_master=mymaster:5050
 # Other options you're likely to want to set:
 export MESOS_log_dir=/var/log/mesos
 export MESOS_work_dir=/var/run/mesos
 export MESOS_isolation=cgroups
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371768#comment-14371768
 ] 

Ian Downes commented on MESOS-2367:
---

The difficulty here is that the containerizer builds a list of executors that 
it believes should be running, and then the launcher and each of the isolators 
independently clean up any other state they discover. Success is binary: they 
either recover everything and clean up everything, or they fail.

Instead, I think the launcher should provide more information about what it was 
able to do:
a) attempt recovery of containers assumed to be running and report those it 
could not recover
b) attempt clean up of other containers and report those it could not clean up

Isolators should do a similar thing, as you've described. The critical part 
we're missing is that if clean up fails (including if the launcher could not 
clean up the container) the isolator should account for the resources of that 
container, e.g., for things like ephemeral port ranges.

After recovery the containerizer should know about running containers that were 
recovered correctly and also containers that have not been fully destroyed (the 
union of those from the launcher + isolators). This would need to be handled by 
the slave.

The key point is that an incomplete clean up of a container is not inherently 
bad, so long as we know about it and we can account for the resources it holds. 
Furthermore, we must make this visible with appropriate counters.
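
A minimal sketch of points (a) and (b) above (not Mesos source; the struct and member names are invented for illustration):

{code:title=Hypothetical sketch: a launcher recovery result reporting partial success|borderStyle=solid}
#include <set>
#include <string>

struct LauncherRecoveryResult {
  // (a) Containers assumed to be running: recovered vs. not recoverable.
  std::set<std::string> recovered;
  std::set<std::string> unrecoverable;

  // (b) Other (orphan) containers: cleaned up vs. could not be cleaned up.
  std::set<std::string> cleanedUp;
  std::set<std::string> cleanupFailed;
};
{code}

The containerizer could then merge the cleanupFailed sets reported by the launcher and each isolator to track which containers still hold resources, rather than treating recovery as all-or-nothing.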


 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2526) MESOS_master in mesos-slave-env.sh is not work well by mesos-slave

2015-03-20 Thread Littlestar (JIRA)
Littlestar created MESOS-2526:
-

 Summary: MESOS_master in  mesos-slave-env.sh is not work well by 
mesos-slave
 Key: MESOS-2526
 URL: https://issues.apache.org/jira/browse/MESOS-2526
 Project: Mesos
  Issue Type: Bug
  Components: framework
Affects Versions: 0.21.1
Reporter: Littlestar
Priority: Minor


mesos start-cluster.sh

The master node starts OK, but the slave node fails to start.
The slave node has log messages, but nothing seems to fail.
No mesos process is running on the slave node.
==
The following works OK:
export MESOS_master=mymaster:5050
mesos-slave

Why does it fail when set in mesos-slave-env.sh?
Thanks

-bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh

{noformat}

# This file contains environment variables that are passed to mesos-slave.
# To get a description of all options run mesos-slave --help; any option
# supported as a command-line option is also supported as an environment
# variable.

# You must at least set MESOS_master.

# The mesos master URL to contact. Should be host:port for
# non-ZooKeeper based masters, otherwise a zk:// or file:// URL.
export MESOS_master=mymaster:5050

# Other options you're likely to want to set:
export MESOS_log_dir=/var/log/mesos
export MESOS_work_dir=/var/run/mesos
export MESOS_isolation=cgroups
{noformat}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2527) Add default bind to socket

2015-03-20 Thread Joris Van Remoortere (JIRA)
Joris Van Remoortere created MESOS-2527:
---

 Summary: Add default bind to socket
 Key: MESOS-2527
 URL: https://issues.apache.org/jira/browse/MESOS-2527
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
Priority: Minor
 Fix For: 0.23.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2523) Executor directory has incorrect permissions

2015-03-20 Thread Michael Ngo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371556#comment-14371556
 ] 

Michael Ngo commented on MESOS-2523:


[~vinodkone] I narrowed it down by doing a test before and after this commit.  
The regression was introduced after this commit.

 Executor directory has incorrect permissions
 

 Key: MESOS-2523
 URL: https://issues.apache.org/jira/browse/MESOS-2523
 Project: Mesos
  Issue Type: Bug
Reporter: Michael Ngo
 Attachments: CorrectPermissions.png, WrongPermissions.png


 Currently my setup involves a mesos master on one node (nodeMM) and a mesos 
 slave on another node (nodeMS).  NodeMM runs the mesos-master process as the 
 flxjob user.  The framework (Chronos) attached to nodeMM submits tasks as 
 the flxjob user.  NodeMS runs the mesos-slave process as root because 
 cgroups are being used.
 What's expected to happen is that the task will be executed by flxjob and 
 that the directory in which code is executed is also owned by flxjob. What 
 actually happens is that the task is executed by flxjob, but the directory 
 in which code is executed is owned by root.
 Here are the arguments used by each process.
 * Master
 {noformat}/usr/local/sbin/mesos-master --cluster=Mesos HA Cluster 
 --log_dir=/var/log/mesos/master --work_dir=/var/lib/mesos/master 
 --zk=zk://172.16.3.70:2181/mesos --hostname=ip-172-16-15-74 --quorum=1 
 --zk_session_timeout=10secs --no-root_submissions{noformat}
 * Slave
 {noformat}/usr/local/sbin/mesos-slave --log_dir=/var/log/mesos/slave 
 --work_dir=/var/lib/mesos/slave --master=zk://172.16.3.70:2181/mesos 
 --hostname=172.16.3.215 --ip=172.16.3.215 --cgroups_enable_cfs 
 --cgroups_hierarchy=/cgroup --isolation=cgroups/cpu,cgroups/mem 
 --cgroups_limit_swap{noformat}
 Here is the output of the id command for the user identity.  
 Both the working (expected) and non-working scenarios yield the same output.
 {noformat}
 uid=501(flxjob) gid=501(flxjob) groups=501(flxjob),0(root)
 {noformat}
 I narrowed down where the issue was introduced.  It was introduced by [this 
 commit|https://github.com/apache/mesos/commit/25489e53e9f308c5fca3d0293aeceb716b53149d].
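
For reference, a plain POSIX sketch (not Mesos source; the function name and error handling are assumptions) of the behaviour the reporter expects: when the slave runs as root, the executor work directory should end up owned by the task user, here flxjob:

{code:title=Hypothetical sketch: chowning the executor directory to the task user|borderStyle=solid}
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

// Change ownership of 'path' to 'user'; Mesos would apply this to the
// executor work directory (recursively) after creating it.
bool chownToUser(const char* path, const char* user) {
  struct passwd* pw = ::getpwnam(user);  // look up uid/gid for the user
  if (pw == nullptr) {
    std::fprintf(stderr, "unknown user '%s'\n", user);
    return false;
  }
  if (::chown(path, pw->pw_uid, pw->pw_gid) != 0) {
    std::perror("chown");
    return false;
  }
  return true;
}
{code}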



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2171) Compilation error on Ubuntu 12.04

2015-03-20 Thread Mark Luntzel (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371546#comment-14371546
 ] 

Mark Luntzel commented on MESOS-2171:
-

Sorry I can't test this anymore, as I'm no longer with the company that uses 
it. 

 Compilation error on Ubuntu 12.04 
 --

 Key: MESOS-2171
 URL: https://issues.apache.org/jira/browse/MESOS-2171
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 0.22.0
 Environment: Ubuntu 12.04 
 java version 1.6.0_33
 gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Reporter: Mark Luntzel
Priority: Blocker

 Following http://mesos.apache.org/gettingstarted/ we get a compilation error:
 ../../../3rdparty/libprocess/src/clock.cpp:167:36:   instantiated from here
 /usr/include/c++/4.6/tr1/functional:2040:46: error: invalid initialization of 
 reference of type 'std::list<process::Timer>&' from expression of type 
 'std::list<process::Timer>'
 /usr/include/c++/4.6/tr1/functional:2040:46: error: return-statement with a 
 value, in function returning 'void' [-fpermissive]
 make[4]: *** [libprocess_la-clock.lo] Error 1
 More output here: 
 https://gist.github.com/luntzel/f5a3c62297aae812c986
 Please advise. Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Created] (MESOS-2524) Mesos-containerizer not linked from main documentation page.

2015-03-20 Thread Vinod Kone
No reason. Feel free to fix.

On Fri, Mar 20, 2015 at 6:02 AM, Joerg Schad (JIRA) j...@apache.org wrote:

 Joerg Schad created MESOS-2524:
 --

  Summary: Mesos-containerizer not linked from main
 documentation page.
  Key: MESOS-2524
  URL: https://issues.apache.org/jira/browse/MESOS-2524
  Project: Mesos
   Issue Type: Documentation
 Reporter: Joerg Schad
 Assignee: Joerg Schad
 Priority: Minor


 Is there any reason that the mesos-containerizer (
 http://mesos.apache.org/documentation/latest/mesos-containerizer/)
 documentation is not linked from the main documentation page (
 http://mesos.apache.org/documentation/latest/)? Both the docker and external
 containerizer pages are linked from here.



 --
 This message was sent by Atlassian JIRA
 (v6.3.4#6332)



[jira] [Closed] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2015-03-20 Thread Isabel Jimenez (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Jimenez closed MESOS-1570.
-
Resolution: Fixed

 Make check Error when Building Mesos in a Docker container 
 ---

 Key: MESOS-1570
 URL: https://issues.apache.org/jira/browse/MESOS-1570
 Project: Mesos
  Issue Type: Bug
Reporter: Isabel Jimenez
Assignee: Isabel Jimenez
Priority: Minor
  Labels: Docker

 When building Mesos inside a Docker container, it is currently impossible 
 to run the tests even when you run Docker in --privileged mode. There is a 
 test in stout that sets all the namespaces, and libcontainer does not support 
 setting the 'user' namespace (more information 
 [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
  This is the error:
 {code:title=Make check failed test|borderStyle=solid}
 [--] 1 test from OsSetnsTest
 [ RUN  ] OsSetnsTest.setns
 ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
 Failure
 os::setns(::getpid(), ns): Invalid argument
 [  FAILED  ] OsSetnsTest.setns (7 ms)
 [--] 1 test from OsSetnsTest (7 ms total)
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] OsSetnsTest.setns
  1 FAILED TEST
 {code}
 This can be disabled, as Mesos does not need to set the 'user' namespace. I 
 don't know if Docker will one day support setting the user namespace, since 
 it's a relatively new kernel feature. What would be the best approach to this 
 issue? (Disabling the 'user' namespace in stout, disabling just this test, ...)
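
As one illustration of the first option, a minimal sketch (not the actual stout test; the function name and namespace list are assumptions) of skipping the 'user' namespace when deciding which namespaces the setns test should exercise:

{code:title=Hypothetical sketch: dropping the 'user' namespace from the setns test|borderStyle=solid}
#include <set>
#include <string>

std::set<std::string> namespacesToTest() {
  std::set<std::string> namespaces = {"ipc", "uts", "net", "pid", "mnt", "user"};

  // libcontainer (used by Docker at the time) cannot enter a 'user'
  // namespace, so skip it when the tests run inside a Docker container.
  namespaces.erase("user");

  return namespaces;
}
{code}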



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1570) Make check Error when Building Mesos in a Docker container

2015-03-20 Thread Isabel Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371949#comment-14371949
 ] 

Isabel Jimenez commented on MESOS-1570:
---

Namespace functions were moved from stout to /linux. Tests were changed. This 
is not an issue anymore. 
https://reviews.apache.org/r/27092 

 Make check Error when Building Mesos in a Docker container 
 ---

 Key: MESOS-1570
 URL: https://issues.apache.org/jira/browse/MESOS-1570
 Project: Mesos
  Issue Type: Bug
Reporter: Isabel Jimenez
Assignee: Isabel Jimenez
Priority: Minor
  Labels: Docker

 When building Mesos inside a Docker container, it is currently impossible 
 to run the tests even when you run Docker in --privileged mode. There is a 
 test in stout that sets all the namespaces, and libcontainer does not support 
 setting the 'user' namespace (more information 
 [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]).
  This is the error:
 {code:title=Make check failed test|borderStyle=solid}
 [--] 1 test from OsSetnsTest
 [ RUN  ] OsSetnsTest.setns
 ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: 
 Failure
 os::setns(::getpid(), ns): Invalid argument
 [  FAILED  ] OsSetnsTest.setns (7 ms)
 [--] 1 test from OsSetnsTest (7 ms total)
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] OsSetnsTest.setns
  1 FAILED TEST
 {code}
 This can be disabled, as Mesos does not need to set the 'user' namespace. I 
 don't know if Docker will one day support setting the user namespace, since 
 it's a relatively new kernel feature. What would be the best approach to this 
 issue? (Disabling the 'user' namespace in stout, disabling just this test, ...)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2367:
--
  Sprint: Twitter Mesos Q1 Sprint 5
Story Points: 5

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Assignee: Jie Yu
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-03-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-2367:
-

Assignee: Jie Yu

 Improve slave resiliency in the face of orphan containers 
 --

 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith
Assignee: Jie Yu
Priority: Critical

 Right now there's a case where a misbehaving executor can cause a slave 
 process to flap:
 {panel:title=Quote From [~jieyu]}
 {quote}
 1) User tries to kill an instance
 2) Slave sends {{KillTaskMessage}} to executor
 3) Executor sends kill signals to task processes
 4) Executor sends {{TASK_KILLED}} to slave
 5) Slave updates container cpu limit to be 0.01 cpus
 6) A user-process is still processing the kill signal
 7) the task process cannot exit since it has too little cpu share and is 
 throttled
 8) Executor itself terminates
 9) Slave tries to destroy the container, but cannot because the user-process 
 is stuck in the exit path.
 10) Slave restarts, and is constantly flapping because it cannot kill orphan 
 containers
 {quote}
 {panel}
 The slave's orphan container handling should be improved to deal with this 
 case despite ill-behaved users (framework writers).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2508) Slave recovering a docker container results in Unknow container error

2015-03-20 Thread Geoffroy Jabouley (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geoffroy Jabouley updated MESOS-2508:
-
Priority: Minor  (was: Major)

Changed priority to minor.

The fix is to start mesos-slave with only the Docker containerizer activated 
(via the --containerizers flag).

This is fine if only Docker tasks are started with Mesos.

 Slave recovering a docker container results in Unknow container error
 ---

 Key: MESOS-2508
 URL: https://issues.apache.org/jira/browse/MESOS-2508
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker, slave
Affects Versions: 0.21.1
 Environment: Ubuntu 14.04.2 LTS
 Docker 1.5.0 (same error with 1.4.1)
 Mesos 0.21.1 installed from mesosphere ubuntu repo
 Marathon 0.8.0 installed from mesosphere ubuntu repo
Reporter: Geoffroy Jabouley
Priority: Minor

 I'm seeing some error logs occurring during a slave recovery of a Mesos task 
 running in a docker container.
 It does not impede the slave recovery process, as the mesos task is still 
 active and running on the slave after the recovery.
 But something is not working properly when the slave is recovering my 
 docker container: the slave detects my container as an Unknown container.
 Cluster status:
 - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host.
 - checkpointing is activated on both slave and framework
 - use native docker containerizer
 - 1 mesos task, started using marathon, is running inside a docker container 
 and is monitored by the mesos-slave
 Action:
 - restart the mesos-slave process (sudo restart mesos-slave)
 Expected:
 - docker container still running
 - mesos task still running
 - no error in the mesos slave log regarding recovery process
 Seen:
 - docker container still running
 - mesos task still running
 - {color:red}Several errors *Unknown container* in the mesos slave log during 
 recovery process{color}
 ---
 For what it's worth, here are my investigations:
 1) The mesos task starts fine in the docker container 
 *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker 
 container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by 
 the Mesos docker containerizer, _I guess_...
 {code}
 I0317 09:56:14.300439  2784 slave.cpp:1083] Got assigned task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.380702  2784 slave.cpp:1193] Launching task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.384466  2784 slave.cpp:3997] Launching executor 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
 20150311-150951-3982541578-5050-50860- in work directory 
 '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
 I0317 09:56:14.390207  2784 slave.cpp:1316] Queuing task 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
 '20150311-150951-3982541578-5050-50860-
 I0317 09:56:14.421787  2782 docker.cpp:927] Starting container 
 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework 
 '20150311-150951-3982541578-5050-50860-'
 I0317 09:56:15.784143  2781 docker.cpp:633] Checkpointing pid 27080 to 
 '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
 I0317 09:56:15.789443  2784 slave.cpp:2840] Monitoring executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 '20150311-150951-3982541578-5050-50860-' in container 
 'adb71dc4-c07d-42a9-8fed-264c241668ad'
 I0317 09:56:15.862642  2784 slave.cpp:1860] Got registration for executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 20150311-150951-3982541578-5050-50860- from 
 executor(1)@10.195.96.237:36021
 I0317 09:56:15.865319  2784 slave.cpp:1979] Flushing queued task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 
 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
 20150311-150951-3982541578-5050-50860-
 I0317 09:56:15.885414  2787 slave.cpp:2215] Handling status update 
 TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
 test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
 

[jira] [Commented] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations

2015-03-20 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372263#comment-14372263
 ] 

Michael Park commented on MESOS-2490:
-

https://reviews.apache.org/r/32333/

 Enable the allocator to distinguish between role and framework reservations
 ---

 Key: MESOS-2490
 URL: https://issues.apache.org/jira/browse/MESOS-2490
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 h3. Goal
 This is the subsequent task after 
 [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a 
 framework to send back a reservation offer operation to reserve resources for 
 its role. The goal for this ticket is to teach the allocator to distinguish 
 between a role reservation and framework reservation. Note in particular that 
 this means updating the sorter is out of scope of this task. The goal is 
 strictly to teach the allocator how to send offers to a particular framework 
 rather than a role.
 h3. Expected Outcome
 * The framework can send back reservation operations to (un)reserve resources 
 for itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2489) Enable a framework to perform reservation operations.

2015-03-20 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2489:

Sprint: Mesosphere Q1 Sprint 6 - 4/3

 Enable a framework to perform reservation operations.
 -

 Key: MESOS-2489
 URL: https://issues.apache.org/jira/browse/MESOS-2489
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 h3. Goal
 This is the first step to supporting dynamic reservations. The goal of this 
 task is to enable a framework to reply to a resource offer with *Reserve* and 
 *Unreserve* offer operations as defined by {{Offer::Operation}} in 
 {{mesos.proto}}.
 h3. Overview
 It's divided into a few subtasks so that it's clear what the small chunks to 
 be addressed are. In summary, we need to introduce the 
 {{Resource::ReservationInfo}} protobuf message to encapsulate the reservation 
 information, enable the C++ {{Resources}} class to handle it, and then enable 
 the master to handle reservation operations.
 h3. Expected Outcome
 * The framework will be able to send back reservation operations to 
 (un)reserve resources.
 * The reservations are kept only in the master since we don't send the 
 {{CheckpointResources}} message to checkpoint the reservations on the slave 
 yet.
 * The reservations are considered to be reserved for the framework's role.
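
To make the shape of such a reply concrete, here is a minimal sketch (not the final API; it assumes {{Offer::Operation}} gains a {{RESERVE}} type whose {{Reserve}} message carries the resources to reserve, and that {{Resource}} gains the {{ReservationInfo}} field described above):

{code:title=Hypothetical sketch: building a RESERVE offer operation|borderStyle=solid}
#include <string>

#include <mesos/mesos.hpp>

using namespace mesos;

// Build an operation that dynamically reserves 'cpus' CPUs for 'role'.
Offer::Operation makeReserveOperation(double cpus, const std::string& role) {
  Resource resource;
  resource.set_name("cpus");
  resource.set_type(Value::SCALAR);
  resource.mutable_scalar()->set_value(cpus);
  resource.set_role(role);
  resource.mutable_reservation();  // presence marks a dynamic reservation

  Offer::Operation operation;
  operation.set_type(Offer::Operation::RESERVE);
  operation.mutable_reserve()->add_resources()->CopyFrom(resource);
  return operation;
}
{code}

A framework would send one or more such operations back in its reply to a resource offer, instead of (or alongside) launching tasks.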



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations

2015-03-20 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372263#comment-14372263
 ] 

Michael Park edited comment on MESOS-2490 at 3/20/15 10:47 PM:
---

[r32333|https://reviews.apache.org/r/32333/]


was (Author: mcypark):
https://reviews.apache.org/r/32333/

 Enable the allocator to distinguish between role and framework reservations
 ---

 Key: MESOS-2490
 URL: https://issues.apache.org/jira/browse/MESOS-2490
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 h3. Goal
 This is the subsequent task after 
 [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a 
 framework to send back a reservation offer operation to reserve resources for 
 its role. The goal for this ticket is to teach the allocator to distinguish 
 between a role reservation and framework reservation. Note in particular that 
 this means updating the sorter is out of scope of this task. The goal is 
 strictly to teach the allocator how to send offers to a particular framework 
 rather than a role.
 h3. Expected Outcome
 * The framework can send back reservation operations to (un)reserve resources 
 for itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations

2015-03-20 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2490:

Sprint: Mesosphere Q1 Sprint 6 - 4/3

 Enable the allocator to distinguish between role and framework reservations
 ---

 Key: MESOS-2490
 URL: https://issues.apache.org/jira/browse/MESOS-2490
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Michael Park
Assignee: Michael Park
  Labels: mesosphere

 h3. Goal
 This is the subsequent task after 
 [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a 
 framework to send back a reservation offer operation to reserve resources for 
 its role. The goal for this ticket is to teach the allocator to distinguish 
 between a role reservation and framework reservation. Note in particular that 
 this means updating the sorter is out of scope of this task. The goal is 
 strictly to teach the allocator how to send offers to a particular framework 
 rather than a role.
 h3. Expected Outcome
 * The framework can send back reservation operations to (un)reserve resources 
 for itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)