[jira] [Updated] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-20 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6180:
-
Attachment: RoleTest.ImplicitRoleRegister.txt

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the JIRAs linked above for individual tickets addressing some of these 
> failures.





[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-20 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508838#comment-15508838
 ] 

Greg Mann commented on MESOS-6180:
--

Another common error seen when this issue manifests is:
{code}
Recovery failed: Failed to recover registrar: Failed to perform fetch within 
1mins
{code}
See the file {{RoleTest.ImplicitRoleRegister.txt}} for the full test log.

[~haosd...@gmail.com], there is a review 
[here|https://reviews.apache.org/r/41665/] proposing the {{in_memory}} registry 
for tests. I'm currently trying to figure out whether this is a legitimate bug 
or simply the result of an unreasonable load put on the machine.
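
For anyone who wants to experiment, a minimal sketch of what that might look 
like in a test, assuming the usual test fixture helpers (names here are 
illustrative, not necessarily what the review implements):

{code}
// Sketch of a test body using the in-memory registry, so registrar
// recovery does not depend on local disk I/O. "in_memory" is the flag
// value discussed on this ticket; the fixture helpers are assumptions.
master::Flags flags = CreateMasterFlags();
flags.registry = "in_memory";  // instead of "replicated_log"

Try<process::Owned<cluster::Master>> master = StartMaster(flags);
ASSERT_SOME(master);
{code}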

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> RoleTest.ImplicitRoleRegister.txt, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the JIRAs linked above for individual tickets addressing some of these 
> failures.





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508695#comment-15508695
 ] 

Charles Allen commented on MESOS-6210:
--

Submitted https://reviews.apache.org/r/52105/

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>Assignee: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6003) Add logging module for logging to an external program

2016-09-20 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508298#comment-15508298
 ] 

Joseph Wu commented on MESOS-6003:
--

Module: https://reviews.apache.org/r/51257/
Docs: https://reviews.apache.org/r/51258/

> Add logging module for logging to an external program
> -
>
> Key: MESOS-6003
> URL: https://issues.apache.org/jira/browse/MESOS-6003
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Will Rouesnel
>Assignee: Will Rouesnel
>Priority: Minor
>
> In the vein of the logrotate module for logging, there should be a similar 
> module which provides support for logging to an arbitrary log handling 
> program, with suitable task metadata provided by environment variables or 
> command line arguments.
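
As a rough sketch of the idea, not the proposed module's API (the handler path 
and variable names below are hypothetical):

{code}
// Sketch: forward log lines to an arbitrary external handler, passing
// task metadata through the environment, in the same spirit as the
// existing logrotate container logger.
#include <cstdio>
#include <cstdlib>

int main() {
  // Hypothetical metadata; a real module would receive these from the
  // agent for each container.
  setenv("MESOS_LOG_STREAM", "STDOUT", 1);
  setenv("MESOS_TASK_ID", "my-task-0001", 1);

  // popen() inherits the environment, so the handler can read the
  // metadata variables set above from its own environment.
  FILE* logger = popen("/usr/local/bin/my-log-handler", "w");
  if (logger == nullptr) {
    return 1;
  }

  fputs("hello from the task\n", logger);
  return pclose(logger);
}
{code}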





[jira] [Commented] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate

2016-09-20 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508280#comment-15508280
 ] 

Greg Mann commented on MESOS-4760:
--

[~mrbrowning], it looks like this is on hold for the time being; if that's 
correct, would you mind discarding the associated review request for now? Do 
you have any plans to continue this ticket in the future? If not, you could 
unassign yourself as well.

> Expose metrics and gauges for fetcher cache usage and hit rate
> --
>
> Key: MESOS-4760
> URL: https://issues.apache.org/jira/browse/MESOS-4760
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher, statistics
>Reporter: Michael Browning
>Assignee: Michael Browning
>Priority: Minor
>  Labels: features, fetcher, statistics, uber
>
> To evaluate the fetcher cache and calibrate the value of the 
> fetcher_cache_size flag, it would be useful to have metrics and gauges on 
> agents that expose operational statistics like cache hit rate, occupied cache 
> size, and time spent downloading resources that were not present.
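
As a sketch of the bookkeeping this would involve (plain C++ for illustration; 
in Mesos these would presumably be exposed as libprocess metrics, and all 
names below are hypothetical):

{code}
#include <cstdint>

// Tracks fetcher cache effectiveness; the hit rate is derived from the
// raw counters rather than stored, so it never drifts.
struct FetcherCacheMetrics {
  uint64_t hits = 0;
  uint64_t misses = 0;
  uint64_t occupiedBytes = 0;  // current cache usage
  uint64_t downloadNanos = 0;  // time spent fetching on misses

  double hitRate() const {
    const uint64_t total = hits + misses;
    return total == 0 ? 0.0 : static_cast<double>(hits) / total;
  }
};
{code}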





[jira] [Commented] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-09-20 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508216#comment-15508216
 ] 

Greg Mann commented on MESOS-3926:
--

[~lins05], is it accurate that this ticket is blocked by MESOS-5261 (which is 
itself blocked by MESOS-5259)? If so, perhaps we should discard the associated 
review request until the dependencies for this work are completed. We can 
always reopen it when the time comes.

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.
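
Roughly, the plugin abstraction might look like the following (an illustrative 
sketch, not the exact Mesos interface or module machinery):

{code}
#include <set>
#include <string>

// A URI fetcher plugin handles a set of URI schemes (e.g. "http",
// "docker") and materializes the URI's content into a directory.
class UriFetcherPlugin {
public:
  virtual ~UriFetcherPlugin() {}

  // Schemes this plugin can fetch.
  virtual std::set<std::string> schemes() const = 0;

  // Fetch 'uri' into 'directory'; returns true on success.
  virtual bool fetch(const std::string& uri,
                     const std::string& directory) = 0;
};

// Modularizing this interface means third parties can ship a subclass
// as a loadable module and register it for additional schemes.
{code}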





[jira] [Commented] (MESOS-1104) Move linux/fs.hpp out of `mesos` namespace in linux/fs.h

2016-09-20 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508185#comment-15508185
 ] 

Greg Mann commented on MESOS-1104:
--

[~xds2000], sorry for the confusion! And thanks for your patch :)

I think that this issue still makes sense. [~mcypark], I don't think the 
intention was to consolidate implementations. Rather, as Deshi suggested, I 
think the TODO indicates that a namespace nested within {{mesos::}} doesn't 
make sense for a header that provides generic Linux functionality that isn't 
Mesos-specific. Looking at the namespaces declared within {{src/linux/}}, usage 
is not entirely consistent, but in most cases we do not use the {{mesos::}} 
prefix.

Deshi, if you want to pick this back up, we could try to find a shepherd for 
the ticket and continue. If you're busy with other things and don't have time, 
would you mind discarding the review request? We can always reopen it in the 
future.

> Move linux/fs.hpp out of `mesos` namespace in linux/fs.h
> 
>
> Key: MESOS-1104
> URL: https://issues.apache.org/jira/browse/MESOS-1104
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Archana kumari
>Assignee: Deshi Xiao
>  Labels: mesosphere, newbie
>






[jira] [Commented] (MESOS-6213) Build failure on OSX

2016-09-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507725#comment-15507725
 ] 

Charles Allen commented on MESOS-6213:
--

Something was stale in the configs; a fresh reboot solved this.

> Build failure on OSX
> 
>
> Key: MESOS-6213
> URL: https://issues.apache.org/jira/browse/MESOS-6213
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Charles Allen
>
> Building on OSX is giving the following error.
> {code}
> In file included from 
> ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184:
> ../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:173:9:
>  error: 'OSAtomicCompareAndSwap64Barrier' is deprecated: first
>   deprecated in macOS 10.12 - Use std::atomic_compare_exchange_strong() 
> from <atomic> instead [-Werror,-Wdeprecated-declarations]
> if (OSAtomicCompareAndSwap64Barrier(
> ^
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:645:9:
>  note:
>   'OSAtomicCompareAndSwap64Barrier' has been explicitly marked deprecated 
> here
> bool OSAtomicCompareAndSwap64Barrier( int64_t __oldValue, int64_t 
> __newValue,
> ^
> {code}
> Protobuf is not listed as a component, so I just set the component to {{build}}.





[jira] [Commented] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507560#comment-15507560
 ] 

haosdent commented on MESOS-6180:
-

Many thanks to [~greggomann] for helping me reproduce this in my AWS instance! 
The reason I couldn't reproduce it before is that I was running {{stress}} and 
{{mesos-tests}} on a separate disk, distinct from the root disk, so {{stress}} 
didn't affect the root filesystem Linux was using. If I run {{stress}} on the 
root disk and {{mesos-tests}} on the separate disk, the failure reproduces 
within a few test iterations.

A workaround is to set {{flags.registry = "in_memory"}} when running the tests; 
I have not reproduced the errors since using it. That said, I now think these 
test failures should be expected, because the root filesystem could not 
function normally under that load. Do you think we should use {{flags.registry 
= "in_memory"}}, or just ignore these failures? cc [~jieyu] [~vinodkone] 
[~kaysoky]

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the JIRAs linked above for individual tickets addressing some of these 
> failures.





[jira] [Created] (MESOS-6214) Containerizers assume caller will call 'destroy' if 'launch' fails.

2016-09-20 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-6214:
--

 Summary: Containerizers assume caller will call 'destroy' if 
'launch' fails.
 Key: MESOS-6214
 URL: https://issues.apache.org/jira/browse/MESOS-6214
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Benjamin Mahler


The planned API for nested containers allows launching, waiting on (for 
termination), and killing (currently only via SIGKILL) a nested container. 
Note that this API provides no mechanism for "cleaning up" the container, 
because cleanup happens implicitly once the container terminates.

However, the containerizer currently assumes that the caller will call destroy 
if the launch fails. In order to implement the agent's API for managing nested 
containers, we will have to set up a failure continuation to call destroy to 
ensure the cleanup occurs correctly.

Ideally, the containerizer API would not require the caller to call destroy 
after a launch failure; given that the launch did not succeed, it seems 
counter-intuitive for the responsibility of cleanup to fall on the caller. In 
addition, the containerizer already cleans up implicitly when a container 
terminates, so requiring an explicit destroy here is inconsistent as well.
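
A minimal synchronous sketch of the pattern (the real code would attach the 
destroy call to the launch future's failure path, e.g. an onFailed 
continuation; all names here are illustrative):

{code}
#include <iostream>
#include <string>

// Stand-ins for the containerizer operations.
bool launch(const std::string& containerId) {
  std::cout << "launching " << containerId << std::endl;
  return false;  // simulate a launch failure
}

void destroy(const std::string& containerId) {
  std::cout << "destroying " << containerId << std::endl;
}

// Wraps launch so the caller no longer has to remember to clean up:
// the launch failure path itself triggers destroy.
bool launchAndCleanupOnFailure(const std::string& containerId) {
  if (!launch(containerId)) {
    destroy(containerId);  // cleanup owned by the containerizer side
    return false;
  }
  return true;
}

int main() {
  launchAndCleanupOnFailure("nested-container-1");
  return 0;
}
{code}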





[jira] [Updated] (MESOS-6000) Overlayfs backend cannot support the image with numerous layers.

2016-09-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6000:
--
Fix Version/s: 1.1.0

> Overlayfs backend cannot support the image with numerous layers.
> 
>
> Key: MESOS-6000
> URL: https://issues.apache.org/jira/browse/MESOS-6000
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 15
> Or any os with kernel 4.0+
>Reporter: Gilbert Song
>Assignee: Zhitao Li
>  Labels: backend, containerizer, overlayfs
> Fix For: 1.1.0
>
>
> This issue is exposed when testing the unified containerizer with the 
> overlayfs backend using any image with numerous layers (e.g., 38 layers). It 
> can be reproduced with the image `gilbertsong/cirros:34` (for anyone who 
> wants to test it out).
> Here is the partial log:
> {noformat}
> I0805 21:50:02.631873 11136 provisioner.cpp:315] Provisioning image rootfs 
> '/tmp/provisioner/containers/36c69ade-69db-4de3-9cd4-18b9b9c99e73/backends/overlay/rootfses/ba255b76-8326-4611-beb5-002f202b52e0'
>  for container 36c69ade-69db-4de3-9cd4-18b9b9c99e73 using overlay backend
> I0805 21:50:02.632990 11138 overlay.cpp:156] Provisioning image rootfs with 
> overlayfs: 
> 

[jira] [Updated] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6002:
--
Assignee: Qian Zhang

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>  Labels: aufs, backend, containerizer
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified with the unit test below, 
> with the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/0z753P/credentials'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005018 24314 master.cpp:499] Using 
> default 'crammd5' authenticator
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005101 24314 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005152 24314 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005192 24314 http.cpp:883] 

[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507132#comment-15507132
 ] 

Charles Allen commented on MESOS-6210:
--

[~haosd...@gmail.com] Thanks! I'll manually test it internally and report back 
here.

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>Assignee: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6063) Track recovered and prepared subsystems for a container

2016-09-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507118#comment-15507118
 ] 

Jie Yu commented on MESOS-6063:
---

commit 27a2fb864f7967142f346c2f594bbba494209f57
Author: haosdent huang 
Date:   Tue Sep 20 09:56:04 2016 -0700

Tracked recovered and prepared cgroups subsystems for containers.

Recovering newly added cgroups subsystems on existing containers would
fail, and continuing to perform `update` and other operations for the
newly added subsystems on those containers doesn't make sense. This
patch adds tracking of the recovered or prepared cgroups subsystems of
a container and skips unnecessary subsystem operations on the container
if the subsystem was never recovered or prepared.

Review: https://reviews.apache.org/r/51631/
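
In outline, the tracking amounts to something like this (an illustrative 
sketch, not the patch itself):

{code}
#include <iostream>
#include <set>
#include <string>

// Per-container record of which cgroups subsystems were actually
// recovered or prepared for it.
struct ContainerCgroups {
  std::set<std::string> subsystems;
};

// Only perform an operation (update/wait/usage/status) on subsystems
// the container actually has; skip ones added after an agent restart.
void update(const ContainerCgroups& container,
            const std::string& subsystem) {
  if (container.subsystems.count(subsystem) == 0) {
    std::cout << "skip '" << subsystem << "': never recovered/prepared\n";
    return;
  }
  std::cout << "update '" << subsystem << "'\n";
}

int main() {
  ContainerCgroups container{{"cpu", "memory"}};
  update(container, "memory");   // performed
  update(container, "net_cls");  // skipped: added after restart
  return 0;
}
{code}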

> Track recovered and prepared subsystems for a container
> ---
>
> Key: MESOS-6063
> URL: https://issues.apache.org/jira/browse/MESOS-6063
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups
>Reporter: haosdent
>Assignee: haosdent
>  Labels: cgroups
>
> Currently, when the Mesos agent is restarted with a different set of cgroups 
> subsystems, existing containers fail to recover on the newly added 
> subsystems. In this case, we ignore the failure and continue to perform 
> `usage`, `status`, and `cleanup` on them. It would be better to track the 
> recovered and prepared subsystems for each container, and then skip 
> `update`, `wait`, `usage`, and `status` for subsystems that were never 
> recovered or prepared.





[jira] [Updated] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6210:

Assignee: Charles Allen

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>Assignee: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507096#comment-15507096
 ] 

haosdent commented on MESOS-6210:
-

Hi [~drcrallen], thanks a lot for your help! For HTTP endpoint test cases, you 
may take a look at 
https://github.com/apache/mesos/blob/master/src/tests/master_tests.cpp#L3044

Redirects are a bit harder to test, though, because they require starting 
multiple running masters within a single test case, which we do not support 
yet. For that reason, this behavior has been tested manually in the past.

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507095#comment-15507095
 ] 

Charles Allen commented on MESOS-6210:
--

I meant to open that PR internally for review before going to apache-proper; 
sorry for the noise.

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507087#comment-15507087
 ] 

ASF GitHub Bot commented on MESOS-6210:
---

Github user drcrallen closed the pull request at:

https://github.com/apache/mesos/pull/169


> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507085#comment-15507085
 ] 

ASF GitHub Bot commented on MESOS-6210:
---

GitHub user drcrallen opened a pull request:

https://github.com/apache/mesos/pull/169

Add smarter master redirects.

* Fix MESOS-6210
* Any path after `/redirect` is now appended to the response
* Does not make any guarantees about non-path components of
the original request
* Updated docs to clarify this behavior.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/metamx/mesos MESOS-6210

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mesos/pull/169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #169
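
The gist of the suffix handling is simple path surgery along these lines (an 
illustrative sketch, not the actual patch):

{code}
#include <iostream>
#include <string>

// Given the request path, return the path to redirect to on the
// leading master: everything after "/redirect" is preserved, so
// "/master/redirect/master/frameworks" -> "/master/frameworks"
// instead of redirecting back to "/master/redirect" (a loop).
std::string redirectTarget(const std::string& requestPath) {
  const std::string marker = "/redirect";
  const size_t pos = requestPath.find(marker);
  if (pos == std::string::npos) {
    return requestPath;
  }
  const std::string suffix = requestPath.substr(pos + marker.size());
  return suffix.empty() ? "/" : suffix;  // no suffix: land on the root
}

int main() {
  std::cout << redirectTarget("/master/redirect/master/frameworks")
            << std::endl;  // prints "/master/frameworks"
  return 0;
}
{code}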






> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507019#comment-15507019
 ] 

Charles Allen commented on MESOS-6210:
--

Thanks [~vinodkone]. I took a crack at a patch, but I can't seem to find any 
test coverage for routing or {{Master::Http}}. Is there a place where tests 
should go?

> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Created] (MESOS-6213) Build failure on OSX

2016-09-20 Thread Charles Allen (JIRA)
Charles Allen created MESOS-6213:


 Summary: Build failure on OSX
 Key: MESOS-6213
 URL: https://issues.apache.org/jira/browse/MESOS-6213
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: Charles Allen


Building on OSX is giving the following error.

{code}
In file included from 
../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops.h:184:
../3rdparty/protobuf-2.6.1/src/google/protobuf/stubs/atomicops_internals_macosx.h:173:9:
 error: 'OSAtomicCompareAndSwap64Barrier' is deprecated: first
  deprecated in macOS 10.12 - Use std::atomic_compare_exchange_strong() 
from <atomic> instead [-Werror,-Wdeprecated-declarations]
if (OSAtomicCompareAndSwap64Barrier(
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libkern/OSAtomicDeprecated.h:645:9:
 note:
  'OSAtomicCompareAndSwap64Barrier' has been explicitly marked deprecated 
here
bool OSAtomicCompareAndSwap64Barrier( int64_t __oldValue, int64_t __newValue,
^
{code}

Protobuf is not listed as a component, so I just set the component to {{build}}.
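
For reference, the migration the compiler suggests looks roughly like this (a 
sketch of the {{<atomic>}} replacement, not a patch to the bundled protobuf):

{code}
#include <atomic>
#include <cstdint>

// <atomic> replacement for the deprecated
// OSAtomicCompareAndSwap64Barrier(oldValue, newValue, ptr).
bool CompareAndSwap64Barrier(std::int64_t oldValue,
                             std::int64_t newValue,
                             std::atomic<std::int64_t>* value) {
  // compare_exchange_strong uses sequentially consistent ordering by
  // default, which subsumes the barrier semantics of the old call.
  return value->compare_exchange_strong(oldValue, newValue);
}

int main() {
  std::atomic<std::int64_t> v{42};
  const bool swapped = CompareAndSwap64Barrier(42, 7, &v);
  return swapped && v.load() == 7 ? 0 : 1;
}
{code}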





[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506866#comment-15506866
 ] 

Qian Zhang edited comment on MESOS-6002 at 9/20/16 3:36 PM:


The reason that the test 
{{ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout}} fails is that the 
file {{/etc/rc3.d/S40-network}} is not removed when we provision the rootfs 
for the container. In one of the layers of the {{cirros}} image, there is a 
whiteout file {{/etc/rc3.d/.wh.S40-network}}, so we should remove 
{{/etc/rc3.d/S40-network}} based on this whiteout file in 
{{ProvisionerProcess::__provision()}}. The problem is that in 
{{AufsBackendProcess::provision()}}, after we mount all the layers of the 
{{cirros}} image into the container's rootfs, the whiteout file 
{{/etc/rc3.d/.wh.S40-network}} disappears, so we do not remove 
{{/etc/rc3.d/S40-network}}. There might be something wrong with how 
{{AufsBackendProcess::provision()}} performs the aufs mount.


was (Author: qianzhang):
The reason that the test 
{{ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout}} fails is that the 
file {{/etc/rc3.d/S40-network}} is not removed when we provision the rootfs 
for the container. In one of the layers of the {{cirros}} image, there is a 
file {{/etc/rc3.d/.wh.S40-network}}, so we should remove 
{{/etc/rc3.d/S40-network}} based on this file in 
{{ProvisionerProcess::__provision()}}. The problem is that in 
{{AufsBackendProcess::provision()}}, after we mount all the layers of the 
{{cirros}} image into the container's rootfs, the file 
{{/etc/rc3.d/.wh.S40-network}} disappears, so we do not remove 
{{/etc/rc3.d/S40-network}}. There might be something wrong with how 
{{AufsBackendProcess::provision()}} performs the aufs mount.
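
For context, whiteout handling during provisioning boils down to something 
like the following (an illustrative sketch using std::filesystem, not the 
actual provisioner code):

{code}
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Scan a provisioned rootfs for aufs-style whiteout markers
// (".wh.<name>"); each marker means the file it names was deleted in
// an upper layer, so remove both the hidden file and the marker.
// If the mount already consumed the markers (as described above),
// nothing is found and the hidden files incorrectly survive.
void removeWhiteouts(const fs::path& rootfs) {
  std::vector<fs::path> markers;
  for (const auto& entry : fs::recursive_directory_iterator(rootfs)) {
    const std::string name = entry.path().filename().string();
    if (name.rfind(".wh.", 0) == 0) {  // starts with ".wh."
      markers.push_back(entry.path());
    }
  }

  for (const fs::path& marker : markers) {
    const fs::path hidden =
      marker.parent_path() / marker.filename().string().substr(4);
    std::cout << "whiteout: removing " << hidden << std::endl;
    fs::remove_all(hidden);  // e.g. /etc/rc3.d/S40-network
    fs::remove(marker);      // the marker itself
  }
}
{code}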

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified with the unit test below, 
> with the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" 

[jira] [Commented] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506866#comment-15506866
 ] 

Qian Zhang commented on MESOS-6002:
---

The reason that the test 
{{ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout}} fails is that the 
file {{/etc/rc3.d/S40-network}} is not removed when we provision the rootfs 
for the container. In one of the layers of the {{cirros}} image, there is a 
file {{/etc/rc3.d/.wh.S40-network}}, so we should remove 
{{/etc/rc3.d/S40-network}} based on this file in 
{{ProvisionerProcess::__provision()}}. The problem is that in 
{{AufsBackendProcess::provision()}}, after we mount all the layers of the 
{{cirros}} image into the container's rootfs, the file 
{{/etc/rc3.d/.wh.S40-network}} disappears, so we do not remove 
{{/etc/rc3.d/S40-network}}. There might be something wrong with how 
{{AufsBackendProcess::provision()}} performs the aufs mount.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified with the unit test below, 
> with the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 

[jira] [Updated] (MESOS-6208) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.

2016-09-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6208:
---
 Assignee: Alexander Rukletsov
   Sprint: Mesosphere Sprint 43
 Story Points: 1
   Labels: mesosphere  (was: )
Fix Version/s: 1.1.0

> Containers that use the Mesos containerizer but don't want to provision a 
> container image fail to validate.
> ---
>
> Key: MESOS-6208
> URL: https://issues.apache.org/jira/browse/MESOS-6208
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Mesos HEAD, change was introduced with 
> e65f580bf0cbea64cedf521cf169b9b4c9f85454
>Reporter: Jan Schlicht
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Tasks that use features like volumes or CNI in their containers have to 
> define these in {{TaskInfo.container}}. When such tasks don't want or need 
> to provision a container image, neither {{ContainerInfo.docker}} nor 
> {{ContainerInfo.mesos}} will be set. Nevertheless, the container type in 
> {{ContainerInfo.type}} needs to be set, because it is a required field.
> In that case, the recently introduced validation rules in 
> {{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't 
> expected.
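
Conceptually, the validation needs to accept an image-less {{ContainerInfo}}, 
along these lines (a simplified model, not the actual {{master/validation.cpp}} 
code; the DOCKER branch is an assumption for illustration):

{code}
#include <optional>
#include <string>

// Simplified model of the relevant ContainerInfo fields.
struct ContainerInfo {
  enum class Type { DOCKER, MESOS };
  Type type;                         // required field
  std::optional<std::string> image;  // unset: no image to provision
};

// Returns an error message, or std::nullopt if valid. A MESOS-type
// container without an image is legal: it may carry only volumes or
// CNI configuration.
std::optional<std::string> validateContainerInfo(const ContainerInfo& info) {
  if (info.type == ContainerInfo::Type::DOCKER && !info.image) {
    return "DockerInfo must be set for a DOCKER container";  // assumption
  }
  return std::nullopt;  // image-less MESOS containers pass validation
}
{code}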





[jira] [Updated] (MESOS-6157) ContainerInfo is not validated.

2016-09-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6157:
---
Shepherd: Jie Yu
  Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere 
Sprint 42)
Story Points: 3  (was: 1)

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently Mesos does not validate {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be 
> accepted.





[jira] [Commented] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2016-09-20 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506780#comment-15506780
 ] 

Neil Conway commented on MESOS-6135:


I'd opt for silencing the executor's logging, because it is a bit more precise.

> ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky
> --
>
> Key: MESOS-6135
> URL: https://issues.apache.org/jira/browse/MESOS-6135
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: Ubuntu 14, libev, non-SSL
>Reporter: Greg Mann
>  Labels: logging, mesosphere
>
> Observed in our internal CI:
> {code}
> [19:53:51] :   [Step 10/10] [ RUN  ] 
> ContainerLoggerTest.LOGROTATE_RotateInSandbox
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.460055 23729 cluster.cpp:157] 
> Creating default 'local' authorizer
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.468907 23729 leveldb.cpp:174] 
> Opened db in 8.730166ms
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472470 23729 leveldb.cpp:181] 
> Compacted db in 3.544028ms
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472491 23729 leveldb.cpp:196] 
> Created db iterator in 3678ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472496 23729 leveldb.cpp:202] 
> Seeked to beginning of db in 673ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472499 23729 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 256ns
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472510 23729 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472709 23744 recover.cpp:451] 
> Starting replica recovery
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.472820 23748 recover.cpp:477] 
> Replica is in EMPTY status
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473059 23748 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(177)@172.30.2.89:44578
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473146 23746 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473234 23745 recover.cpp:568] 
> Updating replica status to STARTING
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473629 23747 master.cpp:379] 
> Master 6d1b2727-f42d-446b-b2f8-a9f7e7667340 (ip-172-30-2-89.mesosphere.io) 
> started on 172.30.2.89:44578
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473644 23747 master.cpp:381] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ceLmd7/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/ceLmd7/master" --zk_session_timeout="10secs"
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473832 23747 master.cpp:431] 
> Master only allowing authenticated frameworks to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473844 23747 master.cpp:445] 
> Master only allowing authenticated agents to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473850 23747 master.cpp:458] 
> Master only allowing authenticated HTTP frameworks to register
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473856 23747 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/ceLmd7/credentials'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.473975 23747 master.cpp:503] Using 
> default 'crammd5' authenticator
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474028 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474097 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474161 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474242 23747 master.cpp:583] 
> Authorization enabled
> [19:53:51]W:   [Step 

[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505990#comment-15505990
 ] 

Stéphane Cottin edited comment on MESOS-6002 at 9/20/16 8:18 AM:
-

This simple workaround, which tests whether the file exists before trying to 
delete it, works for me.
Of course this is just a temporary hack; proper handling of whiteout opaque 
directories should be added.


was (Author: kaalh):
This simple workaround, which tests whether the file exists before trying to 
delete it, works for me.
Of course this is just a temporary hack; proper handling of whiteout opaque 
directories should be added.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified with the unit test below, 
> with the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/0z753P/credentials'
> [20:11:25]W:   [Step 10/10] I0805 

[jira] [Updated] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stéphane Cottin updated MESOS-6002:
---
Attachment: whiteout.diff

This simple workaround, which tests whether the file exists before trying to 
delete it, works for me.
Of course this is just a temporary hack; proper handling of whiteout opaque 
directories should be added.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified with the unit test below, 
> with the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/0z753P/credentials'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005018 24314 master.cpp:499] Using 
> default 'crammd5' authenticator
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005101 24314 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [20:11:25]W:   [Step 10/10] I0805 

[jira] [Commented] (MESOS-6208) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.

2016-09-20 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505978#comment-15505978
 ] 

Jan Schlicht commented on MESOS-6208:
-

Also see discussion in https://reviews.apache.org/r/51865/

> Containers that use the Mesos containerizer but don't want to provision a 
> container image fail to validate.
> ---
>
> Key: MESOS-6208
> URL: https://issues.apache.org/jira/browse/MESOS-6208
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Mesos HEAD; the change was introduced with 
> e65f580bf0cbea64cedf521cf169b9b4c9f85454
>Reporter: Jan Schlicht
>
> Tasks that use features like volumes or CNI in their containers have to define 
> these in {{TaskInfo.container}}. When such tasks don't want or need to 
> provision a container image, neither {{ContainerInfo.docker}} nor 
> {{ContainerInfo.mesos}} will be set. Nevertheless, the container type in 
> {{ContainerInfo.type}} must still be set, because it is a required field.
> In that case, the recently introduced validation rules in 
> {{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't 
> expected.
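
A minimal sketch of the relaxation being discussed, using the names from 
{{master/validation.cpp}} (illustrative only; the actual change is in the review 
linked above):

{code}
// Sketch: only the DOCKER type requires its sub-message. A MESOS-type
// container with neither ContainerInfo.docker nor ContainerInfo.mesos set is
// legitimate -- the task wants volumes/CNI but no container image.
#include <mesos/mesos.hpp>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>

Option<Error> validateContainerInfo(const mesos::ContainerInfo& containerInfo)
{
  if (containerInfo.type() == mesos::ContainerInfo::DOCKER &&
      !containerInfo.has_docker()) {
    return Error("'ContainerInfo.docker' must be set for the DOCKER type");
  }

  // Intentionally no `has_mesos()` requirement for the MESOS type.
  return None();
}
{code}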



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504388#comment-15504388
 ] 

Stéphane Cottin edited comment on MESOS-6002 at 9/20/16 8:09 AM:
-

Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq and many other docker 
images, all of which delete the same folder in multiple RUN calls.

Update: forgot to mention, this was patched with https://reviews.apache.org/r/51124/
Update 2: tested unpatched, same behavior


was (Author: kaalh):
Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq and many other docker 
images, all of which delete the same folder in multiple RUN calls.

Update: forgot to mention, this was patched with https://reviews.apache.org/r/51124/ 

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12, 
> or any OS with the aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified by the following unit test with 
> the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"

[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-20 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505959#comment-15505959
 ] 

Marc Villacorta commented on MESOS-6202:


Sure, here you have it: MESOS-6212

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers on my CoreOS system whose names start with 
> _'mesos-'_; those are _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem, but when I start the third one 
> _('mesos-agent')_, all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_, everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code, which is marked to be removed after a deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.
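
The cited lines boil down to a bare prefix match during orphan cleanup, roughly 
like this paraphrase (helper name assumed, not the exact source):

{code}
// Paraphrase of the deprecated check in src/slave/containerizer/docker.cpp:
// during agent recovery, any running docker container whose name begins with
// the legacy "mesos-" prefix is treated as Mesos-launched, and is therefore
// a kill candidate when --docker_kill_orphans is true.
#include <string>

#include <stout/strings.hpp>

const std::string DOCKER_NAME_PREFIX = "mesos-";

bool looksLikeLegacyMesosContainer(const std::string& containerName)
{
  return strings::startsWith(containerName, DOCKER_NAME_PREFIX);
}
{code}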



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-09-20 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6212:
--

 Summary: Validate the name format of mesos-managed docker 
containers
 Key: MESOS-6212
 URL: https://issues.apache.org/jira/browse/MESOS-6212
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Affects Versions: 1.0.1
Reporter: Marc Villacorta
Priority: Minor


Validate the name format of mesos-managed docker containers in order to avoid 
false positives when looking for orphaned mesos tasks.

Currently, names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ are 
wrongly terminated when {{--docker_kill_orphans}} is set to true (the default).
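
A hedged sketch of such a check: Mesos-generated container names embed 
identifiers after the prefix, so matching a stricter pattern (the regex below is 
illustrative, not an agreed format) would leave hand-named containers like 
_'mesos-dns'_ alone:

{code}
// Illustrative only: classify a container as mesos-managed when the name is
// "mesos-" followed by a UUID-shaped identifier (with an optional suffix),
// instead of matching on the bare "mesos-" prefix.
#include <regex>
#include <string>

bool isMesosManagedName(const std::string& name)
{
  static const std::regex pattern(
      "^mesos-[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(\\..+)?$");

  return std::regex_match(name, pattern);
}
{code}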



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6169) --docker_config doesn't work with amazon-ecr-credential-helper

2016-09-20 Thread Mao Geng (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505909#comment-15505909
 ] 

Mao Geng commented on MESOS-6169:
-

Sorry. The error only occurs when the .docker/config.json has no "auths"; I 
corrected the description accordingly. 
When I set config.json with "auths" as below (and restarted the mesos agent), it 
actually works well with 
https://github.com/awslabs/amazon-ecr-credential-helper. 
{code}
{
    "credsStore": "ecr-login",
    "auths": {
        ".dkr.ecr.us-east-1.amazonaws.com": {
        }
    }
}
{code}
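
For context on the mechanism: docker resolves a {{"credsStore": "<name>"}} entry 
to a binary called {{docker-credential-<name>}} and speaks a small stdin/stdout 
protocol with it. Roughly (output shape from the credential-helper convention; 
registry host elided as above, token made up):

{code}
$ echo ".dkr.ecr.us-east-1.amazonaws.com" | docker-credential-ecr-login get
{"ServerURL":".dkr.ecr.us-east-1.amazonaws.com","Username":"AWS","Secret":"<token>"}
{code}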

> --docker_config doesn't work with amazon-ecr-credential-helper
> --
>
> Key: MESOS-6169
> URL: https://issues.apache.org/jira/browse/MESOS-6169
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Mao Geng
>Assignee: Gilbert Song
>
> We are using AWS ECR as our docker registry and using 
> https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
> automatically. 
> As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
> like the one below: 
> {code}
> {
> "credsStore": "ecr-login"
> }
> {code}
> According to the "credsStore" field, the docker engine will invoke the 
> "docker-credential-ecr-login" command (which we've installed into /usr/bin/) 
> to get registry credentials whenever required, for example when executing 
> docker pull/push. 
> This works fine when we tar the .docker/config.json and use the uris parameter 
> to pull the tar.gz file for every task using a docker image. 
> But when I try the new --docker_config option, it doesn't work. The task 
> failed to pull the image from ECR. The error message is: 
> {code}
> Failed to launch container: Failed to run 'docker -H 
> unix:///var/run/docker.sock pull 
> .dkr.ecr.us-east-1.amazonaws.com/:latest': exited 
> with status 1; stderr='WARNING: Error loading config 
> file:/tmp/28Vd2O/.dockercfg - The Auth config file is empty unauthorized: 
> authentication required '
> {code}
> I checked the source at 
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
> understand why the above error message says it is loading a temp .dockercfg 
> file, which doesn't exist, by the way. I assume mesos should pull the image 
> using the config.json file I set via --docker_config, right? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6169) --docker_config doesn't work with amazon-ecr-credential-helper

2016-09-20 Thread Mao Geng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao Geng updated MESOS-6169:

Description: 
We are using AWS ECR as our docker registry and using 
https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
automatically. 

As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
like the one below: 
{code}
{
"credsStore": "ecr-login"
{code}
According to the "credsStore" field, the docker engine will invoke the 
"docker-credential-ecr-login" command (which we've installed into /usr/bin/) to 
get registry credentials whenever required, for example when executing docker 
pull/push. 
This works fine when we tar the .docker/config.json and use the uris parameter 
to pull the tar.gz file for every task using a docker image. 

But when I try the new --docker_config option, it doesn't work. The task failed 
to pull the image from ECR. The error message is: 
{code}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock pull 
.dkr.ecr.us-east-1.amazonaws.com/:latest': exited with 
status 1; stderr='WARNING: Error loading config file:/tmp/28Vd2O/.dockercfg - 
The Auth config file is empty unauthorized: authentication required '
{code}

I checked the source at 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
understand why the above error message says it is loading a temp .dockercfg 
file, which doesn't exist, by the way. I assume mesos should pull the image 
using the config.json file I set via --docker_config, right? 

  was:
We are using AWS ECR as our docker registry and using 
https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
automatically. 

As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
like the one below: 
{code}
{
    "credsStore": "ecr-login",
    "auths": {
        ".dkr.ecr.us-east-1.amazonaws.com": {
        }
    }
}
{code}
According to the "credsStore" field, the docker engine will invoke the 
"docker-credential-ecr-login" command (which we've installed into /usr/bin/) to 
get registry credentials whenever required, for example when executing docker 
pull/push. 
This works fine when we tar the .docker/config.json and use the uris parameter 
to pull the tar.gz file for every task using a docker image. 

But when I try the new --docker_config option, it doesn't work. The task failed 
to pull the image from ECR. The error message is: 
{code}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock pull 
.dkr.ecr.us-east-1.amazonaws.com/:latest': exited with 
status 1; stderr='WARNING: Error loading config file:/tmp/28Vd2O/.dockercfg - 
The Auth config file is empty unauthorized: authentication required '
{code}

I checked the source at 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
understand why the above error message says it is loading a temp .dockercfg 
file, which doesn't exist, by the way. I assume mesos should pull the image 
using the config.json file I set via --docker_config, right? 


> --docker_config doesn't work with amazon-ecr-credential-helper
> --
>
> Key: MESOS-6169
> URL: https://issues.apache.org/jira/browse/MESOS-6169
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Mao Geng
>Assignee: Gilbert Song
>
> We are using AWS ECR as our docker registry and using 
> https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
> automatically. 
> As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
> like the one below: 
> {code}
> {
> "credsStore": "ecr-login"
> {code}
> According to the "credsStore" field, the docker engine will invoke the 
> "docker-credential-ecr-login" command (which we've installed into /usr/bin/) 
> to get registry credentials whenever required, for example when executing 
> docker pull/push. 
> This works fine when we tar the .docker/config.json and use the uris parameter 
> to pull the tar.gz file for every task using a docker image. 
> But when I try the new --docker_config option, it doesn't work. The task 
> failed to pull the image from ECR. The error message is: 
> {code}
> Failed to launch container: Failed to run 'docker -H 
> unix:///var/run/docker.sock pull 
> .dkr.ecr.us-east-1.amazonaws.com/:latest': exited 
> with status 1; stderr='WARNING: Error loading config 
> file:/tmp/28Vd2O/.dockercfg - The Auth config file is empty unauthorized: 
> authentication required '
> {code}
> I checked the source at 
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
> understand why the above error message says it is loading a temp .dockercfg 
> file, which doesn't exist, by the way. I assume mesos should pull the image 
> using the config.json file I set via --docker_config, right? 

[jira] [Updated] (MESOS-6169) --docker_config doesn't work with amazon-ecr-credential-helper

2016-09-20 Thread Mao Geng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao Geng updated MESOS-6169:

Description: 
We are using AWS ECR as our docker registry and using 
https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
automatically. 

As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
like the one below: 
{code}
{
"credsStore": "ecr-login"
}
{code}
According to the "credsStore" field, the docker engine will invoke the 
"docker-credential-ecr-login" command (which we've installed into /usr/bin/) to 
get registry credentials whenever required, for example when executing docker 
pull/push. 
This works fine when we tar the .docker/config.json and use the uris parameter 
to pull the tar.gz file for every task using a docker image. 

But when I try the new --docker_config option, it doesn't work. The task failed 
to pull the image from ECR. The error message is: 
{code}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock pull 
.dkr.ecr.us-east-1.amazonaws.com/:latest': exited with 
status 1; stderr='WARNING: Error loading config file:/tmp/28Vd2O/.dockercfg - 
The Auth config file is empty unauthorized: authentication required '
{code}

I checked the source at 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
understand why the above error message says it is loading a temp .dockercfg 
file, which doesn't exist, by the way. I assume mesos should pull the image 
using the config.json file I set via --docker_config, right? 

  was:
We are using AWS ECR as our docker registry and using 
https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
automatically. 

As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
like the one below: 
{code}
{
"credsStore": "ecr-login"
{code}
According to the "credsStore" field, the docker engine will invoke the 
"docker-credential-ecr-login" command (which we've installed into /usr/bin/) to 
get registry credentials whenever required, for example when executing docker 
pull/push. 
This works fine when we tar the .docker/config.json and use the uris parameter 
to pull the tar.gz file for every task using a docker image. 

But when I try the new --docker_config option, it doesn't work. The task failed 
to pull the image from ECR. The error message is: 
{code}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock pull 
.dkr.ecr.us-east-1.amazonaws.com/:latest': exited with 
status 1; stderr='WARNING: Error loading config file:/tmp/28Vd2O/.dockercfg - 
The Auth config file is empty unauthorized: authentication required '
{code}

I checked the source at 
https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
understand why the above error message says it is loading a temp .dockercfg 
file, which doesn't exist, by the way. I assume mesos should pull the image 
using the config.json file I set via --docker_config, right? 


> --docker_config doesn't work with amazon-ecr-credential-helper
> --
>
> Key: MESOS-6169
> URL: https://issues.apache.org/jira/browse/MESOS-6169
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Mao Geng
>Assignee: Gilbert Song
>
> We are using AWS ECR as our docker registry and using 
> https://github.com/awslabs/amazon-ecr-credential-helper to get credentials 
> automatically. 
> As amazon-ecr-credential-helper requires, we set a .docker/config.json file 
> like the one below: 
> {code}
> {
> "credsStore": "ecr-login"
> }
> {code}
> According to the "credsStore" field, the docker engine will invoke the 
> "docker-credential-ecr-login" command (which we've installed into /usr/bin/) 
> to get registry credentials whenever required, for example when executing 
> docker pull/push. 
> This works fine when we tar the .docker/config.json and use the uris parameter 
> to pull the tar.gz file for every task using a docker image. 
> But when I try the new --docker_config option, it doesn't work. The task 
> failed to pull the image from ECR. The error message is: 
> {code}
> Failed to launch container: Failed to run 'docker -H 
> unix:///var/run/docker.sock pull 
> .dkr.ecr.us-east-1.amazonaws.com/:latest': exited 
> with status 1; stderr='WARNING: Error loading config 
> file:/tmp/28Vd2O/.dockercfg - The Auth config file is empty unauthorized: 
> authentication required '
> {code}
> I checked the source at 
> https://github.com/apache/mesos/blob/master/src/docker/docker.cpp, but I don't 
> understand why the above error message says it is loading a temp .dockercfg 
> file, which doesn't exist, by the way. I assume mesos should pull the image 
> using the config.json file I set via --docker_config, right? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504388#comment-15504388
 ] 

Stéphane Cottin edited comment on MESOS-6002 at 9/20/16 6:19 AM:
-

Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq and many other docker 
images, all of which delete the same folder in multiple RUN calls.

Update: forgot to mention, this was patched with https://reviews.apache.org/r/51124/ 


was (Author: kaalh):
Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq and many other docker 
images, all of which delete the same folder in multiple RUN calls.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12, 
> or any OS with the aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. This can be verified by the following unit test with 
> the aufs backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated