[jira] [Commented] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041306#comment-15041306
 ] 

Michael Park commented on MESOS-4067:
-

I was able to figure out one issue (I'm not sure whether there are more issues, or 
whether the subsequent failures all stem from this one):

{code}
  // Attempt to unreserve an invalid set of resources (not dynamically
  // reserved), reserve the second set, and launch a task.
  driver.acceptOffers({offer.id()},
                      {UNRESERVE(unreserved1),
                       RESERVE(dynamicallyReserved2),
                       LAUNCH({taskInfo})},
                      filters);

  // Wait for TASK_FINISHED update ack.
  AWAIT_READY(statusUpdateAcknowledgement);
  EXPECT_EQ(TASK_FINISHED, statusUpdateAcknowledgement.get().state());

  // In the next offer, expect to find both sets of reserved
  // resources, since the Unreserve operation should fail.
  AWAIT_READY(offers);

  ASSERT_EQ(1u, offers.get().size());
  offer = offers.get()[0];

  EXPECT_TRUE(
      Resources(offer.resources()).contains(
          dynamicallyReserved1 +
          dynamicallyReserved2 +
          unreserved2));
{code}

The intention here seems to be: perform an {{acceptOffers}} with a sequence of 
operations that includes a task launch, wait until the task has finished (and its 
resources have therefore been recovered), and then expect all of the available 
resources to be offered back in a single offer.

The issue is that with {{allocation_interval}} at 50ms, an offer containing the 
currently available resources can be made while the task is still being launched, 
running, etc. This premature offer is picked up by our {{EXPECT_CALL}} for 
{{resourceOffers}}, so we never meet our expectation of receiving an offer with 
{{dynamicallyReserved1 + dynamicallyReserved2 + unreserved2}}.

A few possible approaches, in my preferred order:
# We may not need all of these moving parts; we could possibly use just one set of 
resources instead of three. Refer to 
{{ReservationTest.ReserveAndLaunchThenUnreserve}} for an example.
# Effectively turn periodic allocation off ({{allocation_interval=1000s}}) and use 
{{reviveOffers}} to control the offers manually (see the sketch after this list). 
Refer to {{ReservationEndpointsTest.ReserveAvailableAndOfferedResources}} for an 
example.
# Instead of a simple {{FutureArg<1>(offers)}} as the action for the 
{{EXPECT_CALL}} of {{resourceOffers}}, perhaps we can aggregate the offers 
instead. This one feels like it could get pretty tricky.
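
For illustration, a rough sketch of approach 2, assuming the surrounding test 
fixture ({{CreateMasterFlags()}}, a {{MockScheduler}} {{sched}}, and the 
{{driver}}/{{statusUpdateAcknowledgement}} from the snippet above); this is only 
an illustrative fragment, not the actual test:

{code}
  // Effectively disable periodic batch allocations, so offers are only
  // made when we explicitly revive them.
  master::Flags masterFlags = CreateMasterFlags();
  masterFlags.allocation_interval = Seconds(1000);

  // ... start the master/slave, accept the offer with the
  // UNRESERVE/RESERVE/LAUNCH operations as above ...

  // Wait for the TASK_FINISHED acknowledgement, so the task's resources
  // have been recovered by the allocator.
  AWAIT_READY(statusUpdateAcknowledgement);
  EXPECT_EQ(TASK_FINISHED, statusUpdateAcknowledgement.get().state());

  // Expect exactly one offer, triggered by the revive below, which should
  // now contain all of the recovered resources.
  Future<vector<Offer>> offers;
  EXPECT_CALL(sched, resourceOffers(&driver, _))
    .WillOnce(FutureArg<1>(&offers));

  driver.reviveOffers();

  AWAIT_READY(offers);
  ASSERT_EQ(1u, offers.get().size());
{code}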

[~greggomann], [~jieyu] What are your thoughts?

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-4067:

Shepherd:   (was: Michael Park)

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Michael Park (JIRA)
Michael Park created MESOS-4067:
---

 Summary: ReservationTest.ACLMultipleOperations is flaky
 Key: MESOS-4067
 URL: https://issues.apache.org/jira/browse/MESOS-4067
 Project: Mesos
  Issue Type: Bug
Reporter: Michael Park


Observed from the CI: 
https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3965) Ensure resources in `QuotaInfo` protobuf do not contain `role`

2015-12-04 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041484#comment-15041484
 ] 

Alexander Rukletsov commented on MESOS-3965:


Why not use validation for {{QuotaInfo}} instead? Or do you mean we should 
leverage existing {{internal::master::quota::validation::quotaInfo()}} to 
ensure {{Quota.QuotaInfo}} is always valid?

> Ensure resources in `QuotaInfo` protobuf do not contain `role`
> --
>
> Key: MESOS-3965
> URL: https://issues.apache.org/jira/browse/MESOS-3965
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> The {{QuotaInfo}} protobuf currently stores per-role quotas, including 
> {{Resource}} objects. These resources are neither statically nor dynamically 
> reserved, hence they must not contain the {{role}} field. We should ensure this 
> field is unset, as well as update the validation routine for {{QuotaInfo}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4057) RegistryClientTest suite fails reliably in optimized build

2015-12-04 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041394#comment-15041394
 ] 

Benjamin Bannier commented on MESOS-4057:
-

Prompted by the discussion in MESOS-4055, I tried to reproduce this but could not. 
I guess we can resolve this as {{CANNOT_REPRO}}.

I tried compiling & linking the libraries and tests with different optimization 
settings, but that wasn't the cause.

> RegistryClientTest suite fails reliably in optimized build
> --
>
> Key: MESOS-4057
> URL: https://issues.apache.org/jira/browse/MESOS-4057
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>
> Under ubuntu14.04 building 5c0e4dc using gcc-4.8.4-2ubuntu1~14.04 with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimized
> {code}
> all six tests from the {{RegistryClientTest}} suite fail with SIGSEGV. The 
> full list of failing tests is
> {code}
> RegistryClientTest.SimpleGetToken
> RegistryClientTest.BadTokenResponse
> RegistryClientTest.SimpleGetManifest
> RegistryClientTest.SimpleGetBlob
> RegistryClientTest.BadRequest
> RegistryClientTest.SimpleRegistryPuller
> {code}
> The failure messages are similar, e.g.,
> {code}
> [ RUN  ] RegistryClientTest.BadTokenResponse
> *** Aborted at 1449146245 (unix time) try "date -d @1449146245" if you are 
> using GNU date ***
> PC: @ 0x7f1c5c5ba6ad (unknown)
> *** SIGSEGV (@0xa24888) received by PID 21542 (TID 0x7f1c61f24800) from PID 
> 10635400; stack trace: ***
> @ 0x7f1c5be35340 (unknown)
> @ 0x7f1c5c5ba6ad (unknown)
> @ 0x7f1c5c61932f (unknown)
> @  0x14067aa Try<>::~Try()
> @  0x1406ab0 SSLTest::setup_server()
> @  0x140869b 
> mesos::internal::tests::RegistryClientTest::getServer()
> @  0x13f315a 
> mesos::internal::tests::RegistryClientTest_BadTokenResponse_Test::TestBody()
> @  0x14ec3b0 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x14e728a 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14c8993 testing::Test::Run()
> @  0x14c9116 testing::TestInfo::Run()
> @  0x14c975c testing::TestCase::Run()
> @  0x14cfea4 testing::internal::UnitTestImpl::RunAllTests()
> @  0x14ecfd5 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x14e7e00 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14cec40 testing::UnitTest::Run()
> @   0xd045c4 RUN_ALL_TESTS()
> @   0xd041b1 main
> @ 0x7f1c5ba81ec5 (unknown)
> @   0x930bb9 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4055) SSL-related test fail reliably in optimized build

2015-12-04 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041392#comment-15041392
 ] 

Benjamin Bannier commented on MESOS-4055:
-

I cannot reproduce this myself; I guess we can resolve this as {{CANNOT_REPRO}}.

I tried compiling & linking the libraries and tests with different optimization 
settings, but that wasn't the cause.

> SSL-related test fail reliably in optimized build
> -
>
> Key: MESOS-4055
> URL: https://issues.apache.org/jira/browse/MESOS-4055
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess, test
>Affects Versions: 0.26.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>
> Under ubuntu14.04 building {{5c0e4dc}} using {{gcc-4.8.4-2ubuntu1~14.04}} with
> {code}
> % ../configure --enable-ssl --enable-libevent --enable-optimize
> {code}
> most SSL-related tests fail reliably with SIGSEGV. The full list of failing 
> tests is
> {code}
> SSL.Disabled
> SSLTest.BasicSameProcess
> SSLTest.SSLSocket
> SSLTest.NonSSLSocket
> SSLTest.NoVerifyBadCA
> SSLTest.RequireBadCA
> SSLTest.VerifyBadCA
> SSLTest.VerifyCertificate
> SSLTest.RequireCertificate
> SSLTest.ProtocolMismatch
> SSLTest.ValidDowngrade
> SSLtest.NoValidDowngrade
> SSLTest.NoValidDowngrade
> SSLTest.ValidDowngradeEachProtocol
> SSLTest.NoValidDowngradeEachProtocol
> SSLTest.PeerAddress
> SSLTest.HTTPSGet
> SSLTest.HTTPSPost
> {code}
> The tests fail with {{SIGSEGV}} or for similarly worrisome reasons, e.g.,
> {code}
> [ RUN  ] SSLTest.SSLSocket
> *** Aborted at 1449135851 (unix time) try "date -d @1449135851" if you are 
> using GNU date ***
> PC: @   0x4418f4 Try<>::~Try()
> *** SIGSEGV (@0x5acce6) received by PID 29976 (TID 0x7fe601eb5780) from PID 
> 5950694; stack trace: ***
> @ 0x7fe601a9a340 (unknown)
> @   0x4418f4 Try<>::~Try()
> @   0x5a843c SSLTest::setup_server()
> @   0x595162 SSLTest_SSLSocket_Test::TestBody()
> @   0x5f2428 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ec880 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5cd0ff testing::Test::Run()
> @   0x5cd882 testing::TestInfo::Run()
> @   0x5cdec8 testing::TestCase::Run()
> @   0x5d4610 testing::internal::UnitTestImpl::RunAllTests()
> @   0x5f3203 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @   0x5ed5f4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0x5d33ac testing::UnitTest::Run()
> @   0x40fd70 main
> @ 0x7fe600024ec5 (unknown)
> @   0x413eb1 (unknown)
> Segmentation fault
> {code}
> Even though we do not typically release optimized builds we should still look 
> into these as optimizations tend to expose fragile constructs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.

2015-12-04 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-4025:
---

Assignee: Jan Schlicht

> SlaveRecoveryTest/0.GCExecutor is flaky.
> 
>
> Key: MESOS-4025
> URL: https://issues.apache.org/jira/browse/MESOS-4025
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Till Toenshoff
>Assignee: Jan Schlicht
>  Labels: flaky, flaky-test, test
>
> Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based 
> on 0.26.0-rc1.
> Testsuite was run as root.
> {noformat}
> sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1
> {noformat}
> {noformat}
> [ RUN  ] SlaveRecoveryTest/0.GCExecutor
> I1130 16:49:16.336833  1032 exec.cpp:136] Version: 0.26.0
> I1130 16:49:16.345212  1049 exec.cpp:210] Executor registered on slave 
> dde9fd4e-b016-4a99-9081-b047e9df9afa-S0
> Registered executor on ubuntu14
> Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114
> sh -c 'sleep 1000'
> Forked command at 1057
> ../../src/tests/mesos.cpp:779: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave':
>  Device or resource busy
> *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are 
> using GNU date ***
> PC: @  0x1443e9a testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; 
> stack trace: ***
> @ 0x7f1be92b80b7 os::Linux::chained_handler()
> @ 0x7f1be92bc219 JVM_handle_linux_signal
> @ 0x7f1bf7bbc340 (unknown)
> @  0x1443e9a testing::UnitTest::AddTestPartResult()
> @  0x1438b99 testing::internal::AssertHelper::operator=()
> @   0xf0b3bb 
> mesos::internal::tests::ContainerizerTest<>::TearDown()
> @  0x1461882 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145c6f8 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x143de4a testing::Test::Run()
> @  0x143e584 testing::TestInfo::Run()
> @  0x143ebca testing::TestCase::Run()
> @  0x1445312 testing::internal::UnitTestImpl::RunAllTests()
> @  0x14624a7 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145d26e 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14440ae testing::UnitTest::Run()
> @   0xd15cd4 RUN_ALL_TESTS()
> @   0xd158c1 main
> @ 0x7f1bf7808ec5 (unknown)
> @   0x913009 (unknown)
> {noformat}
> My Vagrantfile generator:
> {noformat}
> #!/usr/bin/env bash
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-" >
> # vi: set ft=ruby :
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>   config.vm.box = "bento/ubuntu-14.04"
>   config.vm.hostname = "${PLATFORM_NAME}"
>   config.vm.provider "virtualbox" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
> vb.customize ["modifyvm", :id, "--nictype1", "virtio"]
> vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
> vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
>   end
>   config.vm.provider "vmware_fusion" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
>   end
>   config.vm.provision "file", source: "../test.sh", destination: "~/test.sh"
>   config.vm.provision "shell", inline: <<-SHELL
> sudo apt-get update
> sudo apt-get -y install openjdk-7-jdk autoconf libtool
> sudo apt-get -y install build-essential python-dev python-boto  \
> libcurl4-nss-dev libsasl2-dev maven \
> libapr1-dev libsvn-dev libssl-dev libevent-dev
> sudo apt-get -y install git
> sudo wget -qO- https://get.docker.com/ | sh
>   SHELL
> end
> EOF
> {noformat}
> The problem is kicking in frequently in my tests - I'd say > 10% but less 
> than 50%.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2179) ExamplesTest.NoExecutorFramework terminates with segmentation fault

2015-12-04 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-2179:
--
Labels: flaky mesosphere  (was: flaky)

> ExamplesTest.NoExecutorFramework terminates with segmentation fault
> ---
>
> Key: MESOS-2179
> URL: https://issues.apache.org/jira/browse/MESOS-2179
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
> Environment: Centos7 inside Docker
> Mesos master commit: 49d4553a0645624179f17ed6da8d2443e88998bf
>Reporter: Cody Maloney
>Priority: Minor
>  Labels: flaky, mesosphere
>
> {code}
> [ RUN  ] ExamplesTest.NoExecutorFramework
> ../../src/tests/script.cpp:83: Failure
> Failed
> no_executor_framework_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.NoExecutorFramework (2543 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.

2015-12-04 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041534#comment-15041534
 ] 

Jan Schlicht commented on MESOS-4025:
-

Comparing the {{HealthCheckTest.ROOT_DOCKER_*}} tests with e.g. the 
{{DockerContainerizerTest.ROOT_DOCKER_*}} tests, the difference is that the 
ones in {{HealthCheckTest}} don't use mocks and futures to ensure the 
termination of the started container. Not doing this probably leaves some 
artifacts (in the form of locked cgroups) that aren't cleaned up when 
{{GCExecutor}} runs, which results in the crash during test tear down.
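
For reference, a minimal sketch of the mock/future pattern used by the 
{{DockerContainerizerTest}} cases to wait for container termination (the 
containerizer/variable names and the exact {{wait()}} signature here are 
assumptions, not a quote of the real test):

{code}
  // Obtain a future that is satisfied once the containerizer reports the
  // container as terminated.
  Future<containerizer::Termination> termination =
    dockerContainerizer.wait(containerId);

  driver.killTask(task.task_id());

  // Block until the container is actually gone before the fixture's
  // TearDown() tries to destroy the cgroup hierarchy.
  AWAIT_READY(termination);
{code}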

> SlaveRecoveryTest/0.GCExecutor is flaky.
> 
>
> Key: MESOS-4025
> URL: https://issues.apache.org/jira/browse/MESOS-4025
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Till Toenshoff
>Assignee: Jan Schlicht
>  Labels: flaky, flaky-test, test
>
> Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based 
> on 0.26.0-rc1.
> Testsuite was run as root.
> {noformat}
> sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1
> {noformat}
> {noformat}
> [ RUN  ] SlaveRecoveryTest/0.GCExecutor
> I1130 16:49:16.336833  1032 exec.cpp:136] Version: 0.26.0
> I1130 16:49:16.345212  1049 exec.cpp:210] Executor registered on slave 
> dde9fd4e-b016-4a99-9081-b047e9df9afa-S0
> Registered executor on ubuntu14
> Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114
> sh -c 'sleep 1000'
> Forked command at 1057
> ../../src/tests/mesos.cpp:779: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave':
>  Device or resource busy
> *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are 
> using GNU date ***
> PC: @  0x1443e9a testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; 
> stack trace: ***
> @ 0x7f1be92b80b7 os::Linux::chained_handler()
> @ 0x7f1be92bc219 JVM_handle_linux_signal
> @ 0x7f1bf7bbc340 (unknown)
> @  0x1443e9a testing::UnitTest::AddTestPartResult()
> @  0x1438b99 testing::internal::AssertHelper::operator=()
> @   0xf0b3bb 
> mesos::internal::tests::ContainerizerTest<>::TearDown()
> @  0x1461882 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145c6f8 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x143de4a testing::Test::Run()
> @  0x143e584 testing::TestInfo::Run()
> @  0x143ebca testing::TestCase::Run()
> @  0x1445312 testing::internal::UnitTestImpl::RunAllTests()
> @  0x14624a7 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145d26e 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14440ae testing::UnitTest::Run()
> @   0xd15cd4 RUN_ALL_TESTS()
> @   0xd158c1 main
> @ 0x7f1bf7808ec5 (unknown)
> @   0x913009 (unknown)
> {noformat}
> My Vagrantfile generator:
> {noformat}
> #!/usr/bin/env bash
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-" >
> # vi: set ft=ruby :
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>   config.vm.box = "bento/ubuntu-14.04"
>   config.vm.hostname = "${PLATFORM_NAME}"
>   config.vm.provider "virtualbox" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
> vb.customize ["modifyvm", :id, "--nictype1", "virtio"]
> vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
> vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
>   end
>   config.vm.provider "vmware_fusion" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
>   end
>   config.vm.provision "file", source: "../test.sh", destination: "~/test.sh"
>   config.vm.provision "shell", inline: <<-SHELL
> sudo apt-get update
> sudo apt-get -y install openjdk-7-jdk autoconf libtool
> sudo apt-get -y install build-essential python-dev python-boto  \
> libcurl4-nss-dev libsasl2-dev maven \
> libapr1-dev libsvn-dev libssl-dev libevent-dev
> sudo apt-get -y install git
> sudo wget -qO- https://get.docker.com/ | sh
>   SHELL
> end
> EOF
> {noformat}
> The problem is kicking in frequently in my tests - I'd say > 10% but less 
> than 50%.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.

2015-12-04 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-4025:

Sprint: Mesosphere Sprint 23

> SlaveRecoveryTest/0.GCExecutor is flaky.
> 
>
> Key: MESOS-4025
> URL: https://issues.apache.org/jira/browse/MESOS-4025
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Till Toenshoff
>Assignee: Jan Schlicht
>  Labels: flaky, flaky-test, test
>
> Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based 
> on 0.26.0-rc1.
> Testsuite was run as root.
> {noformat}
> sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1
> {noformat}
> {noformat}
> [ RUN  ] SlaveRecoveryTest/0.GCExecutor
> I1130 16:49:16.336833  1032 exec.cpp:136] Version: 0.26.0
> I1130 16:49:16.345212  1049 exec.cpp:210] Executor registered on slave 
> dde9fd4e-b016-4a99-9081-b047e9df9afa-S0
> Registered executor on ubuntu14
> Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114
> sh -c 'sleep 1000'
> Forked command at 1057
> ../../src/tests/mesos.cpp:779: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave':
>  Device or resource busy
> *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are 
> using GNU date ***
> PC: @  0x1443e9a testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; 
> stack trace: ***
> @ 0x7f1be92b80b7 os::Linux::chained_handler()
> @ 0x7f1be92bc219 JVM_handle_linux_signal
> @ 0x7f1bf7bbc340 (unknown)
> @  0x1443e9a testing::UnitTest::AddTestPartResult()
> @  0x1438b99 testing::internal::AssertHelper::operator=()
> @   0xf0b3bb 
> mesos::internal::tests::ContainerizerTest<>::TearDown()
> @  0x1461882 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145c6f8 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x143de4a testing::Test::Run()
> @  0x143e584 testing::TestInfo::Run()
> @  0x143ebca testing::TestCase::Run()
> @  0x1445312 testing::internal::UnitTestImpl::RunAllTests()
> @  0x14624a7 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145d26e 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14440ae testing::UnitTest::Run()
> @   0xd15cd4 RUN_ALL_TESTS()
> @   0xd158c1 main
> @ 0x7f1bf7808ec5 (unknown)
> @   0x913009 (unknown)
> {noformat}
> My Vagrantfile generator:
> {noformat}
> #!/usr/bin/env bash
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-" >
> # vi: set ft=ruby :
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>   config.vm.box = "bento/ubuntu-14.04"
>   config.vm.hostname = "${PLATFORM_NAME}"
>   config.vm.provider "virtualbox" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
> vb.customize ["modifyvm", :id, "--nictype1", "virtio"]
> vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
> vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
>   end
>   config.vm.provider "vmware_fusion" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
>   end
>   config.vm.provision "file", source: "../test.sh", destination: "~/test.sh"
>   config.vm.provision "shell", inline: <<-SHELL
> sudo apt-get update
> sudo apt-get -y install openjdk-7-jdk autoconf libtool
> sudo apt-get -y install build-essential python-dev python-boto  \
> libcurl4-nss-dev libsasl2-dev maven \
> libapr1-dev libsvn-dev libssl-dev libevent-dev
> sudo apt-get -y install git
> sudo wget -qO- https://get.docker.com/ | sh
>   SHELL
> end
> EOF
> {noformat}
> The problem is kicking in frequently in my tests - I'd say > 10% but less 
> than 50%.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.

2015-12-04 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041534#comment-15041534
 ] 

Jan Schlicht edited comment on MESOS-4025 at 12/4/15 1:01 PM:
--

Comparing the {{HealthCheckTest.ROOT_DOCKER_\*}} tests with e.g. the 
{{DockerContainerizerTest.ROOT_DOCKER_\*}} tests, the difference is that the 
ones in {{HealthCheckTest}} don't use mocks and futures to ensure the 
termination of the started container. Not doing this probably leaves some 
artifacts (in the form of locked cgroups) that aren't cleaned up when 
{{GCExecutor}} runs, which results in the crash during test tear down.


was (Author: nfnt):
Comparing {{HealthCheckTest.ROOT_DOCKER_*}} tests with e.g. 
{{DockerContainerizerTest.ROOT_DOCKER_*}} tests, the difference is that the 
ones in {{HealthCheckTest}} don't use mocks and futures to ensure the 
termination of the started container. Not doing this probably leaves some 
artifacts (in form of locked cgroups) that aren't cleaned up when 
{{GCExecutor}} runs and result in the crash during tear down of the test.

> SlaveRecoveryTest/0.GCExecutor is flaky.
> 
>
> Key: MESOS-4025
> URL: https://issues.apache.org/jira/browse/MESOS-4025
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Till Toenshoff
>Assignee: Jan Schlicht
>  Labels: flaky, flaky-test, test
>
> Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based 
> on 0.26.0-rc1.
> Testsuite was run as root.
> {noformat}
> sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1
> {noformat}
> {noformat}
> [ RUN  ] SlaveRecoveryTest/0.GCExecutor
> I1130 16:49:16.336833  1032 exec.cpp:136] Version: 0.26.0
> I1130 16:49:16.345212  1049 exec.cpp:210] Executor registered on slave 
> dde9fd4e-b016-4a99-9081-b047e9df9afa-S0
> Registered executor on ubuntu14
> Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114
> sh -c 'sleep 1000'
> Forked command at 1057
> ../../src/tests/mesos.cpp:779: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave':
>  Device or resource busy
> *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are 
> using GNU date ***
> PC: @  0x1443e9a testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; 
> stack trace: ***
> @ 0x7f1be92b80b7 os::Linux::chained_handler()
> @ 0x7f1be92bc219 JVM_handle_linux_signal
> @ 0x7f1bf7bbc340 (unknown)
> @  0x1443e9a testing::UnitTest::AddTestPartResult()
> @  0x1438b99 testing::internal::AssertHelper::operator=()
> @   0xf0b3bb 
> mesos::internal::tests::ContainerizerTest<>::TearDown()
> @  0x1461882 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145c6f8 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x143de4a testing::Test::Run()
> @  0x143e584 testing::TestInfo::Run()
> @  0x143ebca testing::TestCase::Run()
> @  0x1445312 testing::internal::UnitTestImpl::RunAllTests()
> @  0x14624a7 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x145d26e 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x14440ae testing::UnitTest::Run()
> @   0xd15cd4 RUN_ALL_TESTS()
> @   0xd158c1 main
> @ 0x7f1bf7808ec5 (unknown)
> @   0x913009 (unknown)
> {noformat}
> My Vagrantfile generator:
> {noformat}
> #!/usr/bin/env bash
> cat << EOF > Vagrantfile
> # -*- mode: ruby -*-" >
> # vi: set ft=ruby :
> Vagrant.configure(2) do |config|
>   # Disable shared folder to prevent certain kernel module dependencies.
>   config.vm.synced_folder ".", "/vagrant", disabled: true
>   config.vm.box = "bento/ubuntu-14.04"
>   config.vm.hostname = "${PLATFORM_NAME}"
>   config.vm.provider "virtualbox" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
> vb.customize ["modifyvm", :id, "--nictype1", "virtio"]
> vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
> vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
>   end
>   config.vm.provider "vmware_fusion" do |vb|
> vb.memory = ${VAGRANT_MEM}
> vb.cpus = ${VAGRANT_CPUS}
>   end
>   config.vm.provision "file", source: "../test.sh", destination: "~/test.sh"
>   config.vm.provision "shell", inline: <<-SHELL
> sudo apt-get update
> sudo apt-get -y install openjdk-7-jdk autoconf libtool
> sudo apt-get -y install build-essential python-dev python-boto  \
> 

[jira] [Commented] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041794#comment-15041794
 ] 

Joseph Wu commented on MESOS-4067:
--

What about pausing the clock and manually controlling when the 
{{allocation_interval}} passes?
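
For illustration, a hedged sketch of what that could look like with the 
libprocess {{Clock}} (the {{masterFlags}}/{{offers}} names are assumed from the 
existing test):

{code}
  Clock::pause();

  // ... accept the offer with the UNRESERVE/RESERVE/LAUNCH operations and
  // wait for the TASK_FINISHED acknowledgement ...

  // Only now let the allocation interval elapse, so the next batch
  // allocation sees all of the recovered resources at once.
  Clock::advance(masterFlags.allocation_interval);
  Clock::settle();

  AWAIT_READY(offers);

  Clock::resume();
{code}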

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3718) Implement Quota support in allocator

2015-12-04 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038424#comment-15038424
 ] 

Joris Van Remoortere edited comment on MESOS-3718 at 12/4/15 5:47 PM:
--

{code}
commit ffa8392d71b13f5c65e64acbf433db3bb9257f31
Author: Alexander Rukletsov 
Date:   Fri Dec 4 12:33:30 2015 -0500

Quota: Filtered revocable resources out of quotaRoleSorter in allocator.

Review: https://reviews.apache.org/r/40821
commit b226be0bcce0ab45334ad58ac56ea72d6af6995d
Author: Alexander Rukletsov 
Date:   Thu Dec 3 11:45:33 2015 -0500

Quota: Updated allocate() in the hierarchical allocator.

Quota is satisfied in a separate loop over agents. A running total is
maintained as an exit criterion for the WDRF allocation stage.

Precursory version: https://reviews.apache.org/r/39401/

Review: https://reviews.apache.org/r/40551
{code}


was (Author: jvanremoortere):
{code}
commit b226be0bcce0ab45334ad58ac56ea72d6af6995d
Author: Alexander Rukletsov 
Date:   Thu Dec 3 11:45:33 2015 -0500

Quota: Updated allocate() in the hierarchical allocator.

Quota is satisfied in a separate loop over agents. A running total is
maintained as an exit criterion for the WDRF allocation stage.

Precursory version: https://reviews.apache.org/r/39401/

Review: https://reviews.apache.org/r/40551
{code}

> Implement Quota support in allocator
> 
>
> Key: MESOS-3718
> URL: https://issues.apache.org/jira/browse/MESOS-3718
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> The built-in Hierarchical DRF allocator should support Quota. This includes 
> (but is not limited to): adding, updating, removing, and satisfying quota; 
> avoiding both overcommitting resources and handing them to non-quota'ed roles 
> in the presence of master failover.
> A [design doc for Quota support in 
> Allocator|https://issues.apache.org/jira/browse/MESOS-2937] provides an 
> overview of a feature set required to be implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4059) Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters

2015-12-04 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041812#comment-15041812
 ] 

Joris Van Remoortere commented on MESOS-4059:
-

{code}
commit fe4be25fa6011787751547b06f70676fd79bb87b
Author: Neil Conway 
Date:   Fri Dec 4 11:54:18 2015 -0500

Fixed flakiness in MasterMaintenanceTest.InverseOffersFilters.

There were two problems:

(1) After launching two tasks, we assumed that we would see TASK_RUNNING
updates for the tasks in the same order they were launched. This is
not guaranteed, so adjust the test to handle TASK_RUNNING updates in
the order they are received.

(2) The test used this pattern:

Mesos m;
Call c;

m.send(c);
Clock::settle();
// Trigger a new batch allocation that reflects the call
Clock::advance();

However, this is actually unsafe (see MESOS-3760): the send() call
might not have reached the master by the time `Clock::settle()`
happens. This was fixed by blocking using `FUTURE_DISPATCH` on the
downstream logic in the allocator that is invoked to handle the
delivered event.

Review: https://reviews.apache.org/r/40935
{code}
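
For context, the safer pattern described in (2) looks roughly like the following 
(the allocator method used with {{FUTURE_DISPATCH}} and the {{masterFlags}} name 
are only illustrative assumptions, not the exact code from the fix):

{code}
  // Intercept the dispatch the master issues to the allocator while
  // processing the call, so we know the call has actually been handled.
  Future<Nothing> allocatorEvent =
    FUTURE_DISPATCH(_, &MesosAllocatorProcess::recoverResources);

  m.send(c);

  // Block on the downstream allocator dispatch instead of relying on
  // Clock::settle() alone...
  AWAIT_READY(allocatorEvent);

  // ...and only then trigger a new batch allocation.
  Clock::settle();
  Clock::advance(masterFlags.allocation_interval);
{code}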

> Investigate remaining flakiness in MasterMaintenanceTest.InverseOffersFilters
> -
>
> Key: MESOS-4059
> URL: https://issues.apache.org/jira/browse/MESOS-4059
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Minor
>  Labels: flaky-test, mesosphere
>
> Per comments in MESOS-3916, the fix for that issue decreased the degree of 
> flakiness, but it seems that some intermittent test failures still occur and 
> should be investigated.
> *Flakiness in task acknowledgment*
> {code}
> I1203 18:25:04.609817 28732 status_update_manager.cpp:392] Received status 
> update acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> W1203 18:25:04.610076 28732 status_update_manager.cpp:762] Unexpected status 
> update acknowledgement (received 6afd012e-8e88-41b2-8239-a9b852d07ca1, 
> expecting 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for update TASK_RUNNING 
> (UUID: 82fc7a7b-e64a-4f4d-ab74-76abac42b4e6) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-
> E1203 18:25:04.610339 28736 slave.cpp:2339] Failed to handle status update 
> acknowledgement (UUID: 6afd012e-8e88-41b2-8239-a9b852d07ca1) for task 
> 26305fdd-edb0-4764-8b8a-2558f2b2d81b of framework 
> c7900911-cc7a-4dde-92e7-48fe82cddd9e-: Duplicate acknowledgemen
> {code}
> This is a race between [launching and acknowledging two 
> tasks|https://github.com/apache/mesos/blob/75cb89fa961b249c9ab7fa0f45dfa9d415a5/src/tests/master_maintenance_tests.cpp#L1486-L1517].
>   The status updates for each task are not necessarily received in the same 
> order as launching the tasks.
> *Flakiness in first inverse offer filter*
> See [this comment in 
> MESOS-3916|https://issues.apache.org/jira/browse/MESOS-3916?focusedCommentId=15027478=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15027478]
>  for the explanation.  The related logs are above the comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3981) Implement recovery in the Hierarchical allocator

2015-12-04 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039580#comment-15039580
 ] 

Joris Van Remoortere edited comment on MESOS-3981 at 12/4/15 4:44 PM:
--

{code}
commit 5107e68e7bfb0e8fb420445107c371b0190de235
Author: Alexander Rukletsov 
Date:   Fri Dec 4 11:32:16 2015 -0500

Quota: Properly initialized the sorter for quota'ed roles in the allocator.

Addendum to b9762afeb851662fc6122e177abf1c1a4f4921ae.

Review: https://reviews.apache.org/r/40961
commit b9762afeb851662fc6122e177abf1c1a4f4921ae
Author: Alexander Rukletsov 
Date:   Thu Dec 3 20:28:04 2015 -0500

Quota: Properly initialized the allocator sorter for quota'ed roles.

Review: https://reviews.apache.org/r/40795

commit 313591942cce80a0c8b3b91ef115ed5295d2b891
Author: Alexander Rukletsov 
Date:   Thu Dec 3 20:22:55 2015 -0500

Quota: Implemented recovery in hierarchical allocator.

Review: https://reviews.apache.org/r/40332
{code}


was (Author: jvanremoortere):
{code}
commit b9762afeb851662fc6122e177abf1c1a4f4921ae
Author: Alexander Rukletsov 
Date:   Thu Dec 3 20:28:04 2015 -0500

Quota: Properly initialized the allocator sorter for quota'ed roles.

Review: https://reviews.apache.org/r/40795

commit 313591942cce80a0c8b3b91ef115ed5295d2b891
Author: Alexander Rukletsov 
Date:   Thu Dec 3 20:22:55 2015 -0500

Quota: Implemented recovery in hierarchical allocator.

Review: https://reviews.apache.org/r/40332
{code}

> Implement recovery in the Hierarchical allocator
> 
>
> Key: MESOS-3981
> URL: https://issues.apache.org/jira/browse/MESOS-3981
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> The built-in Hierarchical allocator should implement the recovery (in the 
> presence of quota).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-04 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041718#comment-15041718
 ] 

Bernd Mathiske commented on MESOS-4065:
---

Can you please explain for the rest of us what you are seeing in the log output?

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-04 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041767#comment-15041767
 ] 

Till Toenshoff commented on MESOS-4065:
---

What we see there is that two processes (slave + executor) both hold the same 
open ZooKeeper TCP connection (fd {{10u}}, same socket), which is likely a bug.
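
For background, the usual mitigation for this class of leak is to mark such 
sockets close-on-exec so that forked children (e.g. executors) do not inherit 
them. A generic sketch, not the actual Mesos fix:

{code}
#include <fcntl.h>

// Mark an already-open descriptor close-on-exec so it is not inherited
// across fork()/exec() into child processes such as executors.
int flags = fcntl(fd, F_GETFD);
if (flags != -1) {
  fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
{code}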

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-4067:


Assignee: Greg Mann

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4068) Multiple log coordinators can be elected concurrently

2015-12-04 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042078#comment-15042078
 ] 

Jie Yu commented on MESOS-4068:
---

Looked at the test case. It's possible that coord1 was elected first and then 
demoted by coord3, which got elected later. In that case, {{num_elected}} will be 2.

> Multiple log coordinators can be elected concurrently
> -
>
> Key: MESOS-4068
> URL: https://issues.apache.org/jira/browse/MESOS-4068
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: rep_log_multiple_coord_test_case-1.patch
>
>
> Attached is a test case (reduced from a test case submitted by Michael Maged 
> at IBM).
> Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
> --gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 
> VM. However, it only failed ~once in 12,000 iterations on the Mac OS X 10.10 
> host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3358) Add TaskStatus label decorator hooks for Master

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042276#comment-15042276
 ] 

Niklas Quarfot Nielsen commented on MESOS-3358:
---

[~karya] Ping ^^. Is work on this coming up? If not, we should probably 
unassign ourselves from the ticket until we have a path forward.

> Add TaskStatus label decorator hooks for Master
> ---
>
> Key: MESOS-3358
> URL: https://issues.apache.org/jira/browse/MESOS-3358
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> The hook will be triggered when the Master receives a TaskStatus message from an 
> Agent, or when the Master itself generates a TASK_LOST status. The hook should 
> also provide a list of the previous TaskStatuses to the module.
> The use case is to allow a "cleanup" module to release IPs if an agent is 
> lost. The previous statuses will contain the IP address(es) to be released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-04 Thread Mandeep Chadha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mandeep Chadha updated MESOS-4071:
--
Description: 
Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  
mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  
mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823 
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2
_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Inde
x_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  (nil)  (unknown)
Aborted (core dumped)


  was:
Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a8
65c672fcbb (TEST)) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: total.resources.cont
ains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  mesos::internal::master::allocator::DRFSorter::remove(
)
@ 0x7f2b3d0b8c29  mesos::internal::master::allocator::HierarchicalAlloca
torProcess<>::removeFramework()
@ 0x7f2b3d0ca823  
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2
_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Inde
x_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  (nil)  

[jira] [Commented] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-04 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042388#comment-15042388
 ] 

James Peach commented on MESOS-4071:


[~jvanremoortere] FYI this is the issue we talked about the other day.

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> {code}
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8  google::LogMessage::Fail()
> @ 0x7f2b3dda5327  google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38  google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823 
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2
> _
> @ 0x7f2b3dd2cc35  std::function<>::operator()()
> @ 0x7f2b3dd15ae5  process::ProcessBase::visit()
> @ 0x7f2b3dd188e2  process::DispatchEvent::visit()
> @   0x472366  process::ProcessBase::serve()
> @ 0x7f2b3dd1203f  process::ProcessManager::resume()
> @ 0x7f2b3dd061b2  process::internal::schedule()
> @ 0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Inde
> x_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
> @   0x318c2b6470  (unknown)
> @   0x318b2079d1  (unknown)
> @   0x318aae8b5d  (unknown)
> @  (nil)  (unknown)
> Aborted (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-04 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-4071:
---
Description: 
Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

{code}
I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  
mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  
mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823 
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2
_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Inde
x_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  (nil)  (unknown)
Aborted (core dumped)
{code}


  was:
Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
schedulerc8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  
mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  
mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823 
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  

[jira] [Updated] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.cont ains(slaveId))

2015-12-04 Thread Mandeep Chadha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mandeep Chadha updated MESOS-4071:
--
Description: 
Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823  
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  (nil)  (unknown)
Aborted (core dumped)


  was:

Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  

[jira] [Commented] (MESOS-4069) libevent_ssl_socket assertion fails

2015-12-04 Thread Jojy Varghese (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042238#comment-15042238
 ] 

Jojy Varghese commented on MESOS-4069:
--

Adding tcpdump analysis:

Tcp dump file: 
https://drive.google.com/file/d/0B-aoVvwDtYZNWGdWbVFZdUN3R0k/view?usp=sharing

- The trace shows that the slave socket at 192.168.87.237:39287 sent a RST back 
for a long-running streaming download (see captured frame #16959), most likely 
due to the assertion described in the issue.
- The received frames do not show a 0-length read.

Do we understand all the circumstances in which *bufferevent_read* will return 
0?
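
For reference, here is a minimal standalone sketch (plain libevent, not the 
libprocess socket code) of a read callback that treats a 0 return from 
{{bufferevent_read}} (an empty input buffer, e.g. a deferred callback firing 
after the data has already been drained) as a no-op rather than a fatal check:

{code}
// Illustrative sketch only (plain libevent, not libprocess): drain the input
// buffer and treat a zero-length read as "nothing left", not as an error.
#include <event2/buffer.h>
#include <event2/bufferevent.h>

#include <cstdio>

static void readCallback(struct bufferevent* bev, void* /* ctx */)
{
  char data[4096];  // Arbitrary chunk size for this sketch.

  while (true) {
    size_t length = bufferevent_read(bev, data, sizeof(data));
    if (length == 0) {
      break;  // Input buffer is empty; not necessarily a connection error.
    }
    std::printf("received %zu bytes\n", length);
  }
}
{code}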

> libevent_ssl_socket assertion fails 
> 
>
> Key: MESOS-4069
> URL: https://issues.apache.org/jira/browse/MESOS-4069
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
> Environment: ubuntu 14.04
>Reporter: Jojy Varghese
>Assignee: Jojy Varghese
>
> Have been seeing the following socket  receive error frequently:
> {code}
> F1204 11:12:47.301839 54104 libevent_ssl_socket.cpp:245] Check failed: length 
> > 0 
> *** Check failure stack trace: ***
> @ 0x7f73227fe5a6  google::LogMessage::Fail()
> @ 0x7f73227fe4f2  google::LogMessage::SendToLog()
> @ 0x7f73227fdef4  google::LogMessage::Flush()
> @ 0x7f7322800e08  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f73227b93e2  
> process::network::LibeventSSLSocketImpl::recv_callback()
> @ 0x7f73227b9182  
> process::network::LibeventSSLSocketImpl::recv_callback()
> @ 0x7f731cbc75cc  bufferevent_run_deferred_callbacks_locked
> @ 0x7f731cbbdc5d  event_base_loop
> @ 0x7f73227d9ded  process::EventLoop::run()
> @ 0x7f73227a3101  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f73227a305b  std::_Bind_simple<>::operator()()
> @ 0x7f73227a2ff4  std::thread::_Impl<>::_M_run()
> @ 0x7f731e0d1a40  (unknown)
> @ 0x7f731de0a182  start_thread
> @ 0x7f731db3730d  (unknown)
> @  (nil)  (unknown)
> {code}
> In this case this was a HTTP get over SSL. The url being:
> https://dseasb33srnrn.cloudfront.net:443/registry-v2/docker/registry/v2/blobs/sha256/44/44be94a95984bb47dc3a193f59bf8c04d5e877160b745b119278f38753a6f58f/data?Expires=1449259252=Q4CQdr1LbxsiYyVebmetrx~lqDgQfHVkGxpbMM3PoISn6r07DXIzBX6~tl1iZx9uXdfr~5awH8Kxwh-y8b0dTV3mLTZAVlneZlHbhBAX9qbYMd180-QvUvrFezwOlSmX4B3idvo-zK0CarUu3Ev1hbJz5y3olwe2ZC~RXHEwzkQ_=APKAJECH5M7VWIS5YZ6Q
> *Steps to reproduce:*
> 1. Run master
> 2. Run slave from your build directory as  as:
> {code}
>  
> GLOG_v=1;SSL_ENABLED=1;SSL_KEY_FILE=;SSL_CERT_FILE=;sudo
>  -E ./bin/mesos-slave.sh \
>   --master=127.0.0.1:5050 \   
>
>   --executor_registration_timeout=5mins \ 
>
>   --containerizers=mesos  \   
>
>   --isolation=filesystem/linux \  
>
>   --image_providers=DOCKER  \ 
>
>   --docker_puller_timeout=600 \   
>
>   --launcher_dir=$MESOS_BUILD_DIR/src/.libs \ 
>
>   --switch_user="false" \ 
>
>   --docker_puller="registry"  
> {code} 
> 3. Run mesos-execute from your build directory as :
> {code}
> ./src/mesos-execute \ 
>
> --master=127.0.0.1:5050 \ 
>
> --command="uname -a"  \   
>
> --name=test \ 
>
> --docker_image=ubuntu 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3907) PersistentVolumeTest.SlaveRecovery test is flaky

2015-12-04 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-3907:
-

Assignee: Jie Yu

> PersistentVolumeTest.SlaveRecovery test is flaky
> 
>
> Key: MESOS-3907
> URL: https://issues.apache.org/jira/browse/MESOS-3907
> Project: Mesos
>  Issue Type: Bug
> Environment: ASF CI
>Reporter: Vinod Kone
>Assignee: Jie Yu
>
> Looks like the executor didn't re-register in time after the slave restart. The 
> Clock::settle() should've guaranteed that the executor sent its re-registration 
> and the slave processed it, so it's not clear why this race happened.
> {noformat: title=Good Run}
> [ RUN  ] PersistentVolumeTest.SlaveRecovery
> I1112 17:53:29.746028 29676 leveldb.cpp:176] Opened db in 2.430372ms
> I1112 17:53:29.746974 29676 leveldb.cpp:183] Compacted db in 832726ns
> I1112 17:53:29.747094 29676 leveldb.cpp:198] Created db iterator in 24074ns
> I1112 17:53:29.747210 29676 leveldb.cpp:204] Seeked to beginning of db in 
> 2533ns
> I1112 17:53:29.747329 29676 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 786ns
> I1112 17:53:29.747462 29676 replica.cpp:780] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1112 17:53:29.748114 29699 recover.cpp:449] Starting replica recovery
> I1112 17:53:29.748440 29696 recover.cpp:475] Replica is in EMPTY status
> I1112 17:53:29.749609 29708 replica.cpp:676] Replica in EMPTY status received 
> a broadcasted recover request from (5641)@172.17.6.242:48157
> I1112 17:53:29.750352 29702 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1112 17:53:29.750880 29707 recover.cpp:566] Updating replica status to 
> STARTING
> I1112 17:53:29.751597 29705 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 418265ns
> I1112 17:53:29.751699 29705 replica.cpp:323] Persisted replica status to 
> STARTING
> I1112 17:53:29.752048 29710 recover.cpp:475] Replica is in STARTING status
> I1112 17:53:29.752780 29710 replica.cpp:676] Replica in STARTING status 
> received a broadcasted recover request from (5642)@172.17.6.242:48157
> I1112 17:53:29.753223 29705 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1112 17:53:29.753618 29702 recover.cpp:566] Updating replica status to VOTING
> I1112 17:53:29.754117 29696 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 357564ns
> I1112 17:53:29.754144 29696 replica.cpp:323] Persisted replica status to 
> VOTING
> I1112 17:53:29.754236 29696 recover.cpp:580] Successfully joined the Paxos 
> group
> I1112 17:53:29.754380 29696 recover.cpp:464] Recover process terminated
> I1112 17:53:29.755972 29705 master.cpp:367] Master 
> 4104c1dc-cb09-41a0-8f89-339ad511ce2a (bf058f59fe00) started on 
> 172.17.6.242:48157
> I1112 17:53:29.755993 29705 master.cpp:369] Flags at startup: 
> --acls="register_frameworks {
>   principals {
> values: "test-principal"
>   }
>   roles {
> values: "role1"
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/JRCPBA/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" --roles="role1" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/JRCPBA/master" --zk_session_timeout="10secs"
> I1112 17:53:29.756337 29705 master.cpp:414] Master only allowing 
> authenticated frameworks to register
> I1112 17:53:29.756350 29705 master.cpp:419] Master only allowing 
> authenticated slaves to register
> I1112 17:53:29.756358 29705 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/JRCPBA/credentials'
> I1112 17:53:29.756618 29705 master.cpp:458] Using default 'crammd5' 
> authenticator
> I1112 17:53:29.756753 29705 master.cpp:495] Authorization enabled
> I1112 17:53:29.756985 29706 whitelist_watcher.cpp:79] No whitelist given
> I1112 17:53:29.757154 29698 hierarchical.cpp:151] Initialized hierarchical 
> allocator process
> I1112 17:53:29.759389 29700 master.cpp:1606] The newly elected leader is 
> master@172.17.6.242:48157 with id 4104c1dc-cb09-41a0-8f89-339ad511ce2a
> I1112 17:53:29.759423 29700 master.cpp:1619] Elected as the leading master!
> I1112 17:53:29.759443 29700 master.cpp:1379] 

[jira] [Commented] (MESOS-3362) Allow Isolators to advertise "capabilities" via SlaveInfo

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042266#comment-15042266
 ] 

Niklas Quarfot Nielsen commented on MESOS-3362:
---

[~vinodkone] They probably overlap. I think these were higher-level 
capabilities: one isolator (say, one sitting within a module with a 
user/vendor-selected name) may provide one or more capabilities.
Maybe MESOS-2221 is enough for now. [~karya], what do you think?

> Allow Isolators to advertise "capabilities" via SlaveInfo
> -
>
> Key: MESOS-3362
> URL: https://issues.apache.org/jira/browse/MESOS-3362
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> A network-isolator module can thus advertise that it can assign per-container 
> IP and can provide network-isolation.
> The SlaveInfo protobuf will be extended to include "Capabilities" similar to 
> FrameworkInfo::Capabilities.
> The isolator interface needs to be extended to create `info()` that return a 
> `IsolatorInfo` message. The `IsolatorInfo` message can include "Capabilities" 
> to be sent to Frameworks as part of SlaveInfo.
> The Isolator::info() interface will be used by Slave during initialization to 
> compile SlaveInfo::Capabilities.
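
For illustration, a small self-contained sketch (hypothetical types, not the 
proposed Mesos API) of the aggregation step described above, where the slave 
compiles the capabilities advertised by each isolator:

{code}
// Hypothetical sketch: IsolatorInfo here is a stand-in for the proposed
// Isolator::info() result; the slave merges all advertised capabilities.
#include <set>
#include <string>
#include <vector>

struct IsolatorInfo
{
  std::vector<std::string> capabilities;  // e.g. {"ip-per-container"}
};

std::set<std::string> compileCapabilities(const std::vector<IsolatorInfo>& infos)
{
  std::set<std::string> capabilities;
  for (const IsolatorInfo& info : infos) {
    capabilities.insert(info.capabilities.begin(), info.capabilities.end());
  }
  return capabilities;  // Would be copied into SlaveInfo's capabilities.
}
{code}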



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-12-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042325#comment-15042325
 ] 

Benjamin Mahler commented on MESOS-3851:


The fix is committed; it would be great to re-enable the CHECKs in order to 
detect this issue should it still be present:

{noformat}
commit 4201c2c3e5849a01d0a63769404bad03792ae5de
Author: Anand Mazumdar 
Date:   Fri Dec 4 14:15:26 2015 -0800

Linked against the executor in the agent to ensure ordered message delivery.

Previously, we did not `link` against the executor `PID` while
(re)-registering. This might lead to libprocess creating ephemeral
sockets everytime a `send` was invoked. This was leading to races
where messages might appear on the Executor out of order. This change
does a `link` on the executor PID to ensure ordered message delivery.

Review: https://reviews.apache.org/r/40660
{noformat}
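
As an aside, a minimal sketch (assumed libprocess usage, not the actual agent 
change from the review above) of the link-before-send pattern the commit 
message describes:

{code}
// Hedged sketch, not agent code: link() to the remote PID once so libprocess
// keeps a persistent socket, then send() messages over it in order.
#include <string>

#include <process/pid.hpp>
#include <process/process.hpp>

class SenderProcess : public process::Process<SenderProcess>
{
public:
  explicit SenderProcess(const process::UPID& _executor) : executor(_executor) {}

protected:
  void initialize() override
  {
    link(executor);  // Establish (and keep) the connection up front ...

    // ... so subsequent sends reuse the same socket and arrive in order.
    // The message names below are placeholders.
    send(executor, "FirstMessage");
    send(executor, "SecondMessage");
  }

private:
  const process::UPID executor;
};
{code}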

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> After https://reviews.apache.org/r/38900 (i.e. updating CommandExecutor to 
> support rootfs), some tests have been showing frequent crashes due to assert 
> violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race in which the executor receives a 
> {{RunTaskMessage}} before the {{ExecutorRegisteredMessage}}, leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> 

[jira] [Updated] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.contains(slaveId))

2015-12-04 Thread Mandeep Chadha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mandeep Chadha updated MESOS-4071:
--
Summary: Master crash during framework teardown ( Check failed: 
total.resources.contains(slaveId))  (was: Master crash during framework 
teardown ( Check failed: total.resources.cont ains(slaveId)))

> Master crash during framework teardown ( Check failed: 
> total.resources.contains(slaveId))
> -
>
> Key: MESOS-4071
> URL: https://issues.apache.org/jira/browse/MESOS-4071
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Mandeep Chadha
>
> Stack Trace :
> NOTE : Replaced IP address with XX.XX.XX.XX 
> I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
> framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
> (mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
> scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
> I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
> 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
> F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f2b3dda53d8  google::LogMessage::Fail()
> @ 0x7f2b3dda5327  google::LogMessage::SendToLog()
> @ 0x7f2b3dda4d38  google::LogMessage::Flush()
> @ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2b3d3351a1  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f2b3d0b8c29  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f2b3d0ca823  
> _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
> @ 0x7f2b3d0dc8dc  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f2b3dd2cc35  std::function<>::operator()()
> @ 0x7f2b3dd15ae5  process::ProcessBase::visit()
> @ 0x7f2b3dd188e2  process::DispatchEvent::visit()
> @   0x472366  process::ProcessBase::serve()
> @ 0x7f2b3dd1203f  process::ProcessManager::resume()
> @ 0x7f2b3dd061b2  process::internal::schedule()
> @ 0x7f2b3dd63efd  
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
> @ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
> @   0x318c2b6470  (unknown)
> @   0x318b2079d1  (unknown)
> @   0x318aae8b5d  (unknown)
> @  (nil)  (unknown)
> Aborted (core dumped)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042259#comment-15042259
 ] 

Niklas Quarfot Nielsen commented on MESOS-3740:
---

[~tnachen] Sorry for the radio silence. Do you have capacity to take this on or 
to shepherd it?

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside of it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems caused by MESOS-3553 can happen 
> (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).
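
One workaround in the meantime is presumably to set the variable explicitly 
when launching the scheduler container; a hypothetical invocation (the address 
and image name are made up):

{code}
# Hypothetical example: explicitly pass the host's IP so libprocess inside the
# container advertises a reachable address when using --net=host.
docker run --net=host \
  -e LIBPROCESS_IP=10.0.0.5 \
  my-framework-scheduler-image
{code}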



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3485) Make hook execution order deterministic

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042262#comment-15042262
 ] 

Niklas Quarfot Nielsen commented on MESOS-3485:
---

Awesome! I think we can land your patch if we can figure out a way to test it :)

> Make hook execution order deterministic
> ---
>
> Key: MESOS-3485
> URL: https://issues.apache.org/jira/browse/MESOS-3485
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Felix Abecassis
>Assignee: haosdent
>
> Currently, when using multiple hooks of the same type, the execution order is 
> implementation-defined. 
> This is because in src/hook/manager.cpp, the list of available hooks is 
> stored in a {{hashmap}}. A hashmap is probably unnecessary for 
> this task since the number of hooks should remain reasonable. A data 
> structure preserving ordering should be used instead to allow the user to 
> predict the execution order of the hooks. I suggest that the execution order 
> should be the order in which hooks are specified with {{--hooks}} when 
> starting an agent/master.
> This will be useful when combining multiple hooks after MESOS-3366 is done.
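
For what it's worth, a tiny self-contained sketch (hypothetical types, not the 
manager.cpp change itself) of an order-preserving structure along the lines the 
description suggests:

{code}
// Hypothetical sketch: keep hooks in the order they were added (i.e. the
// order given to --hooks) by using a vector of (name, hook) pairs instead of
// a hash map.
#include <string>
#include <utility>
#include <vector>

struct Hook {};  // Stand-in for the real hook type.

class OrderedHooks
{
public:
  void add(const std::string& name, Hook* hook)
  {
    hooks.push_back(std::make_pair(name, hook));
  }

  // Visits hooks in exactly the order they were added.
  template <typename F>
  void forEach(F&& f) const
  {
    for (const std::pair<std::string, Hook*>& entry : hooks) {
      f(entry.first, *entry.second);
    }
  }

private:
  std::vector<std::pair<std::string, Hook*>> hooks;
};
{code}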



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4070) numify() handles negative numbers inconsistently.

2015-12-04 Thread Jie Yu (JIRA)
Jie Yu created MESOS-4070:
-

 Summary: numify() handles negative numbers inconsistently.
 Key: MESOS-4070
 URL: https://issues.apache.org/jira/browse/MESOS-4070
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu


As pointed out by [~neilc] in this review:
https://reviews.apache.org/r/40988

{noformat}
Try num2 = numify("-10");
EXPECT_SOME_EQ(-10, num2);

// TODO(neilc): This is inconsistent with the handling of non-hex numbers.
EXPECT_ERROR(numify("-0x10"));
{noformat}
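
For context, a standalone sketch (not the stout implementation) of parsing that 
treats the sign uniformly for decimal and hex input, which is roughly the 
behavior one would expect:

{code}
// Illustrative sketch only: strtoll() with base 0 handles an optional sign
// and auto-detects a "0x" prefix, so "-10" and "-0x10" parse consistently.
#include <cerrno>
#include <cstdlib>
#include <iostream>
#include <string>

bool parseNumber(const std::string& s, long long* result)
{
  errno = 0;
  char* end = nullptr;
  long long value = std::strtoll(s.c_str(), &end, 0);

  if (errno != 0 || end == s.c_str() || *end != '\0') {
    return false;  // Overflow, empty input, or trailing garbage.
  }

  *result = value;
  return true;
}

int main()
{
  long long value = 0;
  std::cout << parseNumber("-10", &value) << " " << value << std::endl;    // 1 -10
  std::cout << parseNumber("-0x10", &value) << " " << value << std::endl;  // 1 -16
  return 0;
}
{code}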



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4070) numify() handles negative numbers inconsistently.

2015-12-04 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4070:
--
Labels: tech-debt  (was: )

> numify() handles negative numbers inconsistently.
> -
>
> Key: MESOS-4070
> URL: https://issues.apache.org/jira/browse/MESOS-4070
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Jie Yu
>  Labels: tech-debt
>
> As pointed out by [~neilc] in this review:
> https://reviews.apache.org/r/40988
> {noformat}
> Try num2 = numify("-10");
> EXPECT_SOME_EQ(-10, num2);
> // TODO(neilc): This is inconsistent with the handling of non-hex numbers.
> EXPECT_ERROR(numify("-0x10"));
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4071) Master crash during framework teardown ( Check failed: total.resources.cont ains(slaveId))

2015-12-04 Thread Mandeep Chadha (JIRA)
Mandeep Chadha created MESOS-4071:
-

 Summary: Master crash during framework teardown ( Check failed: 
total.resources.cont ains(slaveId))
 Key: MESOS-4071
 URL: https://issues.apache.org/jira/browse/MESOS-4071
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.25.0
Reporter: Mandeep Chadha



Stack Trace :

NOTE : Replaced IP address with XX.XX.XX.XX 

I1204 10:31:03.391127 2588810 master.cpp:5564] Processing TEARDOWN call for 
framework 61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391177 2588810 master.cpp:5576] Removing framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014 
(mloop-coprocesses-183c4999-9ce9-47b2-bc96-a865c672fcbb (TEST)) at 
scheduler-c8ab2103-cf36-40d8-8a2d-a6b69a8fc...@xx.xx.xx.xx:35237
I1204 10:31:03.391337 2588805 hierarchical.hpp:605] Deactivated framework 
61ce62d1-7418-4ae1-aa78-a8ebf75ad502-0014
F1204 10:31:03.395500 2588810 sorter.cpp:233] Check failed: 
total.resources.contains(slaveId)
*** Check failure stack trace: ***
@ 0x7f2b3dda53d8  google::LogMessage::Fail()
@ 0x7f2b3dda5327  google::LogMessage::SendToLog()
@ 0x7f2b3dda4d38  google::LogMessage::Flush()
@ 0x7f2b3dda7a6c  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2b3d3351a1  mesos::internal::master::allocator::DRFSorter::remove()
@ 0x7f2b3d0b8c29  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
@ 0x7f2b3d0ca823  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES6_EEvRKNS_3PIDIT_EEMSA_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESJ_
@ 0x7f2b3d0dc8dc  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS5_11FrameworkIDESA_EEvRKNS0_3PIDIT_EEMSE_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x7f2b3dd2cc35  std::function<>::operator()()
@ 0x7f2b3dd15ae5  process::ProcessBase::visit()
@ 0x7f2b3dd188e2  process::DispatchEvent::visit()
@   0x472366  process::ProcessBase::serve()
@ 0x7f2b3dd1203f  process::ProcessManager::resume()
@ 0x7f2b3dd061b2  process::internal::schedule()
@ 0x7f2b3dd63efd  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f2b3dd63e4d  std::_Bind_simple<>::operator()()
@ 0x7f2b3dd63de6  std::thread::_Impl<>::_M_run()
@   0x318c2b6470  (unknown)
@   0x318b2079d1  (unknown)
@   0x318aae8b5d  (unknown)
@  (nil)  (unknown)
Aborted (core dumped)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3397) sorter.cpp: Check failed: total.resources.contains(slaveId)

2015-12-04 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042402#comment-15042402
 ] 

Joris Van Remoortere commented on MESOS-3397:
-

As per a discussion on IRC and related to MESOS-4071, this is very likely 
caused by resource math deltas triggering different logical branches in the 
code.
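
For a generic illustration of such a delta (plain floating point, not the 
actual sorter code): accumulating fractional resource values can produce a 
result that is no longer exactly equal to the expected total, so an exact 
containment check can take an unexpected branch.

{code}
// Generic floating-point drift example, not Mesos resource math.
#include <iostream>

int main()
{
  const double cpus = 0.1 + 0.2;  // Fractions accumulated separately.

  std::cout.precision(17);
  std::cout << cpus << std::endl;           // 0.30000000000000004
  std::cout << (cpus == 0.3) << std::endl;  // 0: an exact check fails.
  return 0;
}
{code}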

> sorter.cpp: Check failed: total.resources.contains(slaveId)
> ---
>
> Key: MESOS-3397
> URL: https://issues.apache.org/jira/browse/MESOS-3397
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.0
>Reporter: Yan Xu
>
> Observed in production.
> {noformat:title=}
> F0908 23:21:10.635751  6884 sorter.cpp:213] Check failed: 
> total.resources.contains(slaveId)
> *** Check failure stack trace: ***
> @ 0x7f772cdb10bd  google::LogMessage::Fail()
> @ 0x7f772cdb2f04  google::LogMessage::SendToLog()
> @ 0x7f772cdb0cac  google::LogMessage::Flush()
> @ 0x7f772cdb37f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f772c8162d0  
> mesos::internal::master::allocator::DRFSorter::remove()
> @ 0x7f772c6f61bc  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> @ 0x7f772cd61f09  process::ProcessManager::resume()
> @ 0x7f772cd6220f  process::internal::schedule()
> @ 0x7f772ce73610  execute_native_thread_routine
> @ 0x7f772bcb883d  start_thread
> @ 0x7f772b4aafdd  clone
> {noformat}
> This is following a framework removal:
> {noformat:title=}
> I0908 23:21:10.619640  6884 master.cpp:4261] Framework failover timeout, 
> removing framework 20150813-182946-1685138442-5050-58479-0425 (Some 
> Scheduler) at scheduler-3c50e28c-a0f4-4619-8ea0-b786744e6e54@x.y.z.a:33952
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3515) Support Subscribe Call for HTTP based Executors

2015-12-04 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-3515:
--
Fix Version/s: (was: 0.26.0)
   0.27.0

> Support Subscribe Call for HTTP based Executors
> ---
>
> Key: MESOS-3515
> URL: https://issues.apache.org/jira/browse/MESOS-3515
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> We need to add a {{subscribe(...)}} method in {{src/slave/slave.cpp}} to 
> introduce the ability for HTTP based executors to subscribe and then receive 
> events on the persistent HTTP connection. Most of the functionality needed 
> would be similar to {{Master::subscribe}} in {{src/master/master.cpp}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4068) Investigate whether multiple log coordinators can be elected concurrently

2015-12-04 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4068:
---
Summary: Investigate whether multiple log coordinators can be elected 
concurrently  (was: Multiple log coordinators can be elected concurrently)

> Investigate whether multiple log coordinators can be elected concurrently
> -
>
> Key: MESOS-4068
> URL: https://issues.apache.org/jira/browse/MESOS-4068
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: rep_log_multiple_coord_test_case-1.patch
>
>
> Attached is a test case (reduced from a test case submitted by Michael Maged 
> at IBM).
> Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
> --gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 
> VM. However, it only failed ~once in 12,000 iterations on the Mac OS X 10.10 
> host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3361) Update MesosContainerizer to dynamically pick/enable isolators

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042272#comment-15042272
 ] 

Niklas Quarfot Nielsen commented on MESOS-3361:
---

[~karya] Do you still want to work on this?

> Update MesosContainerizer to dynamically pick/enable isolators
> --
>
> Key: MESOS-3361
> URL: https://issues.apache.org/jira/browse/MESOS-3361
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This would allow the frameworks to opt-in/opt-out of network isolation per 
> container. Thus, one can launch some containers with their own IPs while 
> other containers still share the host IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4070) numify() handles negative numbers inconsistently.

2015-12-04 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-4070:
--
Component/s: stout

> numify() handles negative numbers inconsistently.
> -
>
> Key: MESOS-4070
> URL: https://issues.apache.org/jira/browse/MESOS-4070
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Jie Yu
>  Labels: tech-debt
>
> As pointed out by [~neilc] in this review:
> https://reviews.apache.org/r/40988
> {noformat}
> Try num2 = numify("-10");
> EXPECT_SOME_EQ(-10, num2);
> // TODO(neilc): This is inconsistent with the handling of non-hex numbers.
> EXPECT_ERROR(numify("-0x10"));
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042493#comment-15042493
 ] 

Greg Mann commented on MESOS-4067:
--

I've posted a review which fixes this flakiness, as well as two others to 
improve the tests in this file:

https://reviews.apache.org/r/40999/
https://reviews.apache.org/r/41000/
https://reviews.apache.org/r/41001/

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3585) Add a test module for ip-per-container support

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042425#comment-15042425
 ] 

Niklas Quarfot Nielsen commented on MESOS-3585:
---

Hi [~karya]; have you started work on this?

> Add a test module for ip-per-container support
> --
>
> Key: MESOS-3585
> URL: https://issues.apache.org/jira/browse/MESOS-3585
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> With the addition of {{NetworkInfo}} to allow frameworks to request 
> IP-per-container for their tasks, we should add a simple module that mimics 
> the behavior of a real network-isolation module for testing purposes. We can 
> then add this module in {{src/examples}} and write some tests against it.
> This module can also serve as a template module for third-party network 
> isolation providers for building their own network isolator modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4005) Support workdir runtime configuration from image

2015-12-04 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-4005:
---

Assignee: Gilbert Song

> Support workdir runtime configuration from image 
> -
>
> Key: MESOS-4005
> URL: https://issues.apache.org/jira/browse/MESOS-4005
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> We need to support workdir runtime configuration returned from image such as 
> Dockerfile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4004) Support default entrypoint and command runtime config in Mesos containerizer

2015-12-04 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-4004:
---

Assignee: Gilbert Song

> Support default entrypoint and command runtime config in Mesos containerizer
> 
>
> Key: MESOS-4004
> URL: https://issues.apache.org/jira/browse/MESOS-4004
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> We need to use the entrypoint and command runtime configuration returned from 
> image to be used in Mesos containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-1718) Command executor can overcommit the slave.

2015-12-04 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-1718:

Comment: was deleted

(was: [~idownes], yes; it asked the slave to report the executor path to the 
master in my draft RR, so the master builds the executor for command line 
tasks. And I cut some resources from the task's resources for the command line 
executor, e.g. a 1 CPU command line task will use 0.1 CPU for the executor & 
0.9 for the task.

Regarding ExecutorInfo, I'd like to add {{CommandInfo task_command}} into 
{{ExecutorInfo}}; I'd like to build a new executor info for command line tasks, 
where the task's command line info is in {{task_command}} and the command line 
executor info is in {{ExecutorInfo}}.)

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Ian Downes
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2015-12-04 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042586#comment-15042586
 ] 

Klaus Ma commented on MESOS-1718:
-

[~idownes], yes; it asked the slave to report the executor path to the master 
in my draft RR, so the master builds the executor for command line tasks. And I 
cut some resources from the task's resources for the command line executor, 
e.g. a 1 CPU command line task will use 0.1 CPU for the executor & 0.9 for the 
task.

Regarding ExecutorInfo, I'd like to add {{CommandInfo task_command}} into 
{{ExecutorInfo}}; I'd like to build a new executor info for command line tasks, 
where the task's command line info is in {{task_command}} and the command line 
executor info is in {{ExecutorInfo}}.
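
To make the proposed split concrete, a toy sketch (plain arithmetic with 
made-up constants, not the slave.cpp change):

{code}
// Hypothetical illustration: carve the command executor's allowance out of
// the task's request instead of adding it on top (which overcommits).
#include <algorithm>
#include <iostream>

struct Allocation
{
  double taskCpus;
  double executorCpus;
};

Allocation split(double requestedCpus, double executorAllowance = 0.1)
{
  Allocation allocation;
  allocation.executorCpus = std::min(executorAllowance, requestedCpus);
  allocation.taskCpus = requestedCpus - allocation.executorCpus;
  return allocation;  // Total never exceeds what the task requested.
}

int main()
{
  Allocation a = split(1.0);
  std::cout << "task: " << a.taskCpus
            << " executor: " << a.executorCpus << std::endl;  // task: 0.9 executor: 0.1
  return 0;
}
{code}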

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Ian Downes
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2015-12-04 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042587#comment-15042587
 ] 

Klaus Ma commented on MESOS-1718:
-

[~idownes], yes; it asked the slave to report the executor path to the master 
in my draft RR, so the master builds the executor for command line tasks. And I 
cut some resources from the task's resources for the command line executor, 
e.g. a 1 CPU command line task will use 0.1 CPU for the executor & 0.9 for the 
task.

Regarding ExecutorInfo, I'd like to add {{CommandInfo task_command}} into 
{{ExecutorInfo}}; I'd like to build a new executor info for command line tasks, 
where the task's command line info is in {{task_command}} and the command line 
executor info is in {{ExecutorInfo}}.

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Ian Downes
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code: title=}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1807) Disallow executors with cpu only or memory only resources

2015-12-04 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042590#comment-15042590
 ] 

Avinash Sridharan commented on MESOS-1807:
--

Hi Vinod,
 Is this ticket still a newbie ticket, or are we waiting for MESOS-1187 and 
MESOS-1718 to be fixed before we look at this? I was thinking of taking up a 
newbie issue to start off, so I wanted to check whether we can make progress on 
this.

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>  Labels: newbie
> Attachments: Screenshot 2015-07-28 14.40.35.png
>
>
> Currently master allows executors to be launched with either only cpus or 
> only memory but we shouldn't allow that.
> This is because the executor is an actual unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be 
> for that executor when there are no tasks running on it? If no cpu limits are 
> set then it might starve other executors/tasks on the slave violating 
> isolation guarantees. Same goes with memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
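
A minimal sketch (hypothetical helper, not the master's actual validation code) 
of the kind of check the description asks for:

{code}
// Hypothetical sketch: reject an executor that specifies cpus without mem or
// mem without cpus.
#include <string>

struct ExecutorResources
{
  bool hasCpus;
  bool hasMem;
};

// Returns an empty string if valid, otherwise an error message.
std::string validate(const ExecutorResources& resources)
{
  if (resources.hasCpus && !resources.hasMem) {
    return "Executor specifies cpus but no mem";
  }
  if (resources.hasMem && !resources.hasCpus) {
    return "Executor specifies mem but no cpus";
  }
  return "";
}
{code}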



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4068) Multiple log coordinators can be elected concurrently

2015-12-04 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4068:
---
Description: 
Attached is a test case (reduced from a test case submitted by Michael Maged at 
IBM).

Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
--gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 VM. 
However, it only failed ~once in 12,000 iterations on the Mac OS X 10.10 host.

  was:
Attached is a test case (reduced from a test case submitted by Michael Maged at 
IBM).

Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
--gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 VM. 
However, it passes about 15,000 iterations on the Mac OS 10.10 host.


> Multiple log coordinators can be elected concurrently
> -
>
> Key: MESOS-4068
> URL: https://issues.apache.org/jira/browse/MESOS-4068
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Attached is a test case (reduced from a test case submitted by Michael Maged 
> at IBM).
> Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
> --gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 
> VM. However, it only failed ~once in 12,000 iterations on the Mac OS X 10.10 
> host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4068) Multiple log coordinators can be elected concurrently

2015-12-04 Thread Neil Conway (JIRA)
Neil Conway created MESOS-4068:
--

 Summary: Multiple log coordinators can be elected concurrently
 Key: MESOS-4068
 URL: https://issues.apache.org/jira/browse/MESOS-4068
 Project: Mesos
  Issue Type: Bug
  Components: replicated log
Reporter: Neil Conway
Assignee: Neil Conway


Attached is a test case (reduced from a test case submitted by Michael Maged at 
IBM).

Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
--gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 VM. 
However, it passes about 15,000 iterations on the Mac OS 10.10 host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4068) Multiple log coordinators can be elected concurrently

2015-12-04 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-4068:
---
Attachment: rep_log_multiple_coord_test_case-1.patch

> Multiple log coordinators can be elected concurrently
> -
>
> Key: MESOS-4068
> URL: https://issues.apache.org/jira/browse/MESOS-4068
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
> Attachments: rep_log_multiple_coord_test_case-1.patch
>
>
> Attached is a test case (reduced from a test case submitted by Michael Maged 
> at IBM).
> Running {{mesos-tests --gtest_filter="CoordinatorTest.SimulationDriver" 
> --gtest_repeat=1000}} consistently fails in < 100 iterations on a CentOS 7 
> VM. However, it only failed ~once in 12,000 iterations on the Mac OS X 10.10 
> host.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky

2015-12-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3586:
-
Shepherd: Bernd Mathiske

> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and 
> CGROUPS_ROOT_SlaveRecovery are flaky
> 
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Debian 8, gcc 4.9.2
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and 
> software configurations. 
> After performing {{../configure}}, {{make}}, and {{make check}}, some servers 
> completed successfully and others failed on the test {{[ RUN  ] 
> MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}.
> Is there something I should check in this test? 
> {code}
> PERFORMED MAKE CHECK NODE-001
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 
> 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> PERFORMED MAKE CHECK NODE-002
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 
> 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= 
> (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 
> 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky

2015-12-04 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4047:
-
Shepherd: Bernd Mathiske

> MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
> ---
>
> Key: MESOS-4047
> URL: https://issues.apache.org/jira/browse/MESOS-4047
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.26.0
> Environment: Ubuntu 14, gcc 4.8.4
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: flaky, flaky-test
>
> {code:title=Output from passed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:14.319327  5062 exec.cpp:134] Version: 0.27.0
> I1202 11:09:14.17  5079 exec.cpp:208] Executor registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Registered executor on ubuntu
> Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 5085
> I1202 11:09:14.391739  5077 exec.cpp:254] Received reconnect request from 
> slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> I1202 11:09:14.398598  5082 exec.cpp:231] Executor re-registered on slave 
> bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0
> Re-registered executor on ubuntu
> Shutting down
> Sending SIGTERM to process tree at pid 5085
> Killing the following process trees:
> [ 
> -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done 
>  \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp 
> ]
> [   OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms)
> {code}
> {code:title=Output from failed test}
> [--] 1 test from MemoryPressureMesosTest
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> I1202 11:09:15.509950  5109 exec.cpp:134] Version: 0.27.0
> I1202 11:09:15.568183  5123 exec.cpp:208] Executor registered on slave 
> 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> Registered executor on ubuntu
> Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6
> Forked command at 5132
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> I1202 11:09:15.665498  5129 exec.cpp:254] Received reconnect request from 
> slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0
> I1202 11:09:15.670995  5123 exec.cpp:381] Executor asked to shutdown
> Shutting down
> Sending SIGTERM to process tree at pid 5132
> ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure
> (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913
> *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are 
> using GNU date ***
> {code}
> Notice that in the failed test, the executor is asked to shutdown when it 
> tries to reconnect to the agent.
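
A minimal sketch of one way to guard against that race, assuming the usual Mesos 
test helpers ({{FUTURE_PROTOBUF}}, {{AWAIT_READY}}) and the 
{{ReregisterExecutorMessage}} protobuf apply to this test; this illustrates the 
idea, not the actual fix:

{code}
// Sketch only: expect the executor's re-registration before doing anything
// that could tear the executor down.
Future<ReregisterExecutorMessage> reregisterExecutorMessage =
  FUTURE_PROTOBUF(ReregisterExecutorMessage(), _, _);

// ... restart the slave as the test already does ...

// Block until the executor has actually re-registered with the agent, so a
// subsequent shutdown cannot race with the reconnect.
AWAIT_READY(reregisterExecutorMessage);
{code}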



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.

2015-12-04 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042001#comment-15042001
 ] 

Ian Downes commented on MESOS-1718:
---

To expand on [~vi...@twitter.com]'s earlier comment with some thoughts: the 
master could choose the command-line executor, and choose resources for it, 
without knowing the actual path; the path would be filled in by the slave at 
runtime. This only requires the assumption that the slave always has a command 
executor available and knows its location (which may differ between slaves). 
This would consolidate much of the logic and properly account for the command 
executor's resource usage at the master, rather than relying on hacks at the 
slave.
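
A rough sketch of that division of labor, using hypothetical helper names 
({{chooseCommandExecutor}}, {{resolveCommandExecutor}}) and placeholder resource 
values; it is only meant to illustrate the idea, not to propose actual code:

{code}
#include <string>

#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

#include <stout/path.hpp>

using mesos::ExecutorInfo;
using mesos::Resources;
using mesos::TaskInfo;

// Master side: pick the command executor and account for its resources up
// front, leaving the binary's location unresolved.
ExecutorInfo chooseCommandExecutor(const TaskInfo& task)
{
  ExecutorInfo executor;
  executor.mutable_executor_id()->set_value(task.task_id().value());

  // Placeholder command; the slave substitutes its local path at launch time.
  executor.mutable_command()->set_value("mesos-executor");

  // Accounted for by the master, so the slave no longer overcommits.
  executor.mutable_resources()->CopyFrom(
      Resources::parse("cpus:0.1;mem:32").get());

  return executor;
}

// Slave side: resolve the executor's actual location at runtime, which may
// differ between slaves.
void resolveCommandExecutor(
    ExecutorInfo* executor,
    const std::string& launcherDir)
{
  executor->mutable_command()->set_value(
      path::join(launcherDir, "mesos-executor"));
}
{code}

The point is only the split: the master makes the choice and does the resource 
accounting, while path resolution stays slave-local.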

> Command executor can overcommit the slave.
> --
>
> Key: MESOS-1718
> URL: https://issues.apache.org/jira/browse/MESOS-1718
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Ian Downes
>
> Currently we give a small amount of resources to the command executor, in 
> addition to resources used by the command task:
> https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
> {code:title=src/slave/slave.cpp}
> ExecutorInfo Slave::getExecutorInfo(
> const FrameworkID& frameworkId,
> const TaskInfo& task)
> {
>   ...
> // Add an allowance for the command executor. This does lead to a
> // small overcommit of resources.
> executor.mutable_resources()->MergeFrom(
> Resources::parse(
>   "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
>   "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
>   ...
> }
> {code}
> This leads to an overcommit of the slave. Ideally, for command tasks we can 
> "transfer" all of the task resources to the executor at the slave / isolation 
> level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042038#comment-15042038
 ] 

Greg Mann commented on MESOS-4067:
--

Ah, sorry [~kaysoky]! I didn't see your comment when I posted that last one. 
Yep, I think that pausing the clock is the best and most obvious solution here 
:-) thanks!
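
For reference, a minimal sketch of the clock-pausing approach, assuming the 
libprocess {{Clock}} helpers and the variables already present in the test body 
({{statusUpdateAcknowledgement}}, {{offers}}, {{masterFlags}}); a sketch of the 
idea rather than the final fix:

{code}
using process::Clock;

// Pause the clock so the allocator only runs when the test says so.
Clock::pause();

// ... accept the offer with the UNRESERVE/RESERVE/LAUNCH operations as the
// test already does, then wait for the TASK_FINISHED acknowledgement ...
AWAIT_READY(statusUpdateAcknowledgement);
Clock::settle();

// Trigger exactly one allocation now that the task's resources have been
// recovered, and expect a single offer containing everything.
Clock::advance(masterFlags.allocation_interval);
Clock::settle();

AWAIT_READY(offers);

Clock::resume();
{code}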

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4067) ReservationTest.ACLMultipleOperations is flaky

2015-12-04 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042038#comment-15042038
 ] 

Greg Mann edited comment on MESOS-4067 at 12/4/15 7:28 PM:
---

Ah, sorry [~kaysoky]! I didn't see your comment when I posted that last one. 
Yep, I think that pausing the clock is the best solution here :-) thanks!


was (Author: greggomann):
Ah, sorry [~kaysoky]! I didn't see your comment when I posted that last one. 
Yep, I think that pausing the clock is the best and most obvious solution here 
:-) thanks!

> ReservationTest.ACLMultipleOperations is flaky
> --
>
> Key: MESOS-4067
> URL: https://issues.apache.org/jira/browse/MESOS-4067
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>  Labels: flaky, mesosphere
>
> Observed from the CI: 
> https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=ubuntu%3A14.04,label_exp=docker%7C%7CHadoop/1319/changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4069) libevent_ssl_socket assertion fails

2015-12-04 Thread Jojy Varghese (JIRA)
Jojy Varghese created MESOS-4069:


 Summary: libevent_ssl_socket assertion fails 
 Key: MESOS-4069
 URL: https://issues.apache.org/jira/browse/MESOS-4069
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
 Environment: ubuntu 14.04
Reporter: Jojy Varghese
Assignee: Jojy Varghese


Have been seeing the following socket receive error frequently:

{code}
F1204 11:12:47.301839 54104 libevent_ssl_socket.cpp:245] Check failed: length > 0
*** Check failure stack trace: ***
@ 0x7f73227fe5a6  google::LogMessage::Fail()
@ 0x7f73227fe4f2  google::LogMessage::SendToLog()
@ 0x7f73227fdef4  google::LogMessage::Flush()
@ 0x7f7322800e08  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f73227b93e2  process::network::LibeventSSLSocketImpl::recv_callback()
@ 0x7f73227b9182  process::network::LibeventSSLSocketImpl::recv_callback()
@ 0x7f731cbc75cc  bufferevent_run_deferred_callbacks_locked
@ 0x7f731cbbdc5d  event_base_loop
@ 0x7f73227d9ded  process::EventLoop::run()
@ 0x7f73227a3101  _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f73227a305b  std::_Bind_simple<>::operator()()
@ 0x7f73227a2ff4  std::thread::_Impl<>::_M_run()
@ 0x7f731e0d1a40  (unknown)
@ 0x7f731de0a182  start_thread
@ 0x7f731db3730d  (unknown)
@  (nil)  (unknown)

{code}

In this case, it was an HTTP GET over SSL. The URL was:

https://dseasb33srnrn.cloudfront.net:443/registry-v2/docker/registry/v2/blobs/sha256/44/44be94a95984bb47dc3a193f59bf8c04d5e877160b745b119278f38753a6f58f/data?Expires=1449257566=DGURs3liw6L2LxFTK01o4CXc27e5Ol2R3lwKxmxtX7zyeqr3Bs8s8qbroF3T-JXUbY9cH62wNL5Jcu-Su6Rcj~ckS73ACVyzfuuTMZTvtmcBKAxvfUmZKOKGAMqwtLnD-X4GKgDDCywkg67MQBkm5owM-vk6SynFSzjQKpLt~~4_=APKAJECH5M7VWIS5YZ6Q


*Steps to reproduce:*

1. Run the master.
2. Run the slave from your build directory as:

{code}
GLOG_v=1;SSL_ENABLED=1;SSL_KEY_FILE=;SSL_CERT_FILE=;sudo -E ./bin/mesos-slave.sh \
  --master=127.0.0.1:5050 \
  --executor_registration_timeout=5mins \
  --containerizers=mesos \
  --isolation=filesystem/linux \
  --image_providers=DOCKER \
  --docker_puller_timeout=600 \
  --launcher_dir=$MESOS_BUILD_DIR/src/.libs \
  --switch_user="false" \
  --docker_puller="registry"
{code}

3. Run mesos-execute from your build directory as:

{code}
./src/mesos-execute \
  --master=127.0.0.1:5050 \
  --command="uname -a" \
  --name=test \
  --docker_image=ubuntu
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)