[jira] [Created] (MESOS-2530) Alloc-dealloc-mismatch in OsSendfileTest.sendfile
Joerg Schad created MESOS-2530: -- Summary: Alloc-dealloc-mismatch in OsSendfileTest.sendfile Key: MESOS-2530 URL: https://issues.apache.org/jira/browse/MESOS-2530 Project: Mesos Issue Type: Bug Reporter: Joerg Schad Assignee: Joerg Schad GCC's AddressSanitizer stumbled across the following issue (thanks [~tillt]): {noformat}
[--] 1 test from OsSendfileTest
[ RUN ] OsSendfileTest.sendfile
==65404== ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x6030fe40
 #0 0x2b8d6acc99da (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x119da)
 #1 0x52df06 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x52df06)
 #2 0x593e59 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x593e59)
 #3 0x58bd83 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x58bd83)
 #4 0x567561 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x567561)
 #5 0x568049 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x568049)
 #6 0x5688a4 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x5688a4)
 #7 0x56fb6e (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x56fb6e)
 #8 0x595713 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x595713)
 #9 0x58d2e3 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x58d2e3)
 #10 0x56e0bb (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x56e0bb)
 #11 0x4ca74b (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x4ca74b)
 #12 0x2b8d6f385ec4 (/lib/x86_64-linux-gnu/libc-2.19.so+0x21ec4)
0x6030fe40 is located 0 bytes inside of 446-byte region [0x6030fe40,0x6030fffe) allocated by thread T0 here:
 #0 0x2b8d6acc988a (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x1188a)
 #1 0x52dba6 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x52dba6)
 #2 0x593e59 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x593e59)
 #3 0x58bd83 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x58bd83)
 #4 0x567561 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x567561)
 #5 0x568049 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x568049)
 #6 0x5688a4 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x5688a4)
 #7 0x56fb6e (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x56fb6e)
 #8 0x595713 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x595713)
 #9 0x58d2e3 (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x58d2e3)
 #10 0x56e0bb (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x56e0bb)
 #11 0x4ca74b (/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty/stout-tests+0x4ca74b)
 #12 0x2b8d6f385ec4 (/lib/x86_64-linux-gnu/libc-2.19.so+0x21ec4)
==65404== HINT: if you don't care about these warnings you may set ASAN_OPTIONS=alloc_dealloc_mismatch=0
==65404== ABORTING
make[7]: *** [check-local] Error 1
make[7]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty'
make[6]: *** [check-am] Error 2
make[6]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty'
make[5]: *** [check-recursive] Error 1
make[5]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty'
make[4]: *** [check] Error 2
make[4]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess/3rdparty'
make[3]: *** [check-recursive] Error 1
make[3]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty/libprocess'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/mnt/hgfs/till/Development/mesos-private/build/3rdparty'
make: *** [check-recursive] Error 1
{noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
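The report above is the classic {{new[]}}/{{delete}} mismatch: memory allocated with {{operator new[]}} must be released with {{operator delete[]}}. A minimal sketch of the bug class and its fix (illustration only, not the actual stout/OsSendfileTest code):
{code}
// Minimal sketch of the bug ASAN flags above (not the actual test code):
// memory from `new[]` must be freed with `delete[]`.
#include <memory>

int main()
{
  char* buffer = new char[446];  // allocated with operator new[]

  // BUG: would trigger "alloc-dealloc-mismatch (operator new [] vs operator delete)":
  // delete buffer;

  delete[] buffer;               // correct: matches operator new[]

  // The same mistake can hide behind smart pointers:
  // std::unique_ptr<char> wrong(new char[446]);   // destructor calls delete
  std::unique_ptr<char[]> right(new char[446]);    // destructor calls delete[]

  return 0;
}
{code}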
[jira] [Updated] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Kiędyś updated MESOS-2531: - Environment: (was: Mesos #a12242b Marathon #6decf76 java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) System Software Overview: System Version: OS X 10.10.2 (14C109) Kernel Version: Darwin 14.1.0 Secure Virtual Memory: Enabled Time since boot: 13 days 11:02) Libmesos terminates JVM --- Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Reporter: Michał Kiędyś I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS Yosemite and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs sleep command for 120 seconds and scaling that application to ten my Marathon died killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. 
Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Abort trap: 6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2531) Libmesos terminates JVM
Michał Kiędyś created MESOS-2531: Summary: Libmesos terminates JVM Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Environment: Mesos #a12242b Marathon #6decf76 java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) System Software Overview: System Version: OS X 10.10.2 (14C109) Kernel Version: Darwin 14.1.0 Secure Virtual Memory: Enabled Time since boot: 13 days 11:02 Reporter: Michał Kiędyś I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS Yosemite and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs sleep command for 120 seconds and scaling that application to ten my Marathon died killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Abort trap: 6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Kiędyś updated MESOS-2531: - Description: I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS Yosemite and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs sleep command for 120 seconds and scaling that application to ten my Marathon died killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. 
# Abort trap: 6 {noformat} Mesos #a12242b Marathon #6decf76 java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) System Software Overview: System Version: OS X 10.10.2 (14C109) Kernel Version: Darwin 14.1.0 Secure Virtual Memory: Enabled Time since boot: 13 days 11:02 was: I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS Yosemite and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs sleep command for 120 seconds and scaling that application to ten my Marathon died killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))),
[jira] [Issue Comment Deleted] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Kiędyś updated MESOS-2531: - Comment: was deleted (was: Error report file with more information) Libmesos terminates JVM --- Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Reporter: Michał Kiędyś Attachments: hs_err_pid98294.log I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs {{sleep}} command for 120 seconds and scaling that application to ten my Marathon crushed killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. h4. Log {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Abort trap: 6 {noformat} h4. Java java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) h4. System Software Overview - System Version: OS X 10.10.2 (14C109) - Kernel Version: Darwin 14.1.0 - Secure Virtual Memory: Enabled - Time since boot: 13 days 11:02 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michał Kiędyś updated MESOS-2531: - Attachment: hs_err_pid98294.log Error report file with more information Libmesos terminates JVM --- Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Reporter: Michał Kiędyś Attachments: hs_err_pid98294.log I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs {{sleep}} command for 120 seconds and scaling that application to ten my Marathon crushed killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. h4. Log {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Abort trap: 6 {noformat} h4. Java java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) h4. System Software Overview - System Version: OS X 10.10.2 (14C109) - Kernel Version: Darwin 14.1.0 - Secure Virtual Memory: Enabled - Time since boot: 13 days 11:02 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2491) Persist the reservation state on the slave
[ https://issues.apache.org/jira/browse/MESOS-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375991#comment-14375991 ] Michael Park commented on MESOS-2491: - [r32398|https://reviews.apache.org/r/32398/] Persist the reservation state on the slave -- Key: MESOS-2491 URL: https://issues.apache.org/jira/browse/MESOS-2491 Project: Mesos Issue Type: Task Components: master, slave Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal The goal for this task is to persist the reservation state stored on the master on the corresponding slave. The {{needCheckpointing}} predicate is used to capture the condition for which a resource needs to be checkpointed. Currently the only condition is {{isPersistentVolume}}. We'll update this to include dynamically reserved resources. h3. Expected Outcome * The dynamically reserved resources will be persisted on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
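A rough sketch of what the described change amounts to. The helper names and the exact shape of the predicate are assumptions for illustration, not the actual slave code:
{code}
// Illustrative sketch only: extend the checkpointing predicate so that
// dynamically reserved resources are persisted on the slave alongside
// persistent volumes. Helper names are assumed for illustration.
#include <mesos/mesos.hpp>  // generated Resource protobuf

static bool isPersistentVolume(const mesos::Resource& resource)
{
  return resource.has_disk() && resource.disk().has_persistence();
}

static bool isDynamicallyReserved(const mesos::Resource& resource)
{
  // Statically configured resources (--resources flag) carry no ReservationInfo.
  return resource.has_reservation();
}

static bool needCheckpointing(const mesos::Resource& resource)
{
  return isPersistentVolume(resource) || isDynamicallyReserved(resource);
}
{code}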
[jira] [Updated] (MESOS-2475) Add the Resource::ReservationInfo protobuf message
[ https://issues.apache.org/jira/browse/MESOS-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2475: Description: The {{Resource::ReservationInfo}} protobuf message encapsulates information needed to keep track of reservations. It's named {{ReservationInfo}} rather than {{Reservation}} to keep consistency with {{Resource::DiskInfo}}. Here's essentially what it will look like in the end: {code} message ReservationInfo { // If this is set, it means that the resource is reserved for this particular // framework. Otherwise, the resource is reserved for the role. optional FrameworkID framework_id; // Indicates the principal of the operator or framework that created the // reservation. This is used to determine whether this resource can be // unreserved by an operator or a framework by checking the // unreserve ACL. required string principal; // Anyone can set this ID at the time of reservation in order to keep track. optional string id; } // If this is set, this resource was dynamically reserved by an // operator or a framework. Otherwise, this resource was // statically configured by an operator via the --resources flag. optional ReservationInfo reservation; {code} In v1, we'll only need to introduce {{framework_id}}. {{principal}} will be introduced along with the unreserved ACLs and {{id}} may be introduced in the future. was: The {{Resource::ReservationInfo}} protobuf message encapsulates information needed to keep track of reservations. It's named {{ReservationInfo}} rather than {{Reservation}} to keep consistency with {{Resource::DiskInfo}}. Here's essentially what it will look like in the end: {code} message ReservationInfo { // If this is set, it means that the resource is reserved for this particular // framework. Otherwise, the resource is reserved for the role. optional FrameworkID framework_id; // Indicates the principal of the operator or framework that created the // reservation. This is used to determine whether this resource can be // unreserved by an operator or a framework by checking the // unreserve ACL. required string principal; // Anyone can set this ID at the time of reservation in order to keep track. optional string id; } // If this is set, this resource was dynamically reserved by an operator or // a framework. Otherwise, this resource was static configured by an // operator via the --resources flag. optional ReservationInfo reservation; {code} In v1, we'll only need to introduce {{framework_id}}. {{principal}} will be introduced along with the unreserved ACLs and {{id}} may be introduced in the future. Add the Resource::ReservationInfo protobuf message -- Key: MESOS-2475 URL: https://issues.apache.org/jira/browse/MESOS-2475 Project: Mesos Issue Type: Technical task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere The {{Resource::ReservationInfo}} protobuf message encapsulates information needed to keep track of reservations. It's named {{ReservationInfo}} rather than {{Reservation}} to keep consistency with {{Resource::DiskInfo}}. Here's essentially what it will look like in the end: {code} message ReservationInfo { // If this is set, it means that the resource is reserved for this particular // framework. Otherwise, the resource is reserved for the role. optional FrameworkID framework_id; // Indicates the principal of the operator or framework that created the // reservation. 
This is used to determine whether this resource can be // unreserved by an operator or a framework by checking the // unreserve ACL. required string principal; // Anyone can set this ID at the time of reservation in order to keep track. optional string id; } // If this is set, this resource was dynamically reserved by an // operator or a framework. Otherwise, this resource was // statically configured by an operator via the --resources flag. optional ReservationInfo reservation; {code} In v1, we'll only need to introduce {{framework_id}}. {{principal}} will be introduced along with the unreserved ACLs and {{id}} may be introduced in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
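As a usage sketch of the proposed message (field names follow the protobuf snippet above; the helper function itself is hypothetical):
{code}
#include <string>

#include <mesos/mesos.hpp>  // assumes the generated Resource/ReservationInfo

// Hypothetical helper showing how the proposed `reservation` field would be
// populated for a dynamically reserved resource.
mesos::Resource reservedCpus(
    double cpus,
    const std::string& role,
    const std::string& principal)
{
  mesos::Resource resource;
  resource.set_name("cpus");
  resource.set_type(mesos::Value::SCALAR);
  resource.mutable_scalar()->set_value(cpus);
  resource.set_role(role);

  // Presence of `reservation` distinguishes a dynamic reservation from a
  // static one; `principal` is what the unreserve ACLs are checked against.
  resource.mutable_reservation()->set_principal(principal);

  return resource;
}
{code}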
[jira] [Commented] (MESOS-2205) Add user documentation for reservations
[ https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377336#comment-14377336 ] Michael Park commented on MESOS-2205: - [~nnielsen]: For collaboration I was thinking that the comment section on the gist might suffice, but if not I can move it to a Google doc. I wrote it out in markdown because I would like to land it as an arch doc in the repo. I actually don't know which wiki you're referring to here. Anyway, do you think I should move it to a Google doc? Add user documentation for reservations --- Key: MESOS-2205 URL: https://issues.apache.org/jira/browse/MESOS-2205 Project: Mesos Issue Type: Documentation Components: documentation, framework Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Add a user guide for reservations which describes basic usage of them, how ACLs are used to specify who can unreserve whose resources, and a few advanced use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376714#comment-14376714 ] Benjamin Mahler commented on MESOS-2353: [~alex-mesos] Are you planning to add the moves as well? Improve performance of the master's state.json endpoint for large clusters. --- Key: MESOS-2353 URL: https://issues.apache.org/jira/browse/MESOS-2353 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: newbie, twitter The master's state.json endpoint consistently takes a long time to compute the JSON result, for large clusters: {noformat} $ time curl -s -o /dev/null localhost:5050/master/state.json Mon Jan 26 22:38:50 UTC 2015 real 0m13.174s user 0m0.003s sys 0m0.022s {noformat} This can cause the master to get backlogged if there are many state.json requests in flight. Looking at {{perf}} data, it seems most of the time is spent doing memory allocation / de-allocation. This ticket will try to capture any low hanging fruit to speed this up. Possibly we can leverage moves if they are not already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
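To make the "leverage moves" idea concrete, here is a schematic example of the pattern; the types are stand-ins, not the actual master/state.json code:
{code}
#include <string>
#include <utility>
#include <vector>

// Schematic stand-in for whatever JSON model the endpoint uses.
struct JsonArray
{
  std::vector<std::string> values;
};

// Building a large array: moving each intermediate string (and returning the
// array by value, which is eligible for NRVO/implicit move) avoids the
// allocation/copy churn that dominates the perf profile.
JsonArray buildTasks(const std::vector<std::string>& taskNames)
{
  JsonArray array;
  array.values.reserve(taskNames.size());

  for (const std::string& name : taskNames) {
    std::string entry = "{\"name\":\"" + name + "\"}";
    array.values.push_back(std::move(entry));  // move, don't copy
  }

  return array;
}
{code}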
[jira] [Assigned] (MESOS-2528) Symlink the namespace handle with ContainerID for the port mapping isolator.
[ https://issues.apache.org/jira/browse/MESOS-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-2528: - Assignee: Jie Yu Symlink the namespace handle with ContainerID for the port mapping isolator. Key: MESOS-2528 URL: https://issues.apache.org/jira/browse/MESOS-2528 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Assignee: Jie Yu This serves two purposes: 1) Allows us to enter the network namespace using container ID (instead of pid): ip netns exec ContainerID [commands] [args]. 2) Allows us to get container ID for orphan containers during recovery. This will be helpful for solving MESOS-2367. The challenge here is to solve it in a backward compatible way. I propose to create symlinks under /var/run/netns. For example: /var/run/netns/containerid -> /var/run/netns/12345 (12345 is the pid) The old code will only remove the bind mounts and leave the symlinks, which I think is fine since containerid is globally unique (uuid). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
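A rough sketch of the proposed symlinking using plain POSIX calls. The path layout follows the example above; the function name is made up for illustration and error handling is minimal:
{code}
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

#include <unistd.h>  // symlink(), pid_t

// Sketch only: next to the bind-mounted handle /var/run/netns/<pid>, create
// /var/run/netns/<containerId> pointing at it, so the namespace can be
// entered by container ID ("ip netns exec <containerId> ...") and orphaned
// containers can be identified during recovery.
int linkNamespaceHandle(const std::string& containerId, pid_t pid)
{
  const std::string target = "/var/run/netns/" + std::to_string(pid);
  const std::string link = "/var/run/netns/" + containerId;

  if (::symlink(target.c_str(), link.c_str()) != 0) {
    std::cerr << "Failed to symlink " << link << " -> " << target
              << ": " << std::strerror(errno) << std::endl;
    return -1;
  }

  return 0;
}
{code}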
[jira] [Commented] (MESOS-2402) MesosContainerizerDestroyTest.LauncherDestroyFailure is flaky
[ https://issues.apache.org/jira/browse/MESOS-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376431#comment-14376431 ] Vinod Kone commented on MESOS-2402: --- As [~idownes] mentioned, the flags have to be set up properly for exec to function correctly. But that fix has nothing to do with the flakiness. The real issue seems to be that there is a race between the 'containerizer->wait()' future being set to failed and the metric being updated. The thread running the test might check the metric value after the containerizer future has been set but before the metric has been updated (see 'MesosContainerizerProcess::__destroy()'). The fix is to settle the clock to ensure the metric is updated. MesosContainerizerDestroyTest.LauncherDestroyFailure is flaky - Key: MESOS-2402 URL: https://issues.apache.org/jira/browse/MESOS-2402 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Vinod Kone Assignee: Vinod Kone Failed to os::execvpe in childMain. Never seen this one before. {code}
[ RUN ] MesosContainerizerDestroyTest.LauncherDestroyFailure
Using temporary directory '/tmp/MesosContainerizerDestroyTest_LauncherDestroyFailure_QpjQEn'
I0224 18:55:49.326912 21391 containerizer.cpp:461] Starting container 'test_container' for executor 'executor' of framework ''
I0224 18:55:49.332252 21391 launcher.cpp:130] Forked child with pid '23496' for container 'test_container'
ABORT: (src/subprocess.cpp:165): Failed to os::execvpe in childMain
*** Aborted at 1424832949 (unix time) try date -d @1424832949 if you are using GNU date ***
PC: @ 0x2b178c5db0d5 (unknown)
I0224 18:55:49.340955 21392 process.cpp:2117] Dropped / Lost event for PID: scheduler-509d37ac-296f-4429-b101-af433c1800e9@127.0.1.1:39647
I0224 18:55:49.342300 21386 containerizer.cpp:911] Destroying container 'test_container'
*** SIGABRT (@0x3e85bc8) received by PID 23496 (TID 0x2b178f9f0700) from PID 23496; stack trace: ***
 @ 0x2b178c397cb0 (unknown)
 @ 0x2b178c5db0d5 (unknown)
 @ 0x2b178c5de83b (unknown)
 @ 0x87a945 _Abort()
 @ 0x2b1789f610b9 process::childMain()
I0224 18:55:49.391793 21386 containerizer.cpp:1120] Executor for container 'test_container' has exited
I0224 18:55:49.400478 21391 process.cpp:2770] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
tests/containerizer_tests.cpp:485: Failure
Value of: metrics.values[containerizer/mesos/container_destroy_errors]
 Actual: 16-byte object 02-00 00-00 17-2B 00-00 E0-86 0E-04 00-00 00-00
Expected: 1u
Which is: 1
[ FAILED ] MesosContainerizerDestroyTest.LauncherDestroyFailure (89 ms)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
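The "settle the clock" fix relies on libprocess's paused-clock semantics: with the clock paused, Clock::settle() blocks until all pending events have been processed, so the metric update in __destroy() has happened before the test reads it. A minimal, self-contained illustration of that pattern (not the actual containerizer test; assumes a standard libprocess build):
{code}
#include <process/clock.hpp>
#include <process/dispatch.hpp>
#include <process/process.hpp>

using process::Clock;

// Minimal illustration of the settle pattern: another actor updates state
// asynchronously; settling the paused clock ensures that update has been
// processed before we assert on it.
class CounterProcess : public process::Process<CounterProcess>
{
public:
  int count = 0;
  void increment() { ++count; }
};

int main()
{
  CounterProcess counter;
  process::spawn(counter);

  Clock::pause();

  process::dispatch(counter, &CounterProcess::increment);

  // Without this, reading `count` races with the dispatched event, just as
  // reading container_destroy_errors races with __destroy() in the test.
  Clock::settle();

  // counter.count is now guaranteed to be 1.

  Clock::resume();
  process::terminate(counter);
  process::wait(counter);

  return 0;
}
{code}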
[jira] [Updated] (MESOS-2528) Symlink the namespace handle with ContainerID for the port mapping isolator.
[ https://issues.apache.org/jira/browse/MESOS-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-2528: -- Sprint: Twitter Mesos Q1 Sprint 5 Story Points: 3 Symlink the namespace handle with ContainerID for the port mapping isolator. Key: MESOS-2528 URL: https://issues.apache.org/jira/browse/MESOS-2528 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Assignee: Jie Yu This serves two purposes: 1) Allows us to enter the network namespace using container ID (instead of pid): ip netns exec ContainerID [commands] [args]. 2) Allows us to get container ID for orphan containers during recovery. This will be helpful for solving MESOS-2367. The challenge here is to solve it in a backward compatible way. I propose to create symlinks under /var/run/netns. For example: /var/run/netns/containerid -- /var/run/netns/12345 (12345 is the pid) The old code will only remove the bind mounts and leave the symlinks, which I think is fine since containerid is globally unique (uuid). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2529) fetch hdfs executor failed with sh: hadoop: command not found
[ https://issues.apache.org/jira/browse/MESOS-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375466#comment-14375466 ] Littlestar commented on MESOS-2529: --- /usr/bin/env: bash: No such file or directory === each Mesos slave node has JAVA and a HADOOP DataNode. mesos-master-env.sh and mesos-slave-env.sh have the following settings:
export MESOS_JAVA_HOME=/home/test/jdk
export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin
thanks. fetch hdfs executor failed with sh: hadoop: command not found --- Key: MESOS-2529 URL: https://issues.apache.org/jira/browse/MESOS-2529 Project: Mesos Issue Type: Bug Components: fetcher Affects Versions: 0.21.1 Reporter: Littlestar fetch hdfs executor failed with sh: hadoop: command not found I set HADOOP_HOME and PATH in /etc/profile, but it does not work.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 11:46:41.758134 9312 fetcher.cpp:76] Fetching URI 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz'
I0323 11:46:41.758301 9312 fetcher.cpp:105] Downloading resource from 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' to '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S1/frameworks/20150323-114534-1214949568-5050-12082-/executors/20150323-100710-1214949568-5050-3453-S1/runs/5bb19ef7-483a-4871-aa2a-cb18796775e9/spark-1.3.0-bin-2.4.0.tar.gz'
E0323 11:46:41.762511 9312 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S1/frameworks/20150323-114534-1214949568-5050-12082-/executors/20150323-100710-1214949568-5050-3453-S1/runs/5bb19ef7-483a-4871-aa2a-cb18796775e9/spark-1.3.0-bin-2.4.0.tar.gz'
sh: hadoop: command not found
Failed to fetch: hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz
Failed to synchronize with slave (it's probably exited) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2532) UserCgroupIsolatorTest failures due to: Failed to prepare isolator: cgroup already exists
Benjamin Mahler created MESOS-2532: -- Summary: UserCgroupIsolatorTest failures due to: Failed to prepare isolator: cgroup already exists Key: MESOS-2532 URL: https://issues.apache.org/jira/browse/MESOS-2532 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler This is on a CentOS machine: {code: title=sudo make check -j24 MESOS_VERBOSE=1 GLOG_v=1 GTEST_FILTER=UserCgroupIsolatorTest*} - We cannot run any cgroups tests that require mounting hierarchies because you have the following hierarchies mounted: /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/freezer, /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event We'll disable the CgroupsNoHierarchyTest test fixture for now. - - We cannot run any Docker tests because: Failed to execute 'docker version': exited with status 127 - Note: Google Test filter = UserCgroupIsolatorTest*-DockerContainerizerTest.ROOT_DOCKER_Launch_Executor:DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Update:DockerContainerizerTest.DISABLED_ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy:SlaveCount/Registrar_BENCHMARK_Test.performance/0:SlaveCount/Registrar_BENCHMARK_Test.performance/1:SlaveCount/Registrar_BENCHMARK_Test.performance/2:SlaveCount/Registrar_BENCHMARK_Test.performance/3 [==] Running 3 tests from 3 test cases. [--] Global test environment set-up. 
[--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user mesos.test.unprivileged.user does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup Using temporary directory '/tmp/UserCgroupIsolatorTest_0_ROOT_CGROUPS_UserCgroup_ASJu3B' ../../src/tests/isolator_tests.cpp:1067: Failure (isolator.get()-prepare( containerId, executorInfo, os::getcwd(), UNPRIVILEGED_USERNAME)).failure(): Failed to prepare isolator: cgroup already exists [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (18 ms) [--] 1 test from UserCgroupIsolatorTest/0 (18 ms total) [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess userdel: user mesos.test.unprivileged.user does not exist [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup Using temporary directory '/tmp/UserCgroupIsolatorTest_1_ROOT_CGROUPS_UserCgroup_VIwHI4' ../../src/tests/isolator_tests.cpp:1067: Failure (isolator.get()-prepare( containerId, executorInfo, os::getcwd(), UNPRIVILEGED_USERNAME)).failure(): Failed to prepare isolator: cgroup already exists [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (11 ms) [--] 1 test from UserCgroupIsolatorTest/1 (12 ms total) [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess userdel: user mesos.test.unprivileged.user does not exist [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup Using temporary directory '/tmp/UserCgroupIsolatorTest_2_ROOT_CGROUPS_UserCgroup_Cm2jhz' I0323 20:47:15.297801 2047 perf_event.cpp:71] Creating PerfEvent isolator I0323 20:47:15.312007 2047 perf_event.cpp:109] PerfEvent isolator will profile for 10secs every 1mins for events: { cpu-cycles } I0323 20:47:15.312500 2069 perf_event.cpp:221] Preparing perf event cgroup for container ../../src/tests/isolator_tests.cpp:1067: Failure (isolator.get()-prepare( containerId, executorInfo, os::getcwd(), UNPRIVILEGED_USERNAME)).failure(): Failed to prepare isolator: cgroup already exists [
[jira] [Commented] (MESOS-2532) UserCgroupIsolatorTest failures due to: Failed to prepare isolator: cgroup already exists
[ https://issues.apache.org/jira/browse/MESOS-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376624#comment-14376624 ] Benjamin Mahler commented on MESOS-2532: Looked with [~jieyu], it looks like these tests generate their own non-unique container IDs and these can get left-over after the tests complete: {code} TYPED_TEST(UserCgroupIsolatorTest, ROOT_CGROUPS_UserCgroup) { // ... ContainerID containerId; containerId.set_value(container); {code} {noformat} [bmahler@smfd-atr-11-sr1 build]$ ls -l /sys/fs/cgroup/cpu/mesos/container total 0 -rw-r--r-- 1 root root 0 Jan 8 18:29 cgroup.clone_children --w--w--w- 1 root root 0 Jan 8 18:29 cgroup.event_control -rw-r--r-- 1 root root 0 Jan 8 18:29 cgroup.procs -rw-r--r-- 1 root root 0 Jan 8 18:29 cpu.cfs_period_us -rw-r--r-- 1 root root 0 Jan 8 18:29 cpu.cfs_quota_us -rw-r--r-- 1 root root 0 Jan 8 18:29 cpu.rt_period_us -rw-r--r-- 1 root root 0 Jan 8 18:29 cpu.rt_runtime_us -rw-r--r-- 1 root root 0 Jan 8 18:29 cpu.shares -r--r--r-- 1 root root 0 Jan 8 18:29 cpu.stat -rw-r--r-- 1 root root 0 Jan 8 18:29 notify_on_release -rw-r--r-- 1 root root 0 Jan 8 18:29 tasks {noformat} UserCgroupIsolatorTest failures due to: Failed to prepare isolator: cgroup already exists --- Key: MESOS-2532 URL: https://issues.apache.org/jira/browse/MESOS-2532 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler Labels: twitter This is on a CentOS machine: {code: title=sudo make check -j24 MESOS_VERBOSE=1 GLOG_v=1 GTEST_FILTER=UserCgroupIsolatorTest*} - We cannot run any cgroups tests that require mounting hierarchies because you have the following hierarchies mounted: /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/freezer, /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event We'll disable the CgroupsNoHierarchyTest test fixture for now. - - We cannot run any Docker tests because: Failed to execute 'docker version': exited with status 127 - Note: Google Test filter = UserCgroupIsolatorTest*-DockerContainerizerTest.ROOT_DOCKER_Launch_Executor:DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Update:DockerContainerizerTest.DISABLED_ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy:SlaveCount/Registrar_BENCHMARK_Test.performance/0:SlaveCount/Registrar_BENCHMARK_Test.performance/1:SlaveCount/Registrar_BENCHMARK_Test.performance/2:SlaveCount/Registrar_BENCHMARK_Test.performance/3 [==] Running 3 tests from 3 test cases. [--] Global test environment set-up. 
[--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user mesos.test.unprivileged.user does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup Using temporary directory '/tmp/UserCgroupIsolatorTest_0_ROOT_CGROUPS_UserCgroup_ASJu3B' ../../src/tests/isolator_tests.cpp:1067: Failure (isolator.get()-prepare( containerId, executorInfo, os::getcwd(), UNPRIVILEGED_USERNAME)).failure(): Failed to prepare isolator: cgroup already exists [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (18 ms) [--] 1 test from UserCgroupIsolatorTest/0 (18 ms total) [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess userdel: user mesos.test.unprivileged.user does not exist [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup Using temporary directory
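One way to avoid the collision described in the comment above is to give each test run a unique container ID instead of the shared 'container' value, for example one derived from a random UUID. A sketch only (not necessarily the committed fix):
{code}
#include <mesos/mesos.hpp>   // ContainerID

#include <stout/uuid.hpp>    // UUID::random()

// Sketch: derive the test's container ID from a random UUID so that a cgroup
// left behind by a previous (crashed or unclean) run cannot collide with the
// next run of UserCgroupIsolatorTest.
mesos::ContainerID uniqueContainerId()
{
  mesos::ContainerID containerId;
  containerId.set_value("test_container_" + UUID::random().toString());
  return containerId;
}
{code}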
[jira] [Updated] (MESOS-2528) Symlink the namespace handle with ContainerID for the port mapping isolator.
[ https://issues.apache.org/jira/browse/MESOS-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-2528: -- Labels: twitter (was: ) Symlink the namespace handle with ContainerID for the port mapping isolator. Key: MESOS-2528 URL: https://issues.apache.org/jira/browse/MESOS-2528 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Assignee: Jie Yu Labels: twitter This serves two purposes: 1) Allows us to enter the network namespace using container ID (instead of pid): ip netns exec ContainerID [commands] [args]. 2) Allows us to get container ID for orphan containers during recovery. This will be helpful for solving MESOS-2367. The challenge here is to solve it in a backward compatible way. I propose to create symlinks under /var/run/netns. For example: /var/run/netns/containerid -- /var/run/netns/12345 (12345 is the pid) The old code will only remove the bind mounts and leave the symlinks, which I think is fine since containerid is globally unique (uuid). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2514) Change the default leaf qdisc to fq_codel inside containers
[ https://issues.apache.org/jira/browse/MESOS-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-2514: -- Sprint: Twitter Mesos Q1 Sprint 5 Story Points: 1 Change the default leaf qdisc to fq_codel inside containers --- Key: MESOS-2514 URL: https://issues.apache.org/jira/browse/MESOS-2514 Project: Mesos Issue Type: Bug Reporter: Cong Wang Assignee: Cong Wang Fix For: 0.23.0 When we enable bandwidth cap, htb is used on egress side inside containers, however, the default leaf qdisc for a htb class is still pfifo_fast, which is known to have buffer bloat. Change the default leaf qdisc to fq_codel too: `tc qd add dev eth0 parent 1:1 fq_codel` I can no longer see packet drops after this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-2514) Change the default leaf qdisc to fq_codel inside containers
[ https://issues.apache.org/jira/browse/MESOS-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu resolved MESOS-2514. --- Resolution: Fixed Fix Version/s: 0.23.0 commit d82ec92073b0438589e7aa72e608c3dc334a8dd6 Author: Cong Wang cw...@twopensource.com Date: Mon Mar 23 11:33:09 2015 -0700 Changed default htb leaf qdisc to fq_codel in port mapping isolator. Review: https://reviews.apache.org/r/32219 Change the default leaf qdisc to fq_codel inside containers --- Key: MESOS-2514 URL: https://issues.apache.org/jira/browse/MESOS-2514 Project: Mesos Issue Type: Bug Reporter: Cong Wang Assignee: Cong Wang Fix For: 0.23.0 When we enable bandwidth cap, htb is used on egress side inside containers, however, the default leaf qdisc for a htb class is still pfifo_fast, which is known to have buffer bloat. Change the default leaf qdisc to fq_codel too: `tc qd add dev eth0 parent 1:1 fq_codel` I can no longer see packet drops after this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen closed MESOS-2531. - Resolution: Duplicate Libmesos terminates JVM --- Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Reporter: Michał Kiędyś Attachments: hs_err_pid98294.log I have build Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper runs also on the same computer. Later on I compiled Marathon also using latest version from GitHub, revision #6decf76. Marathon uses same ZooKeeper instance and successfully connects to Mesos cluster. After deploying simple application that runs {{sleep}} command for 120 seconds and scaling that application to ten my Marathon crushed killed by JVM after SIGSEGV in libmesos-0.23.0.dylib. h4. Log {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode bsd-amd64 compressed oops) # Problematic frame: # C [libmesos-0.23.0.dylib+0x7836f7] process::Futuremesos::internal::state::Variable::isFailed() const+0x17 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try ulimit -c unlimited before starting Java again # # An error report file with more information is saved as: # /Users/mkiedys/Downloads/MESOS/marathon/hs_err_pid98294.log # # If you would like to submit a bug report, please visit: # http://bugreport.sun.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Abort trap: 6 {noformat} h4. Java java version 1.8.0 Java(TM) SE Runtime Environment (build 1.8.0-b132) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode) h4. System Software Overview - System Version: OS X 10.10.2 (14C109) - Kernel Version: Darwin 14.1.0 - Secure Virtual Memory: Enabled - Time since boot: 13 days 11:02 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2531) Libmesos terminates JVM
[ https://issues.apache.org/jira/browse/MESOS-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376403#comment-14376403 ] Niklas Quarfot Nielsen commented on MESOS-2531: --- From the stack trace, it looks like the state bug we have traced down (https://issues.apache.org/jira/browse/MESOS-2161), and a fix is in review: https://reviews.apache.org/r/32152/ I will mark this as a duplicate for now. Sorry for the inconvenience, but thanks for reporting the issue! Niklas {code} Stack: [0x00012d21d000,0x00012d29d000], sp=0x00012d29af30, free space=503k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [libmesos-0.23.0.dylib+0x7836f7] process::Future<mesos::internal::state::Variable>::isFailed() const+0x17 C [libmesos-0.23.0.dylib+0x1aa915a] Java_org_apache_mesos_state_AbstractState__1_1fetch_1get_1timeout+0xea J 5808 org.apache.mesos.state.AbstractState.__fetch_get_timeout(JJLjava/util/concurrent/TimeUnit;)Lorg/apache/mesos/state/Variable; (0 bytes) @ 0x00010db89402 [0x00010db89340+0xc2] J 6339 C1 mesosphere.marathon.tasks.TaskTracker$$anonfun$fetchFromState$1.apply()Lorg/apache/mesos/state/Variable; (51 bytes) @ 0x00010dcff17c [0x00010dcfec00+0x57c] J 6338 C1 mesosphere.marathon.tasks.TaskTracker$$anonfun$fetchFromState$1.apply()Ljava/lang/Object; (5 bytes) @ 0x00010dcffaf4 [0x00010dcffa00+0xf4] J 5007 C1 mesosphere.marathon.state.StateMetrics$class.timed(Lmesosphere/marathon/state/StateMetrics;Lcom/codahale/metrics/Histogram;Lcom/codahale/metrics/Meter;Lcom/codahale/metrics/Meter;Lscala/Function0;)Ljava/lang/Object; (48 bytes) @ 0x00010d944744 [0x00010d9445a0+0x1a4] J 5995 C1 mesosphere.marathon.tasks.TaskTracker.fetchFromState(Ljava/lang/String;)Lorg/apache/mesos/state/Variable; (17 bytes) @ 0x00010dc05b9c [0x00010dc05840+0x35c] {code} Libmesos terminates JVM --- Key: MESOS-2531 URL: https://issues.apache.org/jira/browse/MESOS-2531 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0 Reporter: Michał Kiędyś Attachments: hs_err_pid98294.log I have built Mesos from scratch using code available on GitHub, revision #a12242b. My Mesos cluster runs on MacOS and consists of one master and three slaves - all running on the same computer but on different ports. ZooKeeper also runs on the same computer. Later on I compiled Marathon, also using the latest version from GitHub, revision #6decf76. Marathon uses the same ZooKeeper instance and successfully connects to the Mesos cluster. After deploying a simple application that runs the {{sleep}} command for 120 seconds and scaling that application to ten instances, my Marathon crashed: the JVM was killed by a SIGSEGV in libmesos-0.23.0.dylib. h4. Log {noformat} [2015-03-23 15:47:17,872] INFO Computed new deployment plan: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.upgrade.DeploymentPlan$:263) [2015-03-23 15:47:17,876] INFO Deployment acknowledged. 
Waiting to get processed: DeploymentPlan(2015-03-23T14:47:17.823Z, (Step(List(Scale(App(/bar, Some(sleep 120))), 10) (mesosphere.marathon.state.GroupManager:142) [2015-03-23 15:47:17,877] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] PUT /v2/apps//bar HTTP/1.1 200 92 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:17,918] INFO 127.0.0.1 - - [23/mar/2015:14:47:17 +] GET /v2/apps//bar/versions HTTP/1.1 200 68 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,722] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/apps HTTP/1.1 200 592 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) [2015-03-23 15:47:20,782] INFO Received status update for task bar.82501637-d16b-11e4-b7fa-aa4dda3d2dbb: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:149) [2015-03-23 15:47:20,790] INFO 127.0.0.1 - - [23/mar/2015:14:47:20 +] GET /v2/deployments HTTP/1.1 200 256 http://127.0.0.1:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36 (mesosphere.chaos.http.ChaosRequestLog:15) # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00012ec946f7, pid=98294, tid=27651 # # JRE version: Java(TM) SE Runtime Environment
[jira] [Commented] (MESOS-2425) TODO comment in mesos.proto is already implemented
[ https://issues.apache.org/jira/browse/MESOS-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376854#comment-14376854 ] Niklas Quarfot Nielsen commented on MESOS-2425: --- https://reviews.apache.org/r/31637/ TODO comment in mesos.proto is already implemented -- Key: MESOS-2425 URL: https://issues.apache.org/jira/browse/MESOS-2425 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Labels: mesosphere Attachments: mesos-2425-1.diff These lines are redundant in mesos.proto, since CommandInfo is now implemented: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L169-L174 I'm creating a patch with edits on comment lines only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2425) TODO comment in mesos.proto is already implemented
[ https://issues.apache.org/jira/browse/MESOS-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376853#comment-14376853 ] Adam B commented on MESOS-2425: --- Patch uploaded: https://reviews.apache.org/r/31637/diff/# TODO comment in mesos.proto is already implemented -- Key: MESOS-2425 URL: https://issues.apache.org/jira/browse/MESOS-2425 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Labels: mesosphere Attachments: mesos-2425-1.diff These lines are redundant in mesos.proto, since CommandInfo is now implemented: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L169-L174 I'm creating a patch with edits on comment lines only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-2353: -- Assignee: Benjamin Mahler [~alex-mesos] I'll take up the move change. Improve performance of the master's state.json endpoint for large clusters. --- Key: MESOS-2353 URL: https://issues.apache.org/jira/browse/MESOS-2353 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: newbie, twitter The master's state.json endpoint consistently takes a long time to compute the JSON result, for large clusters: {noformat} $ time curl -s -o /dev/null localhost:5050/master/state.json Mon Jan 26 22:38:50 UTC 2015 real 0m13.174s user 0m0.003s sys 0m0.022s {noformat} This can cause the master to get backlogged if there are many state.json requests in flight. Looking at {{perf}} data, it seems most of the time is spent doing memory allocation / de-allocation. This ticket will try to capture any low hanging fruit to speed this up. Possibly we can leverage moves if they are not already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
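On the point above about leveraging moves to cut allocation cost when rendering state.json: the sketch below is a minimal, self-contained illustration of how returning by value and moving entries avoids deep copies when assembling a large JSON-like response. The Value/Array/Object types are stand-ins invented for this example, not the stout JSON types the master actually uses.
{code}
// Minimal sketch: avoiding copies when assembling a large JSON-like response.
// The Value/Array/Object names below are illustrative stand-ins, not the
// actual stout types used by the master's state.json rendering.
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Value {
  std::string data;  // Placeholder for an arbitrary JSON value.
};

struct Array {
  std::vector<Value> values;
};

struct Object {
  std::vector<std::pair<std::string, Array>> fields;
};

// Returning by value lets the compiler apply RVO/move semantics.
Array summarizeTasks(std::size_t count) {
  Array array;
  array.values.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    Value value;
    value.data = "task-" + std::to_string(i);
    array.values.push_back(std::move(value));  // Move, don't copy, each entry.
  }
  return array;
}

Object buildState() {
  Object state;
  // The temporary returned by summarizeTasks() is moved, not copied, into the pair.
  state.fields.emplace_back("tasks", summarizeTasks(100000));
  return state;
}

int main() {
  Object state = buildState();
  return state.fields.size() == 1 ? 0 : 1;
}
{code}
Whether this helps in practice depends on where the profile says the allocations happen; the review linked below is the authoritative change.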
[jira] [Commented] (MESOS-2205) Add user documentation for reservations
[ https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376994#comment-14376994 ] Niklas Quarfot Nielsen commented on MESOS-2205: --- Hi [~mcypark] - how do you want to collaborate (and land) on this? As an arch doc in the repo or somewhere on the wiki? :) Add user documentation for reservations --- Key: MESOS-2205 URL: https://issues.apache.org/jira/browse/MESOS-2205 Project: Mesos Issue Type: Documentation Components: documentation, framework Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Add a user guide for reservations which describes basic usage of them, how ACLs are used to specify who can unreserve whose resources, and few advanced usage cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2165) When cyrus sasl MD5 isn't installed configure passes, tests fail without any output
[ https://issues.apache.org/jira/browse/MESOS-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2165: -- Shepherd: Adam B When cyrus sasl MD5 isn't installed configure passes, tests fail without any output --- Key: MESOS-2165 URL: https://issues.apache.org/jira/browse/MESOS-2165 Project: Mesos Issue Type: Bug Reporter: Cody Maloney Assignee: Till Toenshoff Labels: mesosphere Sample Dockerfile to make such a host: {code} FROM centos:centos7 RUN yum install -y epel-release gcc python-devel RUN yum install -y python-pip RUN yum install -y rpm-build redhat-rpm-config autoconf make gcc gcc-c++ patch libtool git python-devel ruby-devel java-1.7.0-openjdk-devel zlib-devel libcurl-devel openssl-devel cyrus-sasl-devel rubygems apr-devel apr-util-devel subversion-devel maven libselinux-python {code} Use: 'docker run -i -t imagename /bin/bash' to run the image, get a shell inside where you can 'git clone' mesos and build/run the tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2510) Add a function which tests if a JSON object is contained in another JSON object
[ https://issues.apache.org/jira/browse/MESOS-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2510: -- Sprint: Mesosphere Q1 Sprint 6 - 4/3 Add a function which tests if a JSON object is contained in another JSON object -- Key: MESOS-2510 URL: https://issues.apache.org/jira/browse/MESOS-2510 Project: Mesos Issue Type: Wish Components: stout Reporter: Alexander Rojas Assignee: Alexander Rojas It would be nice to check whether one JSON blob is contained in another blob, i.e. given the JSON blob {{a}} and the blob {{b}}, {{a}} contains {{b}} if every key {{x}} in {{b}} is also in {{a}}, and {{b\[x\] == a\[x\]}} if {{b\[x\]}} is not a JSON object itself or, if it is a JSON object, {{a\[x\]}} contains {{b\[x\]}}. h3. Rationale One of the most useful patterns when testing functions which return JSON is to write the expected result and then check whether the expected blob is equal to the returned one: {code} JSON::Value expected = JSON::parse("{ \"key\" : true }").get(); JSON::Value actual = foo(); CHECK_EQ(expected, actual); {code} As can be seen in the example above, it is easy to read what the expected value is, and checking for failures is fairly easy. It is not easy, however, to compare returned blobs which contain at least one random value (for example a timestamp), or a value which is uninteresting for the test. In such cases it is necessary to extract each value separately and compare them: {code} // Returned json: // { // uptime : 45234.123, // key : true // } JSON::Value actual = bar(); // I'm only interested in the key entry. EXPECT_SOME_EQ(true, actual.find<JSON::String>("key")); {code} As seen above, if one is only interested in a subset of the key/value pairs returned by {{bar}}, the readability of the code decreases severely. It would be worse if it weren't for the comments. The aim is to achieve the same level of readability as the first example while covering the case of the second: {code} JSON::Value expected = JSON::parse("{ \"key\" : true }").get(); // Returned json: // { // uptime : 45234.123, // key : true // } JSON::Value actual = bar(); // I'm only interested in the key entry and ignore the rest. EXPECT_TRUE(contains(actual, expected)); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
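For illustration of the containment check proposed above, here is a minimal sketch of a recursive contains() written against nlohmann/json (chosen only so the example is self-contained and runnable); the ticket itself targets stout's JSON types, so the real names and signatures would differ.
{code}
// Sketch of a recursive "contains" check: `a` contains `b` if every key in `b`
// exists in `a` and the values match (recursively for nested objects).
// Uses nlohmann/json purely for illustration; the ticket targets stout's JSON.
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

bool contains(const json& a, const json& b) {
  if (!a.is_object() || !b.is_object()) {
    return a == b;  // Non-objects must simply compare equal.
  }
  for (const auto& [key, value] : b.items()) {
    if (!a.contains(key)) {
      return false;  // A key from `b` is missing in `a`.
    }
    if (value.is_object()) {
      if (!contains(a.at(key), value)) {
        return false;  // Recurse into nested objects.
      }
    } else if (a.at(key) != value) {
      return false;  // Leaf values must be equal.
    }
  }
  return true;
}

int main() {
  json actual = {{"uptime", 45234.123}, {"key", true}};
  json expected = {{"key", true}};
  std::cout << std::boolalpha << contains(actual, expected) << std::endl;  // prints: true
  return 0;
}
{code}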
[jira] [Updated] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2353: --- Sprint: Twitter Mesos Q1 Sprint 5 Improve performance of the master's state.json endpoint for large clusters. --- Key: MESOS-2353 URL: https://issues.apache.org/jira/browse/MESOS-2353 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: newbie, twitter The master's state.json endpoint consistently takes a long time to compute the JSON result, for large clusters: {noformat} $ time curl -s -o /dev/null localhost:5050/master/state.json Mon Jan 26 22:38:50 UTC 2015 real 0m13.174s user 0m0.003s sys 0m0.022s {noformat} This can cause the master to get backlogged if there are many state.json requests in flight. Looking at {{perf}} data, it seems most of the time is spent doing memory allocation / de-allocation. This ticket will try to capture any low hanging fruit to speed this up. Possibly we can leverage moves if they are not already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377036#comment-14377036 ] Benjamin Mahler commented on MESOS-2353: https://reviews.apache.org/r/32419/ Improve performance of the master's state.json endpoint for large clusters. --- Key: MESOS-2353 URL: https://issues.apache.org/jira/browse/MESOS-2353 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: newbie, twitter The master's state.json endpoint consistently takes a long time to compute the JSON result, for large clusters: {noformat} $ time curl -s -o /dev/null localhost:5050/master/state.json Mon Jan 26 22:38:50 UTC 2015 real 0m13.174s user 0m0.003s sys 0m0.022s {noformat} This can cause the master to get backlogged if there are many state.json requests in flight. Looking at {{perf}} data, it seems most of the time is spent doing memory allocation / de-allocation. This ticket will try to capture any low hanging fruit to speed this up. Possibly we can leverage moves if they are not already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-2425) TODO comment in mesos.proto is already implemented
[ https://issues.apache.org/jira/browse/MESOS-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2425: -- Comment: was deleted (was: https://reviews.apache.org/r/31637/) TODO comment in mesos.proto is already implemented -- Key: MESOS-2425 URL: https://issues.apache.org/jira/browse/MESOS-2425 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Labels: mesosphere Attachments: mesos-2425-1.diff These lines are redundant in mesos.proto, since CommandInfo is now implemented: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L169-L174 I'm creating a patch with edits on comment lines only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2533) Support HTTP checks in Mesos health check program
Niklas Quarfot Nielsen created MESOS-2533: - Summary: Support HTTP checks in Mesos health check program Key: MESOS-2533 URL: https://issues.apache.org/jira/browse/MESOS-2533 Project: Mesos Issue Type: Bug Reporter: Niklas Quarfot Nielsen Currently, only commands are supported, but our health check protobuf enables users to encode HTTP checks as well. We should wire this up in the health check program or remove the http field from the protobuf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
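As a rough sketch of what an HTTP check could do once wired up, the example below issues a single GET with libcurl and treats any 2xx status as healthy. The endpoint, timeout, and "healthy means 2xx" rule are assumptions made for illustration, not the semantics of the Mesos health check program.
{code}
// Minimal HTTP health check sketch using libcurl: issue a GET and treat any
// 2xx response code as healthy. Illustrative only; not Mesos' implementation.
#include <curl/curl.h>
#include <iostream>
#include <string>

// Discard the response body; we only care about the status code.
static size_t discard(char*, size_t size, size_t nmemb, void*) {
  return size * nmemb;
}

bool httpHealthCheck(const std::string& url, long timeoutSeconds) {
  CURL* curl = curl_easy_init();
  if (curl == nullptr) {
    return false;
  }

  curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
  curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeoutSeconds);
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard);

  long code = 0;
  bool healthy = false;
  if (curl_easy_perform(curl) == CURLE_OK &&
      curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code) == CURLE_OK) {
    healthy = (code >= 200 && code < 300);
  }

  curl_easy_cleanup(curl);
  return healthy;
}

int main() {
  // Hypothetical local endpoint; a real check would take the URL from the
  // health check protobuf's http field rather than hardcoding it.
  std::cout << (httpHealthCheck("http://127.0.0.1:8080/health", 5)
                    ? "healthy" : "unhealthy")
            << std::endl;
  return 0;
}
{code}
A real implementation would presumably also honor whatever port, path, and acceptable-status information the protobuf carries rather than the fixed 2xx rule used here.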
[jira] [Commented] (MESOS-2534) PerfTest.ROOT_SampleInit test fails.
[ https://issues.apache.org/jira/browse/MESOS-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376980#comment-14376980 ] Ian Downes commented on MESOS-2534: --- https://reviews.apache.org/r/32420/ PerfTest.ROOT_SampleInit test fails. Key: MESOS-2534 URL: https://issues.apache.org/jira/browse/MESOS-2534 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler Assignee: Ian Downes Labels: twitter From MESOS-2300 as well, it looks like this test is not reliable: {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:147: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:150: Failure Expected: (0.0) (statistics.get().task_clock()), {code} It looks like this test samples PID 1, which is either {{init}} or {{systemd}}. Per a chat with [~idownes] this should probably sample something that is guaranteed to be consuming cycles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
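A minimal sketch of the idea suggested above (sample something guaranteed to consume cycles): fork a child that busy-loops and sample that PID instead of PID 1. This is illustrative only and is not the patch under review; the sampling step itself is left as a placeholder.
{code}
// Sketch: instead of sampling PID 1 (which may be idle), spawn a child that
// burns CPU and sample that PID for the test duration. Illustrative only.
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>

pid_t spawnBusyChild() {
  pid_t pid = fork();
  if (pid == 0) {
    // Child: spin forever so perf is guaranteed to observe cycles.
    volatile unsigned long counter = 0;
    while (true) {
      ++counter;
    }
  }
  return pid;  // Parent: return the child's PID to sample (or -1 on failure).
}

int main() {
  pid_t child = spawnBusyChild();
  if (child < 0) {
    return EXIT_FAILURE;
  }

  // Placeholder for the real sampling step, e.g. running
  // `perf stat -p <child>` for a fixed duration and parsing the output.
  std::cout << "would sample pid " << child << std::endl;
  sleep(1);

  kill(child, SIGKILL);        // Clean up the busy child.
  waitpid(child, nullptr, 0);  // Reap it.
  return EXIT_SUCCESS;
}
{code}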
[jira] [Updated] (MESOS-2534) PerfTest.ROOT_SampleInit test fails.
[ https://issues.apache.org/jira/browse/MESOS-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-2534: -- Sprint: Twitter Mesos Q1 Sprint 5 PerfTest.ROOT_SampleInit test fails. Key: MESOS-2534 URL: https://issues.apache.org/jira/browse/MESOS-2534 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler Assignee: Ian Downes Labels: twitter From MESOS-2300 as well, it looks like this test is not reliable: {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:147: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:150: Failure Expected: (0.0) (statistics.get().task_clock()), {code} It looks like this test samples PID 1, which is either {{init}} or {{systemd}}. Per a chat with [~idownes] this should probably sample something that is guaranteed to be consuming cycles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2534) PerfTest.ROOT_SampleInit test fails.
[ https://issues.apache.org/jira/browse/MESOS-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-2534: -- Story Points: 2 PerfTest.ROOT_SampleInit test fails. Key: MESOS-2534 URL: https://issues.apache.org/jira/browse/MESOS-2534 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler Assignee: Ian Downes Labels: twitter From MESOS-2300 as well, it looks like this test is not reliable: {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:147: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:150: Failure Expected: (0.0) (statistics.get().task_clock()), {code} It looks like this test samples PID 1, which is either {{init}} or {{systemd}}. Per a chat with [~idownes] this should probably sample something that is guaranteed to be consuming cycles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2477) Enable Resources::apply to handle reservation operations.
[ https://issues.apache.org/jira/browse/MESOS-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated MESOS-2477: Shepherd: Timothy Chen Enable Resources::apply to handle reservation operations. - Key: MESOS-2477 URL: https://issues.apache.org/jira/browse/MESOS-2477 Project: Mesos Issue Type: Technical task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere {{Resources::apply}} currently only handles {{Create}} and {{Destroy}} operations which exist for persistent volumes. We need to handle the {{Reserve}} and {{Unreserve}} operations for dynamic reservations as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
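For context, here is a deliberately simplified model of what applying {{Reserve}} and {{Unreserve}} could look like: move matching resources between the unreserved pool ('*') and the target role. All types below are stand-ins invented for the example, not the real {{Resources}} or {{Offer::Operation}} classes.
{code}
// Simplified model of applying reservation operations to a resource pool.
// The types below are illustrative stand-ins, not the Mesos Resources API.
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

struct Resource {
  std::string name;   // e.g. "cpus"
  double amount;      // e.g. 2.0
  std::string role;   // "*" means unreserved
};

enum class OperationType { RESERVE, UNRESERVE };

struct Operation {
  OperationType type;
  Resource resource;  // The resource (and target role) the operation refers to.
};

// Apply a single operation: move matching resources between the unreserved
// pool ("*") and the operation's role. Throws if the resources are not there.
void apply(std::vector<Resource>& pool, const Operation& op) {
  const std::string from = (op.type == OperationType::RESERVE) ? "*" : op.resource.role;
  const std::string to   = (op.type == OperationType::RESERVE) ? op.resource.role : "*";

  for (Resource& r : pool) {
    if (r.name == op.resource.name && r.role == from && r.amount >= op.resource.amount) {
      r.amount -= op.resource.amount;
      pool.push_back({op.resource.name, op.resource.amount, to});
      return;
    }
  }
  throw std::runtime_error("operation does not apply to the available resources");
}

int main() {
  std::vector<Resource> pool = {{"cpus", 4.0, "*"}};
  apply(pool, {OperationType::RESERVE, {"cpus", 2.0, "ads"}});    // reserve 2 cpus for 'ads'
  apply(pool, {OperationType::UNRESERVE, {"cpus", 1.0, "ads"}});  // give 1 cpu back to '*'
  for (const Resource& r : pool) {
    std::cout << r.name << ":" << r.amount << " (" << r.role << ")\n";
  }
  return 0;
}
{code}
The real change would also need to validate that the checkpointed/offered resources actually contain what the operation claims, mirroring how Create and Destroy are handled today.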
[jira] [Updated] (MESOS-2476) Enable Resources to handle Resource::ReservationInfo
[ https://issues.apache.org/jira/browse/MESOS-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated MESOS-2476: Shepherd: Timothy Chen Enable Resources to handle Resource::ReservationInfo Key: MESOS-2476 URL: https://issues.apache.org/jira/browse/MESOS-2476 Project: Mesos Issue Type: Technical task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere After [MESOS-2475|https://issues.apache.org/jira/browse/MESOS-2475], our C++ {{Resources}} class needs to know how to handle {{Resource}} protobuf messages that have the {{reservation}} field set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2300) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23
[ https://issues.apache.org/jira/browse/MESOS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376902#comment-14376902 ] Benjamin Mahler commented on MESOS-2300: I've filed MESOS-2534 to address the perf test failure. Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23 --- Key: MESOS-2300 URL: https://issues.apache.org/jira/browse/MESOS-2300 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.1 Environment: (Though the hostname of this box is {{docker1}}, this is not running on a docker container. This box sits on vanilla hardware, and happens to also be used as a docker server. Though not when I ran the offending tests.) {code} huitseeker@docker1:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.10 Release: 14.10 Codename: utopic {code} {code} huitseeker@docker1:~$ uname -a Linux docker1 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:56:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux }} {code} Mesos retrieved from {{http://git-wip-us.apache.org/repos/asf/mesos.git}} And compiled from git tag {{0.21.1}} (currently resolves to {{2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8}}). Box is a clean, ansible-generated Ubuntu with cgmanager disabled, and the following packages installed on top of the usual mesos dependencies: - cgroup-lite (service is enabled and started) - linux-tools-common - linux-tools-generic - linux-cloud-tools-generic - linux-tools-3.16.0-23-generic - linux-cloud-tools-3.16.0-23-generic Reporter: François Garillot Labels: cgroups, test During make check : {code} [--] Global test environment tear-down [==] 503 tests from 89 test cases ran. (387352 ms total) [ PASSED ] 499 tests. [ FAILED ] 4 tests, listed below: [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups [ FAILED ] NsTest.ROOT_setns [ FAILED ] PerfTest.ROOT_SampleInit {code} Details: {code} [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get ../../src/tests/cgroups_tests.cpp:364: Failure Value of: mesos_test2 Expected: cgroups.get()[0] Which is: mesos [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get (10 ms) [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups ../../src/tests/cgroups_tests.cpp:392: Failure Value of: path::join(TEST_CGROUPS_ROOT, 2) Actual: mesos_test/2 Expected: cgroups.get()[0] Which is: mesos_test/1 ../../src/tests/cgroups_tests.cpp:393: Failure Value of: path::join(TEST_CGROUPS_ROOT, 1) Actual: mesos_test/1 Expected: cgroups.get()[1] Which is: mesos_test/2 [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups (12 ms) {code} {code} [ RUN ] NsTest.ROOT_setns ../../src/tests/ns_tests.cpp:123: Failure Value of: status.get().get() Actual: 256 Expected: 0 [ FAILED ] NsTest.ROOT_setns (93 ms) {code} {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:143: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:146: Failure Expected: (0.0) (statistics.get().task_clock()), actual: 0 vs 0 [ FAILED ] PerfTest.ROOT_SampleInit (1078 ms) {code} Those tests have been run in parallel (-j 8) as well as sequentially (-j 1), no difference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2534) PerfTest.ROOT_SampleInit test fails.
Benjamin Mahler created MESOS-2534: -- Summary: PerfTest.ROOT_SampleInit test fails. Key: MESOS-2534 URL: https://issues.apache.org/jira/browse/MESOS-2534 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Benjamin Mahler Assignee: Ian Downes From MESOS-2300 as well, it looks like this test is not reliable: {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:147: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:150: Failure Expected: (0.0) (statistics.get().task_clock()), {code} It looks like this test samples PID 1, which is either {{init}} or {{systemd}}. Per a chat with [~idownes] this should probably sample something that is guaranteed to be consuming cycles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2535) Improve Resources filters to refuse certain roles
Yan Xu created MESOS-2535: - Summary: Improve Resources filters to refuse certain roles Key: MESOS-2535 URL: https://issues.apache.org/jira/browse/MESOS-2535 Project: Mesos Issue Type: Improvement Reporter: Yan Xu We have a certain use case where a framework only uses certain hosts in the cluster (e.g. either because they have some special hardware or just large disks/ram). The way we are currently implementing it is that the slaves on these hosts tag all their resources as belonging to a certain role and the scheduler only uses resources of that role. To make sure the framework plays nicely with other frameworks, it also immediately declines resources of other roles (including the '*' role) indefinitely. The current {{declineOffer()}} API, however, is at the offer level; it is neither at the framework level (where the framework could directly reject offers from all slaves collectively instead of individually) nor at the resource level (where the framework could reject some resources from an offer but not all). The framework-level requirement is less of a problem because rejecting the offers individually achieves the same result, just with more message overhead. The resource-level requirement, though, is hard to achieve with the current declineOffer() API. If the special slaves do not dedicate all their resources to one framework but rather have some resources with a certain role that I want to reject for 5 seconds (due to an idle scheduler) and some other resources that I want to reject forever (e.g. due to a role mismatch), there is no way to do that. Some improvement to the resource filters that solves this would be nice. Such an improvement should make sure that we are still able to revive the offers (or rather, directly remove the filters that are not tied to an offer). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
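To make the current per-offer workaround concrete, below is a sketch of a scheduler callback that declines any offer carrying none of the wanted role's resources with a very long refusal filter, and declines matching-but-unneeded offers only briefly. The types and the declineOffer() stand-in are simplified for illustration and are not the Mesos scheduler driver API.
{code}
// Sketch of the workaround described above: immediately decline any offer
// whose resources are not of the role this framework cares about, using a
// very long refusal filter. Simplified stand-in types, not the Mesos API.
#include <iostream>
#include <string>
#include <vector>

struct Resource {
  std::string name;
  double amount;
  std::string role;
};

struct Offer {
  std::string id;
  std::vector<Resource> resources;
};

struct Filters {
  double refuseSeconds = 5.0;  // Default refusals are short.
};

// Stand-in for the driver's declineOffer() call.
void declineOffer(const Offer& offer, const Filters& filters) {
  std::cout << "declining " << offer.id << " for " << filters.refuseSeconds << "s\n";
}

void resourceOffers(const std::vector<Offer>& offers, const std::string& wantedRole) {
  for (const Offer& offer : offers) {
    bool hasWantedRole = false;
    for (const Resource& r : offer.resources) {
      if (r.role == wantedRole) {
        hasWantedRole = true;
        break;
      }
    }

    if (!hasWantedRole) {
      // Role mismatch: refuse effectively forever so these offers stop coming.
      Filters forever;
      forever.refuseSeconds = 1e9;
      declineOffer(offer, forever);
    } else {
      // Wanted resources but nothing to launch right now: refuse briefly.
      Filters brief;
      declineOffer(offer, brief);
    }
  }
}

int main() {
  resourceOffers({{"offer-1", {{"cpus", 4.0, "*"}}},
                  {"offer-2", {{"cpus", 8.0, "analytics"}}}},
                 "analytics");
  return 0;
}
{code}
As the ticket notes, this works only per offer: there is no way to refuse some resources in an offer forever while refusing the rest for just a few seconds.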
[jira] [Updated] (MESOS-1831) Master should send PingSlaveMessage instead of PING
[ https://issues.apache.org/jira/browse/MESOS-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1831: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3) Master should send PingSlaveMessage instead of PING - Key: MESOS-1831 URL: https://issues.apache.org/jira/browse/MESOS-1831 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Adam B Labels: mesosphere In 0.21.0 master sends PING message with an embedded PingSlaveMessage for backwards compatibility (https://reviews.apache.org/r/25867/). In 0.22.0, master should send PingSlaveMessage directly instead of PING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2110) Configurable Ping Timeouts
[ https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2110: -- Shepherd: Niklas Quarfot Nielsen Configurable Ping Timeouts -- Key: MESOS-2110 URL: https://issues.apache.org/jira/browse/MESOS-2110 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Adam B Assignee: Adam B Labels: master, network, slave, timeout After a series of ping-failures, the master considers the slave lost and calls shutdownSlave, requiring such a slave that reconnects to kill its tasks and re-register as a new slaveId. On the other side, after a similar timeout, the slave will consider the master lost and try to detect a new master. These timeouts are currently hardcoded constants (5 * 15s), which may not be well-suited for all scenarios. - Some clusters may tolerate a longer slave process restart period, and wouldn't want tasks to be killed upon reconnect. - Some clusters may have higher-latency networks (e.g. cross-datacenter, or for volunteer computing efforts), and would like to tolerate longer periods without communication. We should provide flags/mechanisms on the master to control its tolerance for non-communicative slaves, and (less importantly?) on the slave to tolerate missing masters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2351) Enable label and environment decorators (hooks) to remove label and environment entries
[ https://issues.apache.org/jira/browse/MESOS-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2351: -- Shepherd: Adam B Enable label and environment decorators (hooks) to remove label and environment entries --- Key: MESOS-2351 URL: https://issues.apache.org/jira/browse/MESOS-2351 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen We need to change the semantics of decorators to be able to not only add labels and environment variables, but also remove them. The change is fairly small: the hook manager (and call site) uses CopyFrom instead of MergeFrom, and hook implementors pass on the labels and environment from the task and executor commands respectively. In the future, we can tag labels such that only labels belonging to a hook type (across master and slave) can be inspected and changed. For now, the active hooks are selected by the operator and can therefore be trusted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
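To spell out why the CopyFrom/MergeFrom distinction matters here: MergeFrom appends repeated protobuf fields (so a hook could only ever add labels), while CopyFrom replaces them (so a hook that returns a filtered set effectively removes entries). The sketch below shows the same append-versus-replace behavior with plain std::vector stand-ins rather than the actual Labels/Environment protobufs.
{code}
// Append-vs-replace semantics behind MergeFrom vs CopyFrom, shown with plain
// std::vector stand-ins for the Labels protobuf. Illustrative only.
#include <iostream>
#include <string>
#include <vector>

using Labels = std::vector<std::string>;

// MergeFrom-like: the hook's result is appended, so entries can only be added.
void mergeFrom(Labels& target, const Labels& fromHook) {
  target.insert(target.end(), fromHook.begin(), fromHook.end());
}

// CopyFrom-like: the hook's result replaces the existing labels, so a hook
// that passes through a filtered copy effectively removes entries.
void copyFrom(Labels& target, const Labels& fromHook) {
  target = fromHook;
}

int main() {
  Labels original = {"rack=r1", "secret=yes"};

  Labels merged = original;
  mergeFrom(merged, {"hooked=true"});            // {"rack=r1", "secret=yes", "hooked=true"}

  Labels copied = original;
  copyFrom(copied, {"rack=r1", "hooked=true"});  // "secret=yes" is gone.

  std::cout << merged.size() << " vs " << copied.size() << std::endl;  // 3 vs 2
  return 0;
}
{code}
The trade-off is that with CopyFrom each hook is responsible for passing through any existing entries it wants to keep, which is exactly what the description asks implementors to do.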
[jira] [Commented] (MESOS-2317) Remove deprecated checkpoint=false code
[ https://issues.apache.org/jira/browse/MESOS-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377020#comment-14377020 ] Adam B commented on MESOS-2317: --- https://reviews.apache.org/r/31539/ Remove deprecated checkpoint=false code --- Key: MESOS-2317 URL: https://issues.apache.org/jira/browse/MESOS-2317 Project: Mesos Issue Type: Epic Affects Versions: 0.22.0 Reporter: Adam B Assignee: Joerg Schad Labels: checkpoint, mesosphere Cody's plan from MESOS-444 was: 1) Make it so the flag can't be changed at the command line 2) Remove the checkpoint variable entirely from slave/flags.hpp. This is a fairly involved change since a number of unit tests depend on manually setting the flag, as well as the default being non-checkpointing. 3) Remove logic around checkpointing in the slave 4) Drop the flag from the SlaveInfo struct, remove logic inside the master (Will require a deprecation cycle). Only 1) has been implemented/committed. This ticket is to track the remaining work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky
[ https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2226: -- Shepherd: Niklas Quarfot Nielsen HookTest.VerifySlaveLaunchExecutorHook is flaky --- Key: MESOS-2226 URL: https://issues.apache.org/jira/browse/MESOS-2226 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Kapil Arya Labels: flaky-test Observed this on internal CI {code} [ RUN ] HookTest.VerifySlaveLaunchExecutorHook Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME' I0114 18:51:34.659353 4720 leveldb.cpp:176] Opened db in 1.255951ms I0114 18:51:34.662112 4720 leveldb.cpp:183] Compacted db in 596090ns I0114 18:51:34.662364 4720 leveldb.cpp:198] Created db iterator in 177877ns I0114 18:51:34.662719 4720 leveldb.cpp:204] Seeked to beginning of db in 19709ns I0114 18:51:34.663010 4720 leveldb.cpp:273] Iterated through 0 keys in the db in 18208ns I0114 18:51:34.663312 4720 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0114 18:51:34.664266 4735 recover.cpp:449] Starting replica recovery I0114 18:51:34.664908 4735 recover.cpp:475] Replica is in EMPTY status I0114 18:51:34.667842 4734 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0114 18:51:34.669117 4735 recover.cpp:195] Received a recover response from a replica in EMPTY status I0114 18:51:34.677913 4735 recover.cpp:566] Updating replica status to STARTING I0114 18:51:34.683157 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 137939ns I0114 18:51:34.683507 4735 replica.cpp:323] Persisted replica status to STARTING I0114 18:51:34.684013 4735 recover.cpp:475] Replica is in STARTING status I0114 18:51:34.685554 4738 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0114 18:51:34.696512 4736 recover.cpp:195] Received a recover response from a replica in STARTING status I0114 18:51:34.700552 4735 recover.cpp:566] Updating replica status to VOTING I0114 18:51:34.701128 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 115624ns I0114 18:51:34.701478 4735 replica.cpp:323] Persisted replica status to VOTING I0114 18:51:34.701817 4735 recover.cpp:580] Successfully joined the Paxos group I0114 18:51:34.702569 4735 recover.cpp:464] Recover process terminated I0114 18:51:34.716439 4736 master.cpp:262] Master 20150114-185134-2272962752-57018-4720 (fedora-19) started on 192.168.122.135:57018 I0114 18:51:34.716913 4736 master.cpp:308] Master only allowing authenticated frameworks to register I0114 18:51:34.717136 4736 master.cpp:313] Master only allowing authenticated slaves to register I0114 18:51:34.717488 4736 credentials.hpp:36] Loading credentials for authentication from '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials' I0114 18:51:34.718077 4736 master.cpp:357] Authorization enabled I0114 18:51:34.719238 4738 whitelist_watcher.cpp:65] No whitelist given I0114 18:51:34.719755 4737 hierarchical_allocator_process.hpp:285] Initialized hierarchical allocator process I0114 18:51:34.722584 4736 master.cpp:1219] The newly elected leader is master@192.168.122.135:57018 with id 20150114-185134-2272962752-57018-4720 I0114 18:51:34.722865 4736 master.cpp:1232] Elected as the leading master! 
I0114 18:51:34.723310 4736 master.cpp:1050] Recovering from registrar I0114 18:51:34.723760 4734 registrar.cpp:313] Recovering registrar I0114 18:51:34.725229 4740 log.cpp:660] Attempting to start the writer I0114 18:51:34.727893 4739 replica.cpp:477] Replica received implicit promise request with proposal 1 I0114 18:51:34.728425 4739 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 114781ns I0114 18:51:34.728662 4739 replica.cpp:345] Persisted promised to 1 I0114 18:51:34.731271 4741 coordinator.cpp:230] Coordinator attemping to fill missing position I0114 18:51:34.733223 4734 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0114 18:51:34.734076 4734 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 87441ns I0114 18:51:34.734441 4734 replica.cpp:679] Persisted action at 0 I0114 18:51:34.740272 4739 replica.cpp:511] Replica received write request for position 0 I0114 18:51:34.740910 4739 leveldb.cpp:438] Reading position from leveldb took 59846ns I0114 18:51:34.741672 4739 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 189259ns I0114 18:51:34.741919 4739 replica.cpp:679] Persisted action at 0 I0114 18:51:34.743000 4739 replica.cpp:658] Replica
[jira] [Closed] (MESOS-2525) Missing information in Python interface launchTasks scheduler method
[ https://issues.apache.org/jira/browse/MESOS-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen closed MESOS-2525. - Resolution: Fixed commit 89186759a593c57fca9667d2a8980ca2e4b929c6 Author: Itamar Ostricher ita...@yowza3d.com Date: Mon Mar 23 17:46:56 2015 -0700 Updated launchTasks scheduler Python API docstring. Review: https://reviews.apache.org/r/32306 Missing information in Python interface launchTasks scheduler method Key: MESOS-2525 URL: https://issues.apache.org/jira/browse/MESOS-2525 Project: Mesos Issue Type: Documentation Components: python api Affects Versions: 0.21.0 Reporter: Itamar Ostricher Labels: documentation, newbie, patch Original Estimate: 1m Remaining Estimate: 1m The docstring of the launchTasks scheduler method in the Python API should explicitly state that launching multiple tasks onto multiple offers is supported only as long as all offers are from the same slave. See mailing list thread: http://www.mail-archive.com/user@mesos.apache.org/msg02861.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2317) Remove deprecated checkpoint=false code
[ https://issues.apache.org/jira/browse/MESOS-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2317: -- Sprint: Mesosphere Q1 Sprint 6 - 4/3 Epic Name: Slave Checkpointing Deprecation Shepherd: Adam B Labels: checkpoint mesosphere (was: checkpoint) Remove deprecated checkpoint=false code --- Key: MESOS-2317 URL: https://issues.apache.org/jira/browse/MESOS-2317 Project: Mesos Issue Type: Epic Affects Versions: 0.22.0 Reporter: Adam B Assignee: Joerg Schad Labels: checkpoint, mesosphere Cody's plan from MESOS-444 was: 1) Make it so the flag can't be changed at the command line 2) Remove the checkpoint variable entirely from slave/flags.hpp. This is a fairly involved change since a number of unit tests depend on manually setting the flag, as well as the default being non-checkpointing. 3) Remove logic around checkpointing in the slave 4) Drop the flag from the SlaveInfo struct, remove logic inside the master (Will require a deprecation cycle). Only 1) has been implemented/committed. This ticket is to track the remaining work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2436) Adapt unit test relying on non-checkpointing slaves
[ https://issues.apache.org/jira/browse/MESOS-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2436: -- Fix Version/s: 0.23.0 Labels: mesosphere (was: ) Adapt unit test relying on non-checkpointing slaves --- Key: MESOS-2436 URL: https://issues.apache.org/jira/browse/MESOS-2436 Project: Mesos Issue Type: Technical task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-2436) Adapt unit test relying on non-checkpointing slaves
[ https://issues.apache.org/jira/browse/MESOS-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B resolved MESOS-2436. --- Resolution: Fixed Adapt unit test relying on non-checkpointing slaves --- Key: MESOS-2436 URL: https://issues.apache.org/jira/browse/MESOS-2436 Project: Mesos Issue Type: Technical task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2436) Adapt unit test relying on non-checkpointing slaves
[ https://issues.apache.org/jira/browse/MESOS-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377109#comment-14377109 ] Adam B commented on MESOS-2436: --- commit b2f73095fd168a75c2754f26d5368f4cff414752 Author: Joerg Schad jo...@mesosphere.io Date: Mon Mar 23 17:03:28 2015 -0700 Remove the checkpoint variable entirely from slave/flags.hpp. As a number of tests rely on the checkpointing flag to be false, a few tests had to be adapted. Removed the following test as the tested logic is specific to (old) non-checkpointing slaves: SlaveRecoveryTest.NonCheckpointingSlave: This test checks whether a non-checkpointing slave is not scheduled to a checkpointing framework. It can be removed as all slaves are now checkpointing slaves. Review: https://reviews.apache.org/r/31539 Adapt unit test relying on non-checkpointing slaves --- Key: MESOS-2436 URL: https://issues.apache.org/jira/browse/MESOS-2436 Project: Mesos Issue Type: Technical task Reporter: Joerg Schad Assignee: Joerg Schad -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2436) Adapt unit test relying on non-checkpointing slaves
[ https://issues.apache.org/jira/browse/MESOS-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377119#comment-14377119 ] Adam B commented on MESOS-2436: --- subtask Adapt unit test relying on non-checkpointing slaves --- Key: MESOS-2436 URL: https://issues.apache.org/jira/browse/MESOS-2436 Project: Mesos Issue Type: Technical task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2436) Adapt unit test relying on non-checkpointing slaves
[ https://issues.apache.org/jira/browse/MESOS-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377113#comment-14377113 ] Adam B commented on MESOS-2436: --- I believe this issue can be resolved now, since all that's left is removing the flag from the SlaveInfo struct, and removing logic inside the master, which can be tracked by another ticket under the parent epic MESOS-2317 Adapt unit test relying on non-checkpointing slaves --- Key: MESOS-2436 URL: https://issues.apache.org/jira/browse/MESOS-2436 Project: Mesos Issue Type: Technical task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2317) Remove deprecated checkpoint=false code
[ https://issues.apache.org/jira/browse/MESOS-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377129#comment-14377129 ] Adam B commented on MESOS-2317: --- Looks like steps 1, 2, and 3? are committed. Time for a new issue for step 4, and then it's all done! Remove deprecated checkpoint=false code --- Key: MESOS-2317 URL: https://issues.apache.org/jira/browse/MESOS-2317 Project: Mesos Issue Type: Epic Affects Versions: 0.22.0 Reporter: Adam B Assignee: Joerg Schad Labels: checkpoint, mesosphere Cody's plan from MESOS-444 was: 1) Make it so the flag can't be changed at the command line 2) Remove the checkpoint variable entirely from slave/flags.hpp. This is a fairly involved change since a number of unit tests depend on manually setting the flag, as well as the default being non-checkpointing. 3) Remove logic around checkpointing in the slave 4) Drop the flag from the SlaveInfo struct, remove logic inside the master (Will require a deprecation cycle). Only 1) has been implemented/committed. This ticket is to track the remaining work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)