I'll make sure this gets fixed for 1.0. Apologies for the pain; it looks like there is a significant amount of technical debt in the Docker containerizer / executor.
On Wed, May 18, 2016 at 10:51 AM, Steven Schlansker <sschlans...@opentable.com> wrote:

> > On May 18, 2016, at 10:44 AM, haosdent <haosd...@gmail.com> wrote:
> >
> > >In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye.) My naive understanding of this grace period is that it simply delays the termination of the executor.
>
> I'm not 100% sure this is related or helpful, but be aware that we believe there is a bug in the Docker containerizer's handling of logs during shutdown:
>
> https://issues.apache.org/jira/browse/MESOS-5195
>
> We spent a lot of time debugging why our application was not shutting down as we expected, only to find that the real problem was that Mesos was losing all logs sent during shutdown.
>
> > If you use the DockerContainerizer, Mesos uses executor_shutdown_grace_period as the graceful-shutdown timeout for the task as well. If you use the MesosContainerizer, it sends SIGTERM (15) first; if the task is still alive after 3 seconds, Mesos then sends SIGKILL (9).
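For reference, here is a minimal sketch of what raising that window looks like on the agent, using an illustrative 60-second value. The master address is a placeholder, and the exact flag set has shifted between releases, so confirm the names with --help on your Mesos build:

    # Launch the agent with a longer graceful-shutdown window, so tasks get
    # more time between the SIGTERM and the eventual SIGKILL.
    mesos-slave --master=zk://<zk-host>:2181/mesos \
                --containerizers=docker,mesos \
                --executor_shutdown_grace_period=60secs

    # Some agent versions also expose --docker_stop_timeout, which controls how
    # long `docker stop` waits between SIGTERM and SIGKILL for Docker tasks.

The extra time only helps if the process in the container actually handles SIGTERM; mongod is designed to shut down cleanly on it, provided the signal reaches the container's main process and the window is long enough.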
> > >I'm not sure what the java task is. This took place on the mesos-master node and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK?
> >
> > Not sure about this; maybe others have had a similar experience. Were Marathon or ZooKeeper behaving abnormally at that time? Could you also provide the mesos-master/mesos-slave logs from when the incident happened?
> >
> > On Wed, May 18, 2016 at 7:11 PM, Paul Bell <arach...@gmail.com> wrote:
> >
> > Hi Haosdent,
> >
> > Thanks for your reply.
> >
> > In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye.) My naive understanding of this grace period is that it simply delays the termination of the executor.
> >
> > The following snippet is from /var/log/syslog. I believe it shows the stack trace (largely in the kernel) that led to mesos-master being blocked for more than 120 seconds. Please note that immediately above (before) the blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the blocked mesos-master is a blocked java task. I'm not sure what the java task is. This took place on the mesos-master node and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK?
> >
> > Thanks again.
> >
> > -Paul
> >
> > May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds.
> > May 16 20:06:53 71 kernel: [193339.890873]   Not tainted 3.13.0-32-generic #57-Ubuntu
> > May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D ffff88013fd94440    0  4013    1 0x00000000
> > May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28 0000000000000002 ffff880135778000 ffff880137429fd8
> > May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440 0000000000014440 ffff880135778000 ffff88013fd94cd8
> > May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0 0000000000000002 ffffffff81284630 ffff880137429aa0
> > May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
> > May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> > May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>] io_schedule+0x9d/0x140
> > May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>] sleep_on_shadow_bh+0xe/0x20
> > May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>] __wait_on_bit+0x62/0x90
> > May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> > May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>] out_of_line_wait_on_bit+0x77/0x90
> > May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ? autoremove_wake_function+0x40/0x40
> > May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ? wake_up_bit+0x25/0x30
> > May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>] do_get_write_access+0x2ad/0x4f0
> > May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ? __getblk+0x2d/0x2e0
> > May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>] jbd2_journal_get_write_access+0x27/0x40
> > May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>] __ext4_journal_get_write_access+0x3b/0x80
> > May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>] ext4_reserve_inode_write+0x70/0xa0
> > May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ? ext4_dirty_inode+0x40/0x60
> > May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>] ext4_mark_inode_dirty+0x44/0x1f0
> > May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>] ext4_dirty_inode+0x40/0x60
> > May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>] __mark_inode_dirty+0x10a/0x2d0
> > May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>] update_time+0x81/0xd0
> > May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>] file_update_time+0x80/0xd0
> > May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>] __generic_file_aio_write+0x180/0x3d0
> > May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>] generic_file_aio_write+0x58/0xa0
> > May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>] ext4_file_write+0x99/0x400
> > May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ? wake_up_state+0x10/0x20
> > May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ? wake_futex+0x66/0x90
> > May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ? futex_wake+0x1b1/0x1d0
> > May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>] do_sync_write+0x5a/0x90
> > May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>] vfs_write+0xb4/0x1f0
> > May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>] SyS_write+0x49/0xa0
> > May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>] tracesys+0xe1/0xe6
> > On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote:
> >
> > >Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
> >
> > The default value of executor_shutdown_grace_period is 5 seconds. You can change it by specifying the `--executor_shutdown_grace_period` flag when launching the Mesos agent.
> >
> > >Are there other steps I can take to avoid this mildly calamitous occurrence?
> > >mesos-slaves get shutdown
> >
> > Do you know where your mesos-master was stuck when this happened? Is there any error log or other related log about it? In addition, is there any log from when the mesos-slave shut down?
> >
> > On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote:
> >
> > Hi All,
> >
> > I probably have the following account partly wrong, but let me present it just the same and those who know better can correct me as needed.
> >
> > I've an application that runs several MongoDB shards, each in a Dockerized container, each on a distinct node (VM); in fact, some of the VMs are on separate ESXi hosts.
> >
> > I've lately seen situations where, because of very slow disks for the database, the following sequence occurs (I think):
> >
> >   1. The Linux (Ubuntu 14.04 LTS) virtual memory manager hits the thresholds defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both).
> >   2. Synchronous flushing of many, many pages occurs, writing to a slow disk.
> >   3. Around this time one might see in /var/log/syslog "task X blocked for more than 120 seconds" for all kinds of tasks, including mesos-master.
> >   4. The mesos-slaves get shut down (this is the part I'm unclear about, but I am quite certain that on 2 nodes the executors and their in-flight MongoDB tasks got zapped, because I can see that Marathon restarted them).
> >
> > The consequence of this is a corrupt MongoDB database. In the case at hand, the job had run for over 50 hours, processing close to 120 million files.
> >
> > Steps I've taken so far to remedy this include:
> >
> >   • Tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent, smaller flushes and thus avoid the less frequent, massive flushes that suspend threads for very long periods (see the sysctl sketch below, after the quoted thread).
> >   • Increase the agent ping timeout to 10 minutes (every 30 seconds, 20 times; see the master-flag sketch below).
> >
> > So the questions are:
> >
> >   • Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
> >   • Are there other steps I can take to avoid this mildly calamitous occurrence?
> >   • (Also, I'd be grateful for more clarity on anything in steps 1-4 above that is a bit hand-wavy!)
> >
> > As always, thanks.
> >
> > -Paul
>
> --
> Best Regards,
> Haosdent Huang
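On the vm.dirty tuning Paul describes above: here is a minimal sketch of setting those values persistently on Ubuntu 14.04, using the numbers from his message. The file name is arbitrary; the sysctl keys themselves are standard kernel knobs:

    # /etc/sysctl.d/60-vm-writeback.conf  (file name is just an example)
    # Start background writeback earlier and cap dirty pages lower, so flushes
    # are smaller and more frequent instead of rare and massive.
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 10

    # Apply without rebooting, then verify:
    sudo sysctl -p /etc/sysctl.d/60-vm-writeback.conf
    sysctl vm.dirty_background_ratio vm.dirty_ratio

An alternative worth considering is vm.dirty_background_bytes / vm.dirty_bytes, which express the same thresholds in absolute bytes rather than as a percentage of RAM, and so behave more predictably on large-memory hosts.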
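And for the "agent ping timeout to 10 minutes (every 30 seconds, 20 times)" change: those knobs are master-side flags. A sketch, assuming the master is launched by hand; adjust the ZooKeeper address, quorum, and work_dir to your deployment and confirm the flag names with mesos-master --help on your version:

    # Ping agents every 30s and tolerate 20 missed pings (~10 minutes) before
    # the master declares an agent lost and its tasks/executors are torn down.
    mesos-master --zk=zk://<zk-host>:2181/mesos \
                 --quorum=1 \
                 --work_dir=/var/lib/mesos \
                 --slave_ping_timeout=30secs \
                 --max_slave_ping_timeouts=20

The trade-off is that a genuinely dead node also takes roughly 10 minutes to be noticed, so Marathon will be correspondingly slower to replace tasks after a real failure.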