[Devel] Re: C/R minisummit notes (namespace naming)
Daniel Lezcano [EMAIL PROTECTED] writes: Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know whether a network namespace is dead or not, as the uevent socket will not deliver an event outside of the container. My question remains: who cares? The container implementation in userspace. Let's imagine it sets some routes outside of the container to route traffic to the container. It should remove these routes when the container dies. And the container should be considered dead when the network has died, not when the last process of the container exits.

Namespaces can definitely live on long past the time when there are any tasks that point to them from nsproxy, and knowing when that happens would be nice. So settling on pids for names would be nice, as that would allow us to restructure /proc so that we could see those kinds of things. That said, I am less certain of the need to actually wait for a network namespace to exit once we start killing virtual network devices. It was mentioned that ip-over-ip tunnels don't currently have a dellink method, so we will still need a way to wait to handle that case. Similarly, in general we need to wait until the network namespace exits to ensure we flush all of the outgoing packets at container shutdown. So I propose we merge the code to delete virtual devices and wait on them, and then recheck to see what uses we actually have left. Eric

___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric

I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? -serge
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric

I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted?

IMO, if we consider a container as an aggregation of different namespaces, we should consider the container dead when all the namespaces are dead. One good example is an application run inside a container doing bulk data writing over the network. When the application finishes its last call to send, it exits. At this point there are no more processes running inside the container, but we cannot consider the container dead because there is still some pending data in the socket to be delivered to the peer.
Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know whether a network namespace is dead or not, as the uevent socket will not deliver an event outside of the container.
[Devel] Re: C/R minisummit notes (namespace naming)
Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric

I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted?

IMO, if we consider a container as an aggregation of different namespaces, we should consider the container dead when all the namespaces are dead. One good example is an application run inside a container doing bulk data writing over the network. When the application finishes its last call to send, it exits. At this point there are no more processes running inside the container, but we cannot consider the container dead because there is still some pending data in the socket to be delivered to the peer.
Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know whether a network namespace is dead or not, as the uevent socket will not deliver an event outside of the container.

My question remains: who cares? -serge
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric

I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted?

IMO, if we consider a container as an aggregation of different namespaces, we should consider the container dead when all the namespaces are dead. One good example is an application run inside a container doing bulk data writing over the network. When the application finishes its last call to send, it exits. At this point there are no more processes running inside the container, but we cannot consider the container dead because there is still some pending data in the socket to be delivered to the peer.
Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know whether a network namespace is dead or not, as the uevent socket will not deliver an event outside of the container. My question remains: who cares?

In the context of CR, you'd care if you migrate a container including its network stack. In that case, you want to make sure that: (1) you save sockets that have data in their (send) queue but are otherwise not attached to any specific process, and (2) you disable these sockets on the source machine as soon as you enable the container on the target machine. Rethinking this, Serge is probably right, because once you migrate the network to the target node you disable the network (of that container) on the source node, so you don't care about #2 there anymore... Oren.
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric

I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted?

IMO, if we consider a container as an aggregation of different namespaces, we should consider the container dead when all the namespaces are dead. One good example is an application run inside a container doing bulk data writing over the network. When the application finishes its last call to send, it exits. At this point there are no more processes running inside the container, but we cannot consider the container dead because there is still some pending data in the socket to be delivered to the peer.
Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know whether a network namespace is dead or not, as the uevent socket will not deliver an event outside of the container. My question remains: who cares?

The container implementation in userspace. Let's imagine it sets some routes outside of the container to route traffic to the container. It should remove these routes when the container dies. And the container should be considered dead when the network has died, not when the last process of the container exits.
[Devel] Re: C/R minisummit notes (namespace naming)
Currently we have three possibilities on how to name pid namespaces:
- indirect via processes
- pids
- names in the filesystem
We discussed this a bit in the hallway track, and pids look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular, and namespaces in general, to exit. At first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more, it really is fundamentally just an exit-status reaper of zombies that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD, either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric
[Devel] Re: C/R minisummit notes
Serge E. Hallyn [EMAIL PROTECTED] writes: No no, the idea (IIUC) is that if you want to do a very short-downtime migrate, you stay in step 1 for a long time, writing the container memory to disk, checking how different the disk image is from the memory image, updating the version on disk, checking again, etc. Then, when you decide that the disk and memory are very close together, you quickly do steps 2-4, where 4 in this case is kill. In the meantime you would have been loading the disk data into memory ahead of time at the new machine, so you can also quickly complete the restart. So 3, 'Dump', in this case really becomes dump the metadata and any more changes that have happened. Presumably, if when you get to 3 you find that there was suddenly a lot of activity and there is too much data to write quickly, you bail on the migrate and step 4 is a resume rather than a kill. Then you start again at step 1. At least that was my understanding.

Yes. To some extent you need those steps separate in the kernel so you can coordinate with filesystem snapshots and the like. Despite being in one large syscall, we still have a few other small pieces of userspace we need to coordinate with. Eric
[Devel] Re: C/R minisummit notes
Let's have some more breakfast tomorrow - Friday - morning. Same place, same time. If it doesn't rain we'll go outside ;) Oren.
[Devel] Re: C/R minisummit notes
Oren Laadan wrote: Let's have some more breakfast tomorrow - Friday - morning. Same place, same time. If it doesn't rain we'll go outside ;)

Acked-by: Daniel Lezcano [EMAIL PROTECTED]
[Devel] Re: C/R minisummit notes
Daniel Lezcano [EMAIL PROTECTED] writes: * What are the problems that the linux community can solve with checkpoint/restart? Eric Biederman reminds that at the previous OLS nobody complained about checkpoint/restart

Kernel summit. Not OLS. Which is a room packed full of maintainers. It isn't an endorsement, but it also wasn't such a scary idea that people immediately rejected it, either. Eric
[Devel] Re: C/R minisummit notes
Quoting Oren Laadan ([EMAIL PROTECTED]): Hi, I've placed a somewhat more detailed summary on the wiki: http://wiki.openvz.org/Containers/Mini-summit_2008_notes (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008) To further discuss technical details, let's schedule to meet while we are here for the OLS. I suggest the following for a start: 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :) 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center. Please confirm your participation.

Hey, I've committed to dinner with another group. I'm definitely up for breakfast, so I'll be at the 3rd floor congress center at the mall entrance at 8:30am. thanks, -serge
Re: [Devel] Re: C/R minisummit notes
On Wed, 2008-07-23 at 14:55 -0400, Oren Laadan wrote: Hi, I've placed a somewhat more detailed summary on the wiki: http://wiki.openvz.org/Containers/Mini-summit_2008_notes (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008) To further discuss technical details, let's schedule to meet while we are here for the OLS. I suggest the following for a start: 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)

+1 I think that we could meet at the registration desk. Any objections?

2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.

+1 Regards, Den
[Devel] Re: C/R minisummit notes
Oren Laadan wrote: Hi, I've placed a somewhat more detailed summary on the wiki: http://wiki.openvz.org/Containers/Mini-summit_2008_notes (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008) To further discuss technical details, let's schedule to meet while we are here for the OLS. I suggest the following for a start: 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :) 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center. Please confirm your participation.

Benjamin, Dave and I will come. Pavel, Denis and Andrey will come too. I am looking for the Kerrighed guys to come too. We meet in the hall of the congress at 7:00pm
[Devel] Re: C/R minisummit notes
Quoting Daniel Lezcano ([EMAIL PROTECTED]):

* What are the problems that the linux community can solve with checkpoint/restart?

Eric Biederman reminds that at the previous OLS nobody complained about checkpoint/restart

Pavel Emelianov: The startup of Oracle takes some minutes; if we checkpoint just after the startup, Oracle can be restarted from this point later and provide fast startup

Oren Laadan: Time travel, we can do monotonic snapshots and go back to one of these snapshots.

Eric Biederman: Priority running, checkpoint/kill an application and run another application with a higher priority

Denis Lunev: Task migration, move an application from one host to another

Daniel Lezcano: SSI (task migration)

* Preparing the kernel internals

OL: Can we implement a kernel module and move CR functionality into the kernel itself later?
EB: Better to add a little CR functionality into the kernel itself and add more later.
DLu: Problem with kernel versions
OL: Compatibility with intermediate kernel versions should be possible with userspace conversion tools
DLu: A non-sequential file for the checkpoint statefile is a challenge
OL: yes, but possible and useful for compression/encryption

We showed that there are five steps to realize a checkpoint:

1 - Pre-dump

I'd just add here that the pre-dump is where you might start writing memory to disk, trying to get disk and memory closer and closer to being the same until, at some point, you decide they are close enough that you can go on to step two, and attempt the freeze+dump+migrate/kill with minimal downtime. Coming into the discussion my primary concern had been that doing a sys_checkpoint() system call would be tough to augment to provide this kind of incremental checkpoint, but this breakdown is great for that.

2 - Freeze
3 - Dump
4 - Resume/kill
5 - Post-dump

At this point we state that we want to create a proof of concept and checkpoint/restart the simplest application.
By which we mean, start with a piece of step 3 (and maybe a bit of step 4). Step 2 was pretty widely accepted to be the freezer subsystem, but no one seemed to be sure quite what the status of that was. Matt, can you remind us how the freezer cgroup is doing?

We will add iteratively more and more kernel resources.

Process hierarchy created from kernel or userspace?
OL: Seems better to send a chunk of data to the kernel and have that restore the process hierarchy
PE: Agreed
OL: We should be able to checkpoint from inside the container; keep that in mind for later. = we need a syscall or an ioctl

The first items to address before implementing the checkpoint are:
1 - Make a container object (the context)
2 - Freeze the container (extend the cgroup freezer?)
3 - syscall | ioctl

First step:
* simplest application: a single process, without any file, no checkpoint of the text file (same file system for restart), no signals, no syscalls in the application, no ipc/no msgq, no network
Second step:
* multiple processes + zombie state
Third step:
* files, pipes, signals, socketpair?

This proof of concept must come with documentation describing what is supported, what is not supported, and what we plan to do.

And there was talk of making sure that if you attempt to checkpoint an app using unsupported resources, we return -EAGAIN. There had been murmurings about giving more meaningful feedback, but I have no idea what that would look like. -serge
[Devel] Re: C/R minisummit notes
Serge E. Hallyn wrote: Quoting Daniel Lezcano ([EMAIL PROTECTED]):

* What are the problems that the linux community can solve with checkpoint/restart?

Eric Biederman reminds that at the previous OLS nobody complained about checkpoint/restart

Pavel Emelianov: The startup of Oracle takes some minutes; if we checkpoint just after the startup, Oracle can be restarted from this point later and provide fast startup

Oren Laadan: Time travel, we can do monotonic snapshots and go back to one of these snapshots.

Eric Biederman: Priority running, checkpoint/kill an application and run another application with a higher priority

Denis Lunev: Task migration, move an application from one host to another

Daniel Lezcano: SSI (task migration)

* Preparing the kernel internals

OL: Can we implement a kernel module and move CR functionality into the kernel itself later?
EB: Better to add a little CR functionality into the kernel itself and add more later.
DLu: Problem with kernel versions
OL: Compatibility with intermediate kernel versions should be possible with userspace conversion tools
DLu: A non-sequential file for the checkpoint statefile is a challenge
OL: yes, but possible and useful for compression/encryption

We showed that there are five steps to realize a checkpoint:

1 - Pre-dump

I'd just add here that the pre-dump is where you might start writing memory to disk, trying to get disk and memory closer and closer to being the same until, at some point, you decide they are close enough that you can go on to step two, and attempt the freeze+dump+migrate/kill with minimal downtime. Coming into the discussion my primary concern had been that doing a sys_checkpoint() system call would be tough to augment to provide this kind of incremental checkpoint, but this breakdown is great for that.

2 - Freeze
3 - Dump
4 - Resume/kill
5 - Post-dump

At this point we state that we want to create a proof of concept and checkpoint/restart the simplest application.
By which we mean, start with a piece of step 3 (and maybe a bit of step 4).

Step 4 is also part of the freezer -- it's the unfreeze operation (or force a SIGKILL to all processes in the container).

Step 2 was pretty widely accepted to be the freezer subsystem, but no one seemed to be sure quite what the status of that was. Matt, can you remind us how the freezer cgroup is doing?

We will add iteratively more and more kernel resources.

Process hierarchy created from kernel or userspace?
OL: Seems better to send a chunk of data to the kernel and have that restore the process hierarchy
PE: Agreed
OL: We should be able to checkpoint from inside the container; keep that in mind for later. = we need a syscall or an ioctl

The first items to address before implementing the checkpoint are:
1 - Make a container object (the context)
2 - Freeze the container (extend the cgroup freezer?)
3 - syscall | ioctl

First step:
* simplest application: a single process, without any file, no checkpoint of the text file (same file system for restart), no signals, no syscalls in the application, no ipc/no msgq, no network
Second step:
* multiple processes + zombie state
Third step:
* files, pipes, signals, socketpair?

This proof of concept must come with documentation describing what is supported, what is not supported, and what we plan to do.

And there was talk of making sure that if you attempt to checkpoint an app using unsupported resources, we return -EAGAIN. There had been murmurings about giving more meaningful feedback, but I have no idea what that would look like.

yes. some of it is mentioned in the notes that I put in the wiki.

-serge
[Devel] Re: C/R minisummit notes
Oren Laadan [EMAIL PROTECTED] wrote:
|
| Serge E. Hallyn wrote:
| Quoting Daniel Lezcano ([EMAIL PROTECTED]):
| * What are the problems that the linux community can solve with checkpoint/restart?
|
| Eric Biederman reminds that at the previous OLS nobody complained about checkpoint/restart
|
| Pavel Emelianov: The startup of Oracle takes some minutes; if we checkpoint just after the startup, Oracle can be restarted from this point later and provide fast startup
|
| Oren Laadan: Time travel, we can do monotonic snapshots and go back to one of these snapshots.
|
| Eric Biederman: Priority running, checkpoint/kill an application and run another application with a higher priority
|
| Denis Lunev: Task migration, move an application from one host to another
|
| Daniel Lezcano: SSI (task migration)
|
| * Preparing the kernel internals
|
| OL: Can we implement a kernel module and move CR functionality into the kernel itself later?
| EB: Better to add a little CR functionality into the kernel itself and add more later.
| DLu: Problem with kernel versions
| OL: Compatibility with intermediate kernel versions should be possible with userspace conversion tools
| DLu: A non-sequential file for the checkpoint statefile is a challenge
| OL: yes, but possible and useful for compression/encryption
|
| We showed that there are five steps to realize a checkpoint:
|
| 1 - Pre-dump
|
| I'd just add here that the pre-dump is where you might start writing memory to disk, trying to get disk and memory closer and closer to being the same until, at some point, you decide they are close enough that you can go on to step two, and attempt the freeze+dump+migrate/kill with minimal downtime.
|
| Coming into the discussion my primary concern had been that doing a sys_checkpoint() system call would be tough to augment to provide this kind of incremental checkpoint, but this breakdown is great for that.
|
| 2 - Freeze
| 3 - Dump
| 4 - Resume/kill
| 5 - Post-dump
|
| At this point we state that we want to create a proof of concept and checkpoint/restart the simplest application.
|
| By which we mean, start with a piece of step 3 (and maybe a bit of step 4).
|
| step 4 is also part of the freezer -- it's the unfreeze operation (or force a SIGKILL to all processes in the container).

Are steps 1-5 considered part of the sys_checkpoint() system call, and if successful does sys_checkpoint() return after step 5? If so, like Serge points out, it would be harder to optimize for incremental checkpoints (as each sys_checkpoint() would be independent)? But that may not be something to worry about for the POC.
[Devel] Re: C/R minisummit notes
Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]): Oren Laadan [EMAIL PROTECTED] wrote:
|
| Serge E. Hallyn wrote:
| Quoting Daniel Lezcano ([EMAIL PROTECTED]):
| * What are the problems that the linux community can solve with checkpoint/restart?
|
| Eric Biederman reminds that at the previous OLS nobody complained about checkpoint/restart
|
| Pavel Emelianov: The startup of Oracle takes some minutes; if we checkpoint just after the startup, Oracle can be restarted from this point later and provide fast startup
|
| Oren Laadan: Time travel, we can do monotonic snapshots and go back to one of these snapshots.
|
| Eric Biederman: Priority running, checkpoint/kill an application and run another application with a higher priority
|
| Denis Lunev: Task migration, move an application from one host to another
|
| Daniel Lezcano: SSI (task migration)
|
| * Preparing the kernel internals
|
| OL: Can we implement a kernel module and move CR functionality into the kernel itself later?
| EB: Better to add a little CR functionality into the kernel itself and add more later.
| DLu: Problem with kernel versions
| OL: Compatibility with intermediate kernel versions should be possible with userspace conversion tools
| DLu: A non-sequential file for the checkpoint statefile is a challenge
| OL: yes, but possible and useful for compression/encryption
|
| We showed that there are five steps to realize a checkpoint:
|
| 1 - Pre-dump
|
| I'd just add here that the pre-dump is where you might start writing memory to disk, trying to get disk and memory closer and closer to being the same until, at some point, you decide they are close enough that you can go on to step two, and attempt the freeze+dump+migrate/kill with minimal downtime.
|
| Coming into the discussion my primary concern had been that doing a sys_checkpoint() system call would be tough to augment to provide this kind of incremental checkpoint, but this breakdown is great for that.
|
| 2 - Freeze
| 3 - Dump
| 4 - Resume/kill
| 5 - Post-dump
|
| At this point we state that we want to create a proof of concept and checkpoint/restart the simplest application.
|
| By which we mean, start with a piece of step 3 (and maybe a bit of step 4).
|
| step 4 is also part of the freezer -- it's the unfreeze operation (or force a SIGKILL to all processes in the container).

Are steps 1-5 considered part of the sys_checkpoint() system call, and if successful does sys_checkpoint() return after step 5? If so, like Serge points out, it would be harder to optimize for incremental checkpoints (as each sys_checkpoint() would be independent)?

No no, the idea (IIUC) is that if you want to do a very short-downtime migrate, you stay in step 1 for a long time, writing the container memory to disk, checking how different the disk image is from the memory image, updating the version on disk, checking again, etc. Then, when you decide that the disk and memory are very close together, you quickly do steps 2-4, where 4 in this case is kill. In the meantime you would have been loading the disk data into memory ahead of time at the new machine, so you can also quickly complete the restart. So 3, 'Dump', in this case really becomes dump the metadata and any more changes that have happened. Presumably, if when you get to 3 you find that there was suddenly a lot of activity and there is too much data to write quickly, you bail on the migrate and step 4 is a resume rather than a kill. Then you start again at step 1. At least that was my understanding.

But may not be something to worry about for POC.