[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-26 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

 Eric will post a patch to automatically destroy the virtual devices when the
 netns is destroyed, so there is no way to know if a network namespace is
 dead or not, as the uevent socket will not deliver an event outside of the
 container.

 My question remains: who cares?

 The container implementation in userspace. Let's imagine it sets some routes
 outside of the container to route the traffic to the container. It should
 remove these routes when the container dies. And the container should be
 considered dead when the network has died and not when the last process of
 the container exits.

Namespaces can definitely live on long after the last task pointing to them
from an nsproxy has exited, and knowing when that happens would be nice.
So settling on pids for names would be nice, as that would allow us to
restructure /proc so that we could see those kinds of things.

That said, I am less certain of the need to actually wait for a network
namespace to exit once we start killing virtual network devices.

It was mentioned that ip-over-ip tunnels don't currently have a dellink
method, so we will still need a wait to handle that case.

Similarly, in general we need to wait until the network namespace exits to ensure
we flush all of the outgoing packets at container shutdown.

So I propose we merge the code to delete virtual devices on namespace exit,
and then recheck to see what uses we actually have left.

Eric






[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem
 
 We discussed this a bit in the hallway track and pids look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the
 network namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an
 exit status reaper of zombies, that has the option of blocking when the
 zombies do not yet exist.  In any kind of event loop you would wait for
 SIGCHLD either as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight
 suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);
 
 Eric

I'm sorry, I'm still not quite clear on...

Why?

You care about when the tasks exit, and you care about when network
devices, for instance, need to be deleted (which you can presumably
get uevents for, when they get moved back into init_net_ns).
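
(A rough sketch of such a uevent listener, for illustration; whether a device
coming back to init_net_ns shows up as ACTION=add or ACTION=move is an
assumption to check, as are the names used here:)

/* Sketch only: listen on the kernel uevent netlink socket and report events
 * for network devices. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void)
{
    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_groups = 1,            /* uevent multicast group */
    };
    char buf[4096];
    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("uevent socket");
        return 1;
    }

    for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);
        if (len <= 0)
            break;
        buf[len] = '\0';
        /* Messages look like "add@/class/net/veth0" followed by
         * NUL-separated KEY=VALUE pairs; just print the header here. */
        if (strstr(buf, "/net/"))
            printf("net uevent: %s\n", buf);
    }
    close(fd);
    return 0;
}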

Why do you care when the struct net actually gets deleted?

-serge


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Daniel Lezcano
Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the
 network namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an
 exit status reaper of zombies, that has the option of blocking when the
 zombies do not yet exist.  In any kind of event loop you would wait for
 SIGCHLD either as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight
 suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 
 I'm sorry, I'm still not quite clear on...
 
 Why?
 
 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).
 
 Why do you care when the struct net actually gets deleted?

IMO, if we consider a container to be an aggregation of different
namespaces, we should consider the container dead when all the
namespaces are dead.

One good example is an application run inside a container doing
bulk data writing over the network. When the application finishes its last
call to send() it will exit. At this point, there are no more
processes running inside the container, but we cannot consider the
container dead because there is still some pending data in the
socket to be delivered to the peer.

Eric will post a patch to automatically destroy the virtual devices when
the netns is destroyed, so there is no way to know if a network
namespace is dead or not, as the uevent socket will not deliver an event
outside of the container.


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Serge E. Hallyn
Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the
 network namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an
 exit status reaper of zombies, that has the option of blocking when the
 zombies do not yet exist.  In any kind of event loop you would wait for
 SIGCHLD either as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight
 suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric

 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?

 IMO, if we consider a container to be an aggregation of different
 namespaces, we should consider the container dead when all the
 namespaces are dead.

 One good example is an application run inside a container doing
 bulk data writing over the network. When the application finishes its last
 call to send() it will exit. At this point, there are no more
 processes running inside the container, but we cannot consider the
 container dead because there is still some pending data in the
 socket to be delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when
 the netns is destroyed, so there is no way to know if a network
 namespace is dead or not, as the uevent socket will not deliver an event
 outside of the container.

My question remains: who cares?

-serge


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Oren Laadan


Serge E. Hallyn wrote:
 Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the
 network namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an
 exit status reaper of zombies, that has the option of blocking when the
 zombies do not yet exist.  In any kind of event loop you would wait for
 SIGCHLD either as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight
 suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?
 IMO, if we consider a container to be an aggregation of different
 namespaces, we should consider the container dead when all the
 namespaces are dead.

 One good example is an application run inside a container doing
 bulk data writing over the network. When the application finishes its last
 call to send() it will exit. At this point, there are no more
 processes running inside the container, but we cannot consider the
 container dead because there is still some pending data in the
 socket to be delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when
 the netns is destroyed, so there is no way to know if a network
 namespace is dead or not, as the uevent socket will not deliver an event
 outside of the container.
 
 My question remains: who cares?
 

In the context of CR, you'd care if you migrate a container including its
network stack. In that case, you want to make sure that:
(1) you save sockets that have data in their (send) queue but are otherwise
not attached to any specific process, and
(2) you disable these sockets at the source machine as soon as you enable
the container on the target machine.

Rethinking this, Serge is probably right, because once you migrate the network
to the target node you disable the network (of that container) on the source
node, so you don't care about #2 there anymore...

Oren.



[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Daniel Lezcano
Serge E. Hallyn wrote:
 Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the
 network namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an
 exit status reaper of zombies, that has the option of blocking when the
 zombies do not yet exist.  In any kind of event loop you would wait for
 SIGCHLD either as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight
 suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?
 IMO, if we consider a container to be an aggregation of different
 namespaces, we should consider the container dead when all the
 namespaces are dead.

 One good example is an application run inside a container doing
 bulk data writing over the network. When the application finishes its last
 call to send() it will exit. At this point, there are no more
 processes running inside the container, but we cannot consider the
 container dead because there is still some pending data in the
 socket to be delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when
 the netns is destroyed, so there is no way to know if a network
 namespace is dead or not, as the uevent socket will not deliver an event
 outside of the container.
 
 My question remains: who cares?

The container implementation in userspace. Let's imagine it sets some 
routes outside of the container to route the traffic to the container. 
It should remove these routes when the container dies. And the container 
should be considered dead when the network has died and not when the
last process of the container exits.
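
(To make that concrete, a small sketch of the sort of cleanup the container
manager could run once it learns the namespace is gone; the subnet, the device
name and the shortcut of shelling out to ip(8) are all just illustrative:)

/* Sketch only: tear down the host-side route that pointed traffic at a
 * container, once the container's network namespace is known to be gone. */
#include <stdio.h>
#include <stdlib.h>

static void container_net_cleanup(const char *subnet, const char *host_veth)
{
    char cmd[256];

    /* Remove the host route that steered traffic into the container. */
    snprintf(cmd, sizeof(cmd), "ip route del %s dev %s", subnet, host_veth);
    if (system(cmd) != 0)
        fprintf(stderr, "warning: could not remove route %s\n", subnet);
}

int main(void)
{
    /* ... block here until the namespace-exit notification arrives ... */
    container_net_cleanup("10.0.3.0/24", "veth_host0");   /* made-up names */
    return 0;
}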


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-24 Thread Eric W. Biederman

Currently we have three possibilities on how to name pid namespaces.
- indirect via processes
- pids
- names in the filesystem

We discussed this a bit in the hallway track and pids look like the way
to go. Pavel has a patch in progress to help sort this out.

The practical problem we have today is that we need a way to wait for the
network namespace in particular and namespaces in general to exit.

At a first glance waitid(P_NS, pid, ...) looks like a useful way to achieve
this.  After looking at wait a bit more it really is fundamentally just an
exit status reaper of zombies, that has the option of blocking when the
zombies do not yet exist.  In any kind of event loop you would wait for
SIGCHLD either as a signal or with signalfd.

So how shall we wait for a namespace to exit?  My brainstorm tonight suggests
inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);
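
(A rough sketch of how userspace could consume that, assuming the proposed
/proc/ns/<pid> entries existed; they do not today, so the path and the watch
semantics are purely illustrative:)

/* Sketch only: wait for a namespace to exit by watching the proposed
 * /proc/ns directory for deletion of the entry named after the namespace. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int ifd = inotify_init();

    if (ifd < 0) {
        perror("inotify_init");
        return 1;
    }
    /* Watch the (hypothetical) directory and wake up when the entry for
     * a namespace is removed, i.e. when that namespace finally exits. */
    if (inotify_add_watch(ifd, "/proc/ns", IN_DELETE) < 0) {
        perror("inotify_add_watch");
        return 1;
    }
    if (read(ifd, buf, sizeof(buf)) > 0) {
        struct inotify_event *ev = (struct inotify_event *)buf;
        printf("namespace %s went away\n", ev->len ? ev->name : "?");
    }
    close(ifd);
    return 0;
}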

Eric


[Devel] Re: C/R minisummit notes

2008-07-24 Thread Eric W. Biederman
Serge E. Hallyn [EMAIL PROTECTED] writes:

 No no, the idea (IIUC) is that if you want to do a very short-downtime
 migrate, you stay in step 1 for a long time, writing the container
 memory to disk, checking how different the disk img is from the memory
 image, updating the version on disk, checking again, etc.  Then when
 you decide that the disk and memory are very close together, you
 quickly do steps 2-4, where 4 in this case is kill.  In the meantime
 you would have been loading the disk data into memory ahead of time
 at the new machine, so you can also quickly complete the restart.

 So 3, 'Dump', in this case really becomes dump the metadata and any
 more changes that have happened.  Presumably, if when you get to 3,
 you find that there was suddenly a lot of activity and there is too
 much data to write quickly, you bail on the migrate and step 4 is
 a resume rather than kill.  Then you start again at step 1.

 At least that was my understanding.

Yes.  To some extent you need those steps separate in the kernel so you can
coordinate with filesystem snapshots and the like.

Despite being in one large syscall, we still have a few other small pieces
of userspace we need to coordinate with.

Eric


[Devel] Re: C/R minisummit notes

2008-07-24 Thread Oren Laadan

Let's have some more breakfast, tomorrow - Friday - morning.
Same place, same time. If it doesn't rain we'll go outside ;)

Oren.


[Devel] Re: C/R minisummit notes

2008-07-24 Thread Daniel Lezcano
Oren Laadan wrote:
 Let's have some more breakfast, tomorrow - Friday - morning.
 Same place, same time. If it doesn't rain we'll go outside ;)
 
Acked-by: Daniel Lezcano [EMAIL PROTECTED]


[Devel] Re: C/R minisummit notes

2008-07-23 Thread Eric W. Biederman
Daniel Lezcano [EMAIL PROTECTED] writes:

   * What are the problems that the Linux community can solve with
 checkpoint/restart?

   Eric Biederman reminds us that at the previous OLS nobody complained about
 checkpoint/restart

Kernel summit.  Not OLS.  Which is a room packed full of maintainers.
It isn't an endorsement, but it also isn't such a scary idea that people
immediately rejected it, either.

Eric


[Devel] Re: C/R minisummit notes

2008-07-23 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):
 Hi,
 
 I've placed a somewhat more detailed summary on the wiki:
 http://wiki.openvz.org/Containers/Mini-summit_2008_notes
 
 (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008)
 
 To further discuss technical details, let's schedule to meet while we are
 here for the OLS. I suggest the following for a start:
 
 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.
 
 Please confirm your participation.

Hey,

I've committed to dinner with another group.  I'm definitely up for
breakfast.  So I'll be at the 3rd-floor congress center mall entrance at
8:30am.

thanks,
-serge


Re: [Devel] Re: C/R minisummit notes

2008-07-23 Thread Denis V. Lunev
On Wed, 2008-07-23 at 14:55 -0400, Oren Laadan wrote:
 Hi,
 
 I've placed a somewhat more detailed summary on the wiki:
 http://wiki.openvz.org/Containers/Mini-summit_2008_notes
 
 (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008)
 
 To further discuss technical details, let's schedule to meet while we are
 here for the OLS. I suggest the following for a start:
 
 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
+1

I think that we could meet at the registration desk. Any objections?

 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.
+1

Regards,
Den



[Devel] Re: C/R minisummit notes

2008-07-23 Thread Daniel Lezcano
Oren Laadan wrote:
 Hi,
 
 I've placed a somewhat more detailed summary on the wiki:
 http://wiki.openvz.org/Containers/Mini-summit_2008_notes
 
 (also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008)
 
 To further discuss technical details, let's schedule to meet while we are
 here for the OLS. I suggest the following for a start:
 
 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.
 
 Please confirm your participation.

Benjamin, Dave and I will come.
Pavel, Denis and Andrey will come too.

I am trying to get the Kerrighed guys to come too.

We will meet in the hall of the congress center at 7:00pm.






[Devel] Re: C/R minisummit notes

2008-07-23 Thread Serge E. Hallyn
Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 
   * What are the problems that the Linux community can solve with
 checkpoint/restart?

   Eric Biederman reminds us that at the previous OLS nobody complained about
 checkpoint/restart

   Pavel Emelianov : The startup of Oracle takes some minutes; if we
 checkpoint just after the startup, Oracle can be restarted from this
 point later, providing fast startup

   Oren Laadan : Time travel, we can take monotonic snapshots and go back to
 one of these snapshots.

   Eric Biederman : Priority running, checkpoint/kill an application and
 run another application with a higher priority

   Denis Lunev : Task migration, move an application from one host to another
 host

   Daniel Lezcano : SSI (task migration)

   * Preparing the kernel internals

   OL : Can we implement a kernel module and move CR functionality into
 the kernel itself later?

   EB : Better to add a little CR functionality into the kernel itself
 and add more later.

   DLu : Problem with kernel versions

   OL : Compatibility with intermediate kernel versions should be possible
 with userspace conversion tools

   DLu : A non-sequential file for the checkpoint statefile is a challenge

   OL : Yes, but possible and useful for compression/encryption

   We showed that there are five steps to realize a checkpoint:
 
   1 - Pre-dump

I'd just add here that the pre-dump is where you might start writing
memory to disk, trying to get disk and memory closer and closer to
being the same until, at some point, you decide they are close enough
that you can go on to step two, and attempt the freeze+dump+migrate/kill
with minimal downtime.

Coming into the discussion my primary concern had been that doing a
sys_checkpoint() system call would be tough to augment to provide this
kind of incremental checkpoint, but this breakdown is great for that.

   2 - Freeze
   3 - Dump
   4 - Resume/kill
   5 - Post-dump
 
   At this point we state we want to create a proof of concept and
 checkpoint/restart the simplest application.

By which we mean, start with a piece of step 3 (and maybe a bit of
step 4).

Step 2 was pretty widely accepted to be the freezer subsystem, but
no one seemed to be quite sure what the status of that was.

Matt, can you remind us how the freezer cgroup is doing?
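
(For reference, a sketch of driving the freezer cgroup from userspace; the
mount point /cgroup/freezer and the group name are made up, and FROZEN/THAWED
are assumed to be the states the freezer cgroup patches expose:)

/* Sketch only: freeze / thaw a container via the freezer cgroup. */
#include <stdio.h>

static int freezer_set(const char *group, const char *state)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "/cgroup/freezer/%s/freezer.state", group);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(state, f);        /* "FROZEN" to freeze, "THAWED" to resume */
    return fclose(f);
}

int main(void)
{
    if (freezer_set("mycontainer", "FROZEN") != 0)   /* made-up group name */
        perror("freeze container");
    return 0;
}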

   We will iteratively add more and more kernel resources.

   Process hierarchy created from the kernel or from userspace?

   OL : Seems better to send a chunk of data to the kernel and have it restore
 the process hierarchy
   PE : Agreed
   OL : We should be able to checkpoint from inside the container, keep
 that in mind for later.

   => we need a syscall or an ioctl

   The first items to address before implementing the checkpoint are:
   1 - Make a container object (the context)
   2 - Freeze the container (extend the cgroup freezer?)
   3 - syscall | ioctl

   First step:
   * simplest application : a single process, with no files, no
 checkpoint of the text file (same file system for restart), no signals, no
 syscalls in the application, no ipc/msgq, no network

   Second step:
   * multiple processes + zombie state

   Third step:
   * files, pipes, signals, socketpair?

   This proof of concept must come with documentation describing what is
 supported, what is not supported and what we plan to do.

And there was talk of making sure that if you attempt to checkpoint an
app using unsupported resources, we return -EAGAIN.  There had been
murmurings about giving more meaningful feedback, but I have no idea
what that would look like.
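
(To make the error handling concrete, a sketch of how a caller might drive
such a syscall; the syscall number and the (pid, fd, flags) signature are
placeholders, since neither had been decided at this point:)

/* Sketch only: invoke a hypothetical sys_checkpoint() and handle -EAGAIN. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define __NR_checkpoint 333   /* hypothetical, not a real syscall number */

static long checkpoint(pid_t pid, int fd, unsigned long flags)
{
    return syscall(__NR_checkpoint, pid, fd, flags);
}

int main(void)
{
    int fd = open("ckpt.img", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (checkpoint(1234 /* container init pid */, fd, 0) < 0) {
        if (errno == EAGAIN)
            fprintf(stderr, "container uses a resource we cannot checkpoint yet\n");
        else
            perror("checkpoint");
        return 1;
    }
    return 0;
}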

-serge


[Devel] Re: C/R minisummit notes

2008-07-23 Thread Oren Laadan


Serge E. Hallyn wrote:
 Quoting Daniel Lezcano ([EMAIL PROTECTED]):
   * What are the problems that the linux community can solve with the 
 checkpoint/restart ?

  Eric Biederman reminds at the previous OLS nobody complained about the 
 checkpoint/restart

  Pavel Emelianov : The startup of Oracle takes some minutes; if we
 checkpoint just after the startup, Oracle can be restarted from this 
 point later and provide fast startup

  Oren Laadan : Time travel, we can take monotonic snapshots and go back to
 one of these snapshots.

  Eric Biederman : Priority running, checkpoint/kill an application and
 run another application with a higher priority

  Denis Lunev : Task migration, move application on one host to another 
 host

  Daniel Lezcano : SSI (task migration)

   * Preparing the kernel internals

  OL : Can we implement a kernel module and move CR functionality into 
 the kernel itself later ?

  EB : Better to add a little CR functionnality into the kernel itself 
 and add more after.

  DLu : Problem with kernel version

  OL : Compatibility with intermediate kernel version should be possible 
 with userspace conversion tools

  DLu : Non sequential file for checkpoint statefile is a challenge

  OL : yes, but possible and useful for compression/encryption

  We showed that there are five steps to realize a checkpoint:

  1 - Pre-dump
 
 I'd just add here that the pre-dump is where you might start writing
 memory to disk, trying to get disk and memory closer and closer to
 being the same until, at some point, you decide they are close enough
 that you can go on to step two, and attempt the freeze+dump+migrate/kill
 with minimal downtime.
 
 Coming into the discussion my primary concern had been that doing a
 sys_checkpoint() system call would be tough to augment to provide this
 kind of incremental checkpoint, but this breakdown is great for that.
 
  2 - Freeze
  3 - Dump
  4 - Resume/kill
  5 - Post-dump

  At this point we state we want create a proof of concept and 
 checkpoint/restart the simplest application.
 
 By which we mean, start with a piece of step 3 (and maybe a bit of
 step 4).

step 4 is also part of the freezer -- it's the unfreeze operation
(or force a SIGKILL to all processes in the container).

 
 Step 2 was pretty widely accepted to be the freezer subsystem, but
 noone seemed to be sure quite what the status of that was.
 
 Matt, can you remind us how the freezer cgroup is doing?
 
  We will add iteratively more and more kernel resources.

  Process hierarchy created from kernel or userspace ?

  OL : Seems better to send a chunk of data to kernel and that restores 
 the processes hierarchy
  PE : Agreed
  OL : We should be able to checkpoint from inside the container, keep 
 that in mind for later.
  
  = we need a syscall or a ioctl

  The first items to address before implementing the Checkpoint are:
  1 - Make a container object (the context)
  2 - Freeze the container (extend cgroup freezer ?)
  3 - syscall | ioctl

  First step:
  * simplest application : A single process, without any file, no 
 checkpoint of text file (same file system for restart), no signals, no 
 syscall in the application, no ipc/no msgq, no network

  Second step:
  * multiple processes + zombie state

  Third step:
  * files, pipe, signals, socketpair ?

  This proof of concept must came with a documentation describing what is 
 supported, what is not supported and what we plan to do.
 
 And there was talk of making sure that if you attempt to checkpoint an
 app using unsupported resources, we return -EAGAIN.  There had been
 murmurings about giving more meaningful feedback, but I have no idea
 what that would look like.

Yes, some of it is mentioned in the notes that I put in the wiki.


 
 -serge


[Devel] Re: C/R minisummit notes

2008-07-23 Thread sukadev
Oren Laadan [EMAIL PROTECTED] wrote:
| 
| 
| Serge E. Hallyn wrote:
|  Quoting Daniel Lezcano ([EMAIL PROTECTED]):
|* What are the problems that the linux community can solve with the 
|  checkpoint/restart ?
| 
| Eric Biederman reminds at the previous OLS nobody complained about the 
|  checkpoint/restart
| 
| Pavel Emylianov : The startup of Oracle takes some minutes, if we 
|  checkpoint just after the startup, Oracle can be restarted from this 
|  point later and provide fast startup
| 
| Oren Laaden : Time travel, we can do monotonic snapshot and go back on 
|  one of this snaphost.
| 
| Eric Biedreman : Priority running, checkpoint/kill an application and 
|  run another application with a bigger priority
| 
| Denis Lunev : Task migration, move application on one host to another 
host
| 
| Daniel Lezcano : SSI (task migration)
| 
|* Preparing the kernel internals
| 
| OL : Can we implement a kernel module and move CR functionality into 
|  the kernel itself later ?
| 
| EB : Better to add a little CR functionnality into the kernel itself 
|  and add more after.
| 
| DLu : Problem with kernel version
| 
| OL : Compatibility with intermediate kernel version should be possible 
|  with userspace conversion tools
| 
| DLu : Non sequential file for checkpoint statefile is a challenge
| 
| OL : yes, but possible and useful for compression/encryption
| 
| We showed that there are five steps to realize a checkpoint:
| 
| 1 - Pre-dump
|  
|  I'd just add here that the pre-dump is where you might start writing
|  memory to disk, trying to get disk and memory closer and closer to
|  being the same until, at some point, you decide they are close enough
|  that you can go on to step two, and attempt the freeze+dump+migrate/kill
|  with minimal downtime.
|  
|  Coming into the discussion my primary concern had been that doing a
|  sys_checkpoint() system call would be tough to augment to provide this
|  kind of incremental checkpoint, but this breakdown is great for that.
|  
| 2 - Freeze
| 3 - Dump
| 4 - Resume/kill
| 5 - Post-dump
| 
| At this point we state we want create a proof of concept and 
|  checkpoint/restart the simplest application.
|  
|  By which we mean, start with a piece of step 3 (and maybe a bit of
|  step 4).
| 
| step 4 is also part of the freezer -- it's the unfreeze operation
| (or force a SIGKILL to all processes in the container).

Are steps 1-5 considered part of the sys_checkpoint() system call, and
does a successful sys_checkpoint() return after step 5?

If so, as Serge points out, it would be harder to optimize for
incremental checkpoints (as each sys_checkpoint() would be independent)?

But that may not be something to worry about for the POC.


[Devel] Re: C/R minisummit notes

2008-07-23 Thread Serge E. Hallyn
Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
 Oren Laadan [EMAIL PROTECTED] wrote:
 | 
 | 
 | Serge E. Hallyn wrote:
 |  Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 |* What are the problems that the linux community can solve with the 
 |  checkpoint/restart ?
 | 
 |   Eric Biederman reminds at the previous OLS nobody complained about the 
 |  checkpoint/restart
 | 
 |   Pavel Emylianov : The startup of Oracle takes some minutes, if we 
 |  checkpoint just after the startup, Oracle can be restarted from this 
 |  point later and provide fast startup
 | 
 |   Oren Laaden : Time travel, we can do monotonic snapshot and go back on 
 |  one of this snaphost.
 | 
 |   Eric Biedreman : Priority running, checkpoint/kill an application and 
 |  run another application with a bigger priority
 | 
 |   Denis Lunev : Task migration, move application on one host to another 
 host
 | 
 |   Daniel Lezcano : SSI (task migration)
 | 
 |* Preparing the kernel internals
 | 
 |   OL : Can we implement a kernel module and move CR functionality into 
 |  the kernel itself later ?
 | 
 |   EB : Better to add a little CR functionnality into the kernel itself 
 |  and add more after.
 | 
 |   DLu : Problem with kernel version
 | 
 |   OL : Compatibility with intermediate kernel version should be possible 
 |  with userspace conversion tools
 | 
 |   DLu : Non sequential file for checkpoint statefile is a challenge
 | 
 |   OL : yes, but possible and useful for compression/encryption
 | 
 |   We showed that there are five steps to realize a checkpoint:
 | 
 |   1 - Pre-dump
 |  
 |  I'd just add here that the pre-dump is where you might start writing
 |  memory to disk, trying to get disk and memory closer and closer to
 |  being the same until, at some point, you decide they are close enough
 |  that you can go on to step two, and attempt the freeze+dump+migrate/kill
 |  with minimal downtime.
 |  
 |  Coming into the discussion my primary concern had been that doing a
 |  sys_checkpoint() system call would be tough to augment to provide this
 |  kind of incremental checkpoint, but this breakdown is great for that.
 |  
 |   2 - Freeze
 |   3 - Dump
 |   4 - Resume/kill
 |   5 - Post-dump
 | 
 |   At this point we state we want create a proof of concept and 
 |  checkpoint/restart the simplest application.
 |  
 |  By which we mean, start with a piece of step 3 (and maybe a bit of
 |  step 4).
 | 
 | step 4 is also part of the freezer -- it's the unfreeze operation
 | (or force a SIGKILL to all processes in the container).
 
 Are steps 1-5 considered part of the sys_checkpoint() system call and
 if successful sys_checkpoint() returns after step 5 ?
 
 If so, like Serge points out, it would be harder to optimize for
 incremental checkpoints (as each sys_checkpoint() would be independent) ?

No no, the idea (IIUC) is that if you want to do a very short-downtime
migrate, you stay in step 1 for a long time, writing the container
memory to disk, checking how different the disk img is from the memory
image, updating the version on disk, checking again, etc.  Then when
you decide that the disk and memory are very close together, you
quickly do steps 2-4, where 4 in this case is kill.  In the meantime
you would have been loading the disk data into memory ahead of time
at the new machine, so you can also quickly complete the restart.

So 3, 'Dump', in this case really becomes dump the metadata and any
more changes that have happened.  Presumably, if when you get to 3,
you find that there was suddenly a lot of activity and there is too
much data to write quickly, you bail on the migrate and step 4 is
a resume rather than kill.  Then you start again at step 1.

At least that was my understanding.
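
(A compile-only sketch of that control flow; every helper and the dirty
threshold below are hypothetical stand-ins for whatever kernel interface
eventually exists, only the loop itself is the point:)

/* Sketch only: the iterative pre-dump ("pre-copy") loop described above.
 * The helpers are stubbed out so the skeleton compiles. */
#include <stdio.h>

#define DIRTY_THRESHOLD (4UL << 20)    /* "close enough": 4 MB left to copy */

/* Placeholder: write dirty container memory to the image, return bytes still dirty. */
static unsigned long pre_dump(int container)    { (void)container; return 0; }
/* Placeholder: freeze all tasks in the container (e.g. via the freezer cgroup). */
static void freeze(int container)               { (void)container; }
/* Placeholder: dump metadata plus the remaining delta, return bytes written. */
static unsigned long dump_delta(int container)  { (void)container; return 0; }
/* Placeholder: thaw the container and abandon this migration attempt. */
static void resume(int container)               { (void)container; }
/* Placeholder: kill the container once the image is complete. */
static void kill_container(int container)       { (void)container; }

static void migrate(int container)
{
    /* Step 1: keep pre-dumping until memory and the on-disk image converge. */
    while (pre_dump(container) > DIRTY_THRESHOLD)
        ;

    /* Steps 2-4: freeze, dump the remaining delta, then kill -- or bail out. */
    freeze(container);
    if (dump_delta(container) <= DIRTY_THRESHOLD)
        kill_container(container);   /* migration succeeded: step 4 = kill */
    else
        resume(container);           /* too much changed; start over at step 1 */
}

int main(void)
{
    migrate(42 /* hypothetical container id */);
    printf("migration attempt finished\n");
    return 0;
}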

 But may not be something to worry about for POC.