[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-12 Thread Joseph Ruscio

On Aug 5, 2008, at 9:23 AM, Dave Hansen wrote:

 How would you propose making it modular?

Dave,

What about re-using the madvise() interface for this, adding a flag along
the lines of MADV_DONTCHECKPOINT? I could probably work up a patch against
Oren's that removes these ranges from the checkpoint, if people think
that's feasible.

thanks,
Joe
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-12 Thread Joseph Ruscio

On Aug 5, 2008, at 9:20 AM, Oren Laadan wrote:



 Louis Rilling wrote:
 On Mon, Aug 04, 2008 at 08:51:37PM -0700, Joseph Ruscio wrote:
 As somewhat of a tangent to this discussion, I've been giving some
 thought to the general strategy we talked about during the summit. The
 checkpointing solution we built at Evergrid sits completely in
 userspace and is solely focused on checkpointing parallel codes (e.g.
 MPI). That approach required us to virtualize a whole slew of
 resources (e.g. PIDs) that will be far better supported in the kernel
 through this effort. On the other hand, there isn't anything inherent
 to checkpointing the memory in a process that requires it to be in a
 kernel. During a restart, you can map and load the memory from the
 checkpoint file in userspace as easily as in the kernel. Since the
 cost of checkpointing HPC codes is

 Hmm, for unusual mappings this may not be so easy to reproduce from
 userspace if binaries are statically linked. I agree that with
 dynamically linked applications, LD_PRELOAD allows one to record the
 actual memory mappings and restore them at restart.

 I second that: unusual mappings can be hard to reproduce.

 Besides, several important optimizations are difficult to do in
 user-space, if at all possible:

 * detecting sharing (unless the application itself gives the OS
 advice - more on this below); in the kernel, this is detected easily
 using the inode that represents a shared memory region in SHMFS


 * detecting (and restoring) COW sharing: process A forks process B, so
 at least initially the private memory of both is the same via COW;
 this can be optimized to save the memory of only one instead of both,
 and restore this COW relationship on restart.

Both of these are possible from userspace, but admittedly more
complicated. Also agree that statically linked binaries are not really
feasible in user-space.


 * reducing checkpoint downtime using the COW technique that I
 described at the summit: when processes are frozen, mark all dirty
 pages COW and keep a reference, and write back the contents only after
 the container is unfrozen.

Our user-space implementation already has a complete concurrent (i.e.
COW) checkpointing implementation where the freeze period lasts only the
length of time it takes to mprotect() the allocated memory regions. So I
don't necessarily agree that these optimizations require kernel access.

 Eh... and, yes, live migration :)

User-space live migration of a batch process, e.g. one taking part in an
MPI job, is quite trivial. User-space live migration of something like a
database is not that hard, assuming you have a cooperative load balancer
or proxy on the front end.

I'm not advocating for implementing this in user-space. I am in complete
agreement that this effort should result in code that completely
checkpoints a Container in the kernel. My question was whether there are
situations where it would be advantageous for user-space to have the
option of instructing/hinting the kernel to ignore certain resources
that it would handle itself. Most of the use-cases I'm thinking of come
from the different styles of implementations I've seen in the HPC space,
where our implementation (and a lot of others) is focused.

MPI codes require coordination between all the different processes
taking part to ensure that the checkpoints are globally consistent. MPI
implementations that run on hardware such as Infiniband would most
likely want the container checkpointing to ignore all of the pinned
memory associated with the RDMA operations, so that the coordination
and recreation of MPI communicator state could be handled in
user-space. When working with inflexible process checkpointers, MPI
coordination routines often must completely tear down all communicator
state prior to invoking the checkpoint, and then recreate all the
communicators after the checkpoint. On very large scale jobs, this is
expensive.

As another example, HPC applications can create local scratch files of
several GB in /tmp. It may not be necessary to migrate these files, but
if user-space has no way to mark a particular file, local files, or
files in general as being ignored, then we'll have to copy these during
a migration or a checkpoint.

I don't suppose anyone is attending Linuxworld in San Francisco this
week? I'd be more than happy to grab a coffee and talk about some of
this. I stopped by the OpenVZ booth but none of the devs were around.

thanks,
Joe


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-08 Thread Dave Hansen
On Fri, 2008-08-08 at 10:20 -0700, Joseph Ruscio wrote:
 On Aug 5, 2008, at 9:23 AM, Dave Hansen wrote:
  How would you propose making it modular?

 What about re-using the madvise() interface for this, adding a flag
 along the lines of MADV_DONTCHECKPOINT? I could probably work up a
 patch against Oren's that removes these ranges from the checkpoint, if
 people think that's feasible.

Seems reasonable, but I think it is jumping the gun a little bit.  There
are plenty of features that will get us quicker, more efficient
checkpoints, but let's get *some* checkpointing in the kernel, first. :)

-- Dave



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-07 Thread Louis Rilling
On Wed, Aug 06, 2008 at 08:41:10AM -0700, Joseph Ruscio wrote:

 On Aug 5, 2008, at 9:20 AM, Oren Laadan wrote:
 Eh... and, yes, live migration :)

 User-space live migration of a batch process, e.g. one taking part in
 an MPI job, is quite trivial. User-space live migration of something
 like a database is not that hard, assuming you have a cooperative load
 balancer or proxy on the front end.

Hm, this means modifying the MPI run-time, right? Especially the ones
relying on daemons on each node (like the LAM implementation and, IIRC,
the MPI-2 specification). Anyway, this is probably not an issue, since
most high-end HPC systems come with their own customized MPI
implementation.


 I'm not advocating for implementing this in user-space. I am in complete 
 agreement that this effort should result in code that completely 
 checkpoints a Container in the kernel. My question was whether there are 
 situations where it would be advantageous for user-space to have the 
 option of instructing/hinting the kernel to ignore certain resources that 
 it would handle itself. Most of the use-cases I'm thinking of come from 
 the different styles of implementations I've seen in the HPC space, where 
 our implementation (and a lot of others) are focused.

 MPI codes require coordination between all the different processes
 taking part to ensure that the checkpoints are globally consistent.
 MPI implementations that run on hardware such as Infiniband would most
 likely want the container checkpointing to ignore all of the pinned
 memory associated with the RDMA operations, so that the coordination
 and recreation of MPI communicator state could be handled in
 user-space. When working with inflexible process checkpointers, MPI
 coordination routines often must completely tear down all communicator
 state prior to invoking the checkpoint, and then recreate all the
 communicators after the checkpoint. On very large scale jobs, this is
 expensive.

 As another example, HPC applications can create local scratch files of
 several GB in /tmp. It may not be necessary to migrate these files,
 but if user-space has no way to mark a particular file, local files,
 or files in general as being ignored, then we'll have to copy these
 during a migration or a checkpoint.

Definitely agree with you here. This is the kind of use-case we will
study in Kerrighed. (Actually the project is centered on supporting a
petaflop-scale application, with help from Kerrighed to tolerate
failures.)


 I don't suppose anyone is attending Linuxworld in San Francisco this
 week? I'd be more than happy to grab a coffee and talk about some of
 this. I stopped by the OpenVZ booth but none of the devs were around.

Not me, sorry :) However, whatever requirements you can describe are
interesting to us. They can surely help in designing the most useful
checkpoint/restart mechanism.

Thanks,

Louis

-- 
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes




[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-07 Thread Louis Rilling
On Wed, Aug 06, 2008 at 09:15:46AM -0700, Joseph Ruscio wrote:

 On Aug 5, 2008, at 9:23 AM, Dave Hansen wrote:

 On Mon, 2008-08-04 at 20:51 -0700, Joseph Ruscio wrote:
 It might be desirable for the checkpointing implementation to be
 modular enough that a userspace application or library could select to
 handle certain resources on their own. Memory is the primary one that
 comes to mind.

 How would you propose making it modular?

 -- Dave



 Well, it seems to me that the initial focus here is on live migration
 of traditional enterprise applications, e.g. databases, app-servers,
 etc. I think this is the right focus given how much utility the
 general enterprise is finding in capabilities like VMotion. Providing
 this mobility to applications without the overhead of traditional VMs
 would be very valuable.

 On the other hand, I've been primarily focused on checkpointing
 large-scale MPI jobs to provide fault tolerance, and that use-case is
 somewhat different than the live-migration one. These checkpoints have
 huge RAM footprints (in-core checkpointing is not an option), require
 coordination across large numbers of servers, some number of open
 files on an enormous parallel filesystem, and some scratch files open
 on the local disk/ramdisk. They generally have very simple process
 trees with one process per core, or one process with a thread for each
 core.

 To support these kinds of jobs, one would ideally instruct the
 Container checkpointer to ignore network resources, dynamically
 allocated private memory, and the contents of open files. You'd be
 relying on the Container checkpointer to recreate processes, open file
 descriptors, threads, thread synchronization primitives, and IPC
 mechanisms (including shm).

 As far as the mechanism is concerned, I'd defer to the more
 experienced kernel developers here. I assume that passing a bitmask of
 flags as an argument into the checkpoint syscall would be frowned
 upon, and anyway redundant, as it's unlikely that the mask would
 change within a container from checkpoint to checkpoint. If each
 container is going to have a CGroup filesystem directory, then we
 could have file(s) along the lines of
 /proc/sys/kernel/randomize_va_space that turn features off for that
 Container. The default settings after Container creation would be a
 complete in-kernel checkpoint/migration.

Have you thought about mechanisms/interfaces that would let the kernel's
checkpointing sub-system and the application/run-time interact to
efficiently build the checkpoint image and restart from it?

Louis

-- 
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes




[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-05 Thread Louis Rilling
On Mon, Aug 04, 2008 at 08:51:37PM -0700, Joseph Ruscio wrote:
 As somewhat of a tangent to this discussion, I've been giving some
 thought to the general strategy we talked about during the summit. The
 checkpointing solution we built at Evergrid sits completely in
 userspace and is solely focused on checkpointing parallel codes (e.g.
 MPI). That approach required us to virtualize a whole slew of
 resources (e.g. PIDs) that will be far better supported in the kernel
 through this effort. On the other hand, there isn't anything inherent
 to checkpointing the memory in a process that requires it to be in a
 kernel. During a restart, you can map and load the memory from the
 checkpoint file in userspace as easily as in the kernel. Since the
 cost of checkpointing HPC codes is

Hmm, for unusual mappings this may not be so easy to reproduce from
userspace if binaries are statically linked. I agree that with
dynamically linked applications, LD_PRELOAD allows one to record the
actual memory mappings and restore them at restart.

 fairly dominated by checkpointing their large memory footprints, memory 
 checkpointing is an area of ongoing research with many different 
 solutions.

 It might be desirable for the checkpointing implementation to be modular 
 enough that a userspace application or library could select to handle 
 certain resources on their own. Memory is the primary one that comes to 
 mind.

I definitely agree with you about this flexibility. Actually in
Kerrighed, during the next 3 years, we are going to study an API for
collaborative checkpoint/restart between kernel and userspace, in order
to allow such HPC apps to checkpoint huge memory efficiently (e.g. when
reaching states where saving small parts is enough), or to rebuild
their data from partial/older states.
I hope that this study will bring useful ideas that could be applied to
containers as well.

Thanks,

Louis

-- 
Dr Louis Rilling                Kerlabs - IRISA
Skype: louis.rilling            Campus Universitaire de Beaulieu
Phone: (+33|0) 2 99 84 71 52    Avenue du General Leclerc
Fax: (+33|0) 2 99 84 71 71      35042 Rennes CEDEX - France
http://www.kerlabs.com/


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-05 Thread Louis Rilling
On Mon, Aug 04, 2008 at 10:37:20PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
 I actually wasn't thinking of streaming a series of incremental checkpoints
 (from base and on) to implement migration... I simply didn't have a use-case
 for that :)

 This could be useful, however. Since incremental checkpointing is
 faster, it could reduce downtime.

 Naturally incremental checkpoint reduces downtime; however, since each
 checkpoint is taken at a different time, they can be streamed --
 transferred over the network -- as they are taken. This gives more
 flexibility and can still, if you wish, easily be transformed to a
 single long stream.

 Actually, this is a good argument in favor of using multiple files:
 they are a more flexible approach and can always be easily transformed
 to a single long stream, while the reverse isn't so easy.

Yes, the reverse is as easy: rebuilding a full checkpoint of a given id
#id consists simply of removing the records that are tagged as invalid
as of checkpoints having ids <= #id. This is actually what restart
should do :)

 The point is that you need previous data when building an incremental
 checkpoint, so you will read it at least. And since it was previously 
 stored (in
 The scheme that I described above and is implemented in Zap does not require
 access to previous checkpoints when building a new incremental checkpoint.
 Instead, you keep some data structure in the kernel that describes the 
 pieces
 that you need to carry with you (what pages were saved, and where; when a 
 task
 exits, the data describing its mm will be discarded, of course, and so on).

 This is because you probably decided that a mechanism in the kernel
 that saves storage space was not interesting if it does not improve
 speed. As a consequence you need to keep metadata in kernel memory in
 order to do incremental checkpoints. Maybe saving storage space
 without considering speed could equally be done from userspace with
 some sort of checkpoint diff tool that would create an incremental
 checkpoint 2' from two full checkpoints 1 and 2.

 Good point. In fact, the metadata is not only kept in memory, but also
 saved with each incremental checkpoint (well, its version at
 checkpoint time), so that restart would know where to find older data.
 So it is already transferred to user space; we may as well provide the
 option to keep it only in user space.

That is, userspace should give it back to the kernel before doing the
next incremental checkpoint?

Louis

-- 
Dr Louis Rilling                Kerlabs - IRISA
Skype: louis.rilling            Campus Universitaire de Beaulieu
Phone: (+33|0) 2 99 84 71 52    Avenue du General Leclerc
Fax: (+33|0) 2 99 84 71 71      35042 Rennes CEDEX - France
http://www.kerlabs.com/


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-05 Thread Oren Laadan


Louis Rilling wrote:
 On Mon, Aug 04, 2008 at 08:51:37PM -0700, Joseph Ruscio wrote:
 As somewhat of a tangent to this discussion, I've been giving some
 thought to the general strategy we talked about during the summit. The
 checkpointing solution we built at Evergrid sits completely in
 userspace and is solely focused on checkpointing parallel codes (e.g.
 MPI). That approach required us to virtualize a whole slew of
 resources (e.g. PIDs) that will be far better supported in the kernel
 through this effort. On the other hand, there isn't anything inherent
 to checkpointing the memory in a process that requires it to be in a
 kernel. During a restart, you can map and load the memory from the
 checkpoint file in userspace as easily as in the kernel. Since the
 cost of checkpointing HPC codes is
 
 Hmm, for unusual mappings this may not be so easy to reproduce from
 userspace if binaries are statically linked. I agree that with
 dynamically linked applications, LD_PRELOAD allows one to record the
 actual memory mappings and restore them at restart.

I second that: unusual mappings can be hard to reproduce.

Besides, several important optimizations are difficult to do in
user-space, if at all possible:

* detecting sharing (unless the application itself gives the OS advice -
more on this below); in the kernel, this is detected easily using the
inode that represents a shared memory region in SHMFS

* detecting (and restoring) COW sharing: process A forks process B, so at
least initially the private memory of both is the same via COW; this can be
optimized to save the memory of only one instead of both, and restore this
COW relationship on restart.

* reducing checkpoint downtime using the COW technique that I described
at the summit: when processes are frozen, mark all dirty pages COW and
keep a reference, and write back the contents only after the container
is unfrozen.

Eh... and, yes, live migration :)

 
 fairly dominated by checkpointing their large memory footprints, memory 
 checkpointing is an area of ongoing research with many different 
 solutions.

 It might be desirable for the checkpointing implementation to be modular 
 enough that a userspace application or library could select to handle 
 certain resources on their own. Memory is the primary one that comes to 
 mind.
 
 I definitely agree with you about this flexibility. Actually in
 Kerrighed, during the next 3 years, we are going to study an API for
 collaborative checkpoint/restart between kernel and userspace, in order to
 allow such HPC apps to checkpoint huge memory efficiently (eg. when reaching
 states where saving small parts is enough), or to rebuild their data from
 partial/older states.
 I hope that this study will bring useful ideas that could be applied to
 containers as well.

Indeed it would add flexibility if an interface exists. One example is for
network connections in the case of a distributed MPI application, or if a
specific (otherwise unsupported for CR) device is involved.

As for memory, a clever way to hint the system about what parts of
memory are important is to use something like madvise() with a new
flag, to mark areas of interest/dis-interest. Throw in a mechanism to
notify tasks (who request to be notified) of an upcoming checkpoint,
the end of a successful checkpoint, and the completion of a successful
restart - and you've got it all.

Oren.

 
 Thanks,
 
 Louis
 


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-05 Thread Dave Hansen
On Mon, 2008-08-04 at 20:51 -0700, Joseph Ruscio wrote:
 It might be desirable for the checkpointing implementation to be  
 modular enough that a userspace application or library could select to  
 handle certain resources on their own. Memory is the primary one that  
 comes to mind.

How would you propose making it modular?

-- Dave



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-04 Thread Louis Rilling
On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:

Cut the less interesting (IMHO at least) history to make Dave happier ;)


 Returning 0 in case of a restart is what I called a special handling.
 You won't do this for the other tasks, so this is special. Since
 userspace must cope with it anyway, userspace can be clever enough to
 avoid using the fd on restart, or stupid enough to destroy its
 checkpoint after restart.

 It's a different special handling :)  In the case of a single task
 that wants to checkpoint itself - there are no other tasks. In the
 case of a container - there will be only a single task that calls
 sys_checkpoint(), so only that task will either get the CRID or the 0
 (or an error). The other tasks will resume whatever it was that they
 were doing (lol, assuming of course restart works).

 So this special handling ends up being a two-liner: setting the return
 value of the syscall for the task that called sys_checkpoint() (well,
 actually it will call sys_restart() to restart, and return from
 sys_checkpoint() with a value of 0 ...).

I knew it, since I actually saw it in the patches you sent last week.


 If you use an FD, you will have to checkpoint that resource as part of the
 checkpoint, and restore it as part of the restart. In doing so you'll need
 to specially handle it, because it has a special meaning. I agree, of course,
 that it is feasible.


 - Userspace makes fewer errors when managing incremental checkpoints.
 have you implemented this ?  did you experience issues in real life ?  user
 space will need a way to manage all of it anyway in many aspects. This will
 be the last/least of the issues ...

 No, it was not implemented, and I'm not going to enter a discussion
 about the weight of arguments depending on whether they are backed by
 implementations or not. It just becomes easier to create a mess with
 things depending on each other created as separate, freely
 (userspace-decided)-named objects.

 If I were to write a user-space tool to handle this, I would keep each
 chain of checkpoints (from base and on) in a separate subdir, for
 example. In fact, that's how I did it :)

This is intuitive indeed. Checkpoints are already organized in a similar way in
Kerrighed, except that a notion of application (transparent to applications)
replaces the notion of container, and the kernel decides where to put the
checkpoints and how they are named (I'm not saying that this is the best
way though).

 Besides, this scheme begins to sound much more complex than a single file.
 Do you really gain so much from not having multiple files, one per 
 checkpoint ?

 Well, at least you are not limited by the number of open file descriptors
 (assuming that, as you mentioned earlier, you pass an array of previous 
 images
 to compute the next incremental checkpoint).

 You aren't limited by the number of open files. User space could
 provide an array of (CRID, pathname) or (serial#, pathname) pairs to
 the kernel, and the kernel will access the files as necessary.

But the kernel itself would have to cope with this limit (even if it is
not enforced, just to avoid consuming too much resources), or close and
reopen files when needed...


 Uhh .. hold on: you need the array of previous checkpoints to
 _restart_ from an incremental checkpoint. You don't care about it when
 you checkpoint: instead, you keep track in memory of (1) what changed
 (e.g. which pages were touched), and (2) where to find unmodified
 pages in previous checkpoints. You save this information with each new
 checkpoint. The data structure to describe #2 is dynamic and changes
 with the execution, and easily keeps track of when older checkpoint
 images become irrelevant (because all the pages they hold have been
 overwritten already).

I see. I thought that you also intended to build incremental
checkpoints from previous checkpoints only, because even if this is not
fast, it saves storage space. I agree that if you always keep the
necessary metadata in kernel memory, you don't need the previous
images. Actually I don't know of any incremental checkpoint scheme not
using such an in-memory metadata scheme. Which does not imply that
other schemes are not relevant though...



 where:
 - base_fd is a regular file containing the base checkpoint, or -1 if a
   full checkpoint should be done. The checkpoint could actually also
   live in memory, and the kernel should check that it matches the
   image pointed to by base_fd.
 - out_fd is whatever file/socket/etc. on which we should dump the
   checkpoint. In particular, out_fd can equal base_fd and should point
   to the beginning of the file if it's a regular file.
 Excellent example. What if the checkpoint data is streamed over the
 network, so you cannot rewrite the file after it has been streamed...
 Or you will

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-04 Thread Oren Laadan


Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:
 
 Cut the less interesting (IMHO at least) history to make Dave happier ;)
 
 Returning 0 in case of a restart is what I called a special handling. You 
 won't
 do this for the other tasks, so this is special. Since userspace must cope 
 with
 it anyway, userspace can be clever enough to avoid using the fd on restart, 
 or
 stupid enough to destroy its checkpoint after restart.
 It's a different special handling :)   In the case of a single task that
 wants
 to checkpoint itself - there are no other tasks.  In the case of a container 
 -
 there will be only a single task that calls sys_checkpoint(), so only that 
 task
 will either get the CRID or the 0 (or an error). The other tasks will resume
 whatever it was that they were doing (lol, assuming of course restart works).

 So this special handling ends up being a two-liner: setting the return
 value of the syscall for the task that called sys_checkpoint() (well, 
 actually
 it will call sys_restart() to restart, and return from sys_checkpoint() with
 a value of 0 ...).
 
 I knew it, since I actually saw it in the patches you sent last week.
 
 If you use an FD, you will have to checkpoint that resource as part of the
 checkpoint, and restore it as part of the restart. In doing so you'll need
 to specially handle it, because it has a special meaning. I agree, of course,
 that it is feasible.

 
 - Userspace makes less errors when managing incremental checkpoints.
 have you implemented this ?  did you experience issues in real life ?  user
 space will need a way to manage all of it anyway in many aspects. This will
 be the last/least of the issues ...
 No it was not implemented, and I'm not going to enter a discussion about the
 weight of arguments whether they are backed by implementations or not. It 
 just
 becomes easier to create a mess with things depending on each other created 
 as
 separate, freely (userspace-decided)-named objects.
 If I were to write a user-space tool to handle this, I would keep each chain
 of checkpoints (from base and on) in a separate subdir, for example. In 
 fact,
 that's how I did it :)
 
 This is intuitive indeed. Checkpoints are already organized in a similar way 
 in
 Kerrighed, except that a notion of application (transparent to applications)
 replaces the notion of container, and the kernel decides where to put the
 checkpoints and how they are named (I'm not saying that this is the best
 way though).
 
 Besides, this scheme begins to sound much more complex than a single file.
 Do you really gain so much from not having multiple files, one per 
 checkpoint ?
 Well, at least you are not limited by the number of open file descriptors
 (assuming that, as you mentioned earlier, you pass an array of previous 
 images
 to compute the next incremental checkpoint).
 You aren't limited by the number of open file. User space could provide an 
 array
 of CRID, pathname (or serial#, pathname) to the kernel, the kernel will
 access the files as necessary.
 
 But the kernel itself would have to cope with this limit (even if it is
 not enforced, just to avoid consuming too much resources), or close and
 reopen files when needed...

You got it - close and reopen as needed, with an LRU policy to decide
which open file to close. My experience so far is that you rarely need
more than 100 open files.

 
 Uhh .. hold on:  you need the array of previous checkpoint to _restart_ from
 an incremental checkpoint. You don't care about it when you checkpoint: 
 instead,
 you keep track in memory of (1) what changed (e.g. which pages were
 touched),
 and (2) where to find unmodified pages in previous checkpoints. You save this
 information with each new checkpoint.  The data structure to describe #2 is
 dynamic and changes with the execution, and easily keeps track of when older
 checkpoint images become irrelevant (because all the pages they hold have 
 been
 overwritten already).
 
 I see. I thought that you also intended to build incremental checkpoints
 from previous checkpoints only, because even if this is not fast, it
 saves storage space. I agree that if you always keep the necessary metadata
 in kernel memory, you don't need the previous images. Actually I don't
 know of any incremental checkpoint scheme not using such an in-memory
 metadata scheme. Which does not imply that other schemes are not relevant
 though...
 

 where:
 - base_fd is a regular file containing the base checkpoint, or -1 if a full
   checkpoint should be done. The checkpoint could actually also live in memory,
   and the kernel should check that it matches the image pointed to by base_fd.
 - out_fd is whatever file/socket/etc. on which we should dump the checkpoint.
   In particular, out_fd can equal base_fd and should 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-01 Thread Oren Laadan


Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 12:28:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
  +int do_checkpoint(struct cr_ctx *ctx)
  +{
  +	int ret;
  +
  +	/* FIX: need to test whether container is checkpointable */
  +
  +	ret = cr_write_hdr(ctx);
  +	if (!ret)
  +		ret = cr_write_task(ctx, current);
  +	if (!ret)
  +		ret = cr_write_tail(ctx);
  +
  +	/* on success, return (unique) checkpoint identifier */
  +	if (!ret)
  +		ret = ctx->crid;
 Does this crid have a purpose?
 yes, at least three; all are for the future, but it's important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of written to a file),
 then this will be the identifier with which the restart (or cleanup) would
 refer to the (in-memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as references to (cow-ed) pages. This data can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoints (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

...
crid = checkpoint(...);
switch (crid) {
case -1:
perror(checkpoint failed);
break;
default:
fprintf(stderr, checkpoint succeeded, CRID=%d\n, ret);
/* proceed with execution after checkpoint */
...
break;
case 0:
fprintf(stderr, returned after restart\n);
/* proceed with action required following a restart */
...
break;
}
...
 If I understand correctly, this crid can live for quite a long time. So many
 of them could be generated while some container would accumulate incremental
 checkpoints on, say, crid 5, and possibly crid 5 could be reused for another
 unrelated checkpoint during that time. This brings the issue of allocating
 crids reliably (using something like a pidmap for instance). Moreover, if such
 ids are exposed to userspace, we need to remember which ones are allocated
 across reboots and migrations.

 I'm afraid that this becomes too complex...
 And I'm afraid I didn't explain myself well. So let me rephrase:
 And I'm afraid I didn't explain myself well. So let me rephrase:

 CRIDs are always _local_ to a specific node. The local CRID counter is
 bumped (atomically) with each checkpoint attempt. The main use case is
 for when the checkpoint is kept in memory, either shortly (until it is
 written back to disk) or for a longer time (use-cases that want to keep
 it there). It only remains valid as long as the checkpoint image is
 still in memory and has not been committed to storage/network. Think
 of it as a way to identify the operation instance.

 So they can live quite a long time, but only as long as the original
 node is still alive and the checkpoint is still kept in memory. They
 are meaningless across reboots and migrations. I don't think a wrap
 around is a concern, but we can use 64 bits if that is the case.

 Finally, the incremental checkpoint use-case: imagine a container that
 is checkpointed regularly every minute. The first checkpoint will be
 a full checkpoint, say CRID=1. The second will be incremental with
 respect to the first, with CRID=2, and so on for the third and the fourth.
 Userspace could use these CRIDs to name the image files (for example,
 app.img.CRID). Assume that we decide (big if) that the convention is
 that the last part of the filename must be the CRID, and if we decide
 (another big if) to save the CRID as part of the checkpoint image --
 the part that describes the incremental nature of a new checkpoint.
 (That part would specify where to get state that wasn't really saved
 in the new checkpoint but instead can be retrieved from older ones.)
 If that was the case, then the logic in the kernel to find (and access)
 the actual files that hold the data would be fairly simple. Note that
 in this case the CRIDs are guaranteed to be unique per series of
 incremental checkpoints, and an incremental checkpoint is meaningless
 across reboots (and we can require that across migrations too).

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-01 Thread Louis Rilling
On Wed, Jul 30, 2008 at 02:27:52PM -0400, Oren Laadan wrote:
 
 
 Louis Rilling wrote:
   +/**
   + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
   + * @ctx - checkpoint context
   + * @pgarr - page-array to fill
   + * @vma - vma to scan
   + * @start - start address (updated)
   + */
   +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
   +		struct vm_area_struct *vma, unsigned long *start)
   +{
   +	unsigned long end = vma->vm_end;
   +	unsigned long addr = *start;
   +	struct page **pagep;
   +	unsigned long *addrp;
   +	int cow, nr, ret = 0;
   +
   +	nr = pgarr->nleft;
   +	pagep = &pgarr->pages[pgarr->nused];
   +	addrp = &pgarr->addrs[pgarr->nused];
   +	cow = !!vma->vm_file;
   +
   +	while (addr < end) {
   +		struct page *page;
   +
   +		/* simplified version of get_user_pages(): already have vma,
   +		 * only need FOLL_TOUCH, and (for now) ignore fault stats */
   +
   +		cond_resched();
   +		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
   +			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
   +			if (ret & VM_FAULT_ERROR) {
   +				if (ret & VM_FAULT_OOM)
   +					ret = -ENOMEM;
   +				else if (ret & VM_FAULT_SIGBUS)
   +					ret = -EFAULT;
   +				else
   +					BUG();
   +				break;
   +			}
   +			cond_resched();
   +		}
  
  I guess that 'ret' should be checked somewhere after this loop.
 
 yes; this is where a break(2) construct in C would come handy :)

Alternatively, putting the inner loop in a separate function often helps to
handle errors in a cleaner way.

 
  
   +
   +		if (IS_ERR(page)) {
   +			ret = PTR_ERR(page);
   +			break;
   +		}
   +
   +		if (page == ZERO_PAGE(0))
   +			page = NULL;	/* zero page: ignore */
   +		else if (cow && page_mapping(page) != NULL)
   +			page = NULL;	/* clean cow: ignore */
   +		else {
   +			get_page(page);
   +			*(addrp++) = addr;
   +			*(pagep++) = page;
   +			if (--nr == 0) {
   +				addr += PAGE_SIZE;
   +				break;
   +			}
   +		}
   +
   +		addr += PAGE_SIZE;
   +	}
   +
   +	if (unlikely(ret < 0)) {
   +		nr = pgarr->nleft - nr;
   +		while (nr--)
   +			page_cache_release(*(--pagep));
   +		return ret;
   +	}
   +
   +	*start = addr;
   +	return (pgarr->nleft - nr);
   +}
   +


   +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
   +{
   +	struct cr_hdr h;
   +	struct cr_hdr_mm *hh = ctx->tbuf;
   +	struct mm_struct *mm;
   +	struct vm_area_struct *vma;
   +	int ret;
   +
   +	h.type = CR_HDR_MM;
   +	h.len = sizeof(*hh);
   +	h.id = ctx->pid;
   +
   +	mm = get_task_mm(t);
   +
   +	hh->tag = 1;	/* non-zero will mean first time encounter */
   +
   +	hh->start_code = mm->start_code;
   +	hh->end_code = mm->end_code;
   +	hh->start_data = mm->start_data;
   +	hh->end_data = mm->end_data;
   +	hh->start_brk = mm->start_brk;
   +	hh->brk = mm->brk;
   +	hh->start_stack = mm->start_stack;
   +	hh->arg_start = mm->arg_start;
   +	hh->arg_end = mm->arg_end;
   +	hh->env_start = mm->env_start;
   +	hh->env_end = mm->env_end;
   +
   +	hh->map_count = mm->map_count;
   
   Some fields above should also be protected with mmap_sem, like ->brk,
   ->map_count, and possibly others (I'm not a memory expert though).
 
 true; keep in mind, though, that the container will be frozen during
 this time, so nothing should change at all. The only exception would
 be if, for instance, someone is killing the container while we save
 its state.

Sure. So you think that taking mm->mmap_sem below is useless? I tend to believe
so, since no other task should share this mm_struct at this time, and we could
state that ptrace should not interfere during restart. However, I'm never
confident when ptrace considerations come in...

 
  
   +
   +	/* FIX: need also mm->flags */
   +
   +	ret = cr_write_obj(ctx, &h, hh);
   +	if (ret < 0)
   +		goto out;
   +
   +	/* write the vma's */
   +	down_read(&mm->mmap_sem);
   +	for (vma = mm->mmap; vma; vma = vma->vm_next) {
   +		if ((ret = cr_write_vma(ctx, vma)) < 0)
   +			break;
   +	}
   +	up_read(&mm->mmap_sem);
   +
   +	if (ret < 0)
   +		goto out;
   +
   +	ret = cr_write_mm_context(ctx, mm);
   +
   + out:
   +	mmput(mm);
   +	return ret;
   +}
  

Thanks,

Louis

-- 
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Serge E. Hallyn
Quoting Louis Rilling ([EMAIL PROTECTED]):
 On Wed, Jul 30, 2008 at 10:40:35AM -0700, Dave Hansen wrote:
  On Wed, 2008-07-30 at 11:52 -0500, Serge E. Hallyn wrote:
   
   This list is getting on my nerves.  Louis, I'm sorry the threading
   is going to get messed up.
  
  I think I just cleared out the mime type filtering.
 
 Could the digital signature be the guilty part of my email?

Yeah, that was Dave's guess, and it seems likely.  Dave thinks he
unset whatever setting caused the bounce, so you should be fine to
keep the signatures in there.

thanks,
-serge


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 02:27:52PM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
  +/**
  + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
  + * @ctx - checkpoint context
  + * @pgarr - page-array to fill
  + * @vma - vma to scan
  + * @start - start address (updated)
  + */
  +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
  +		struct vm_area_struct *vma, unsigned long *start)
  +{
  +	unsigned long end = vma->vm_end;
  +	unsigned long addr = *start;
  +	struct page **pagep;
  +	unsigned long *addrp;
  +	int cow, nr, ret = 0;
  +
  +	nr = pgarr->nleft;
  +	pagep = &pgarr->pages[pgarr->nused];
  +	addrp = &pgarr->addrs[pgarr->nused];
  +	cow = !!vma->vm_file;
  +
  +	while (addr < end) {
  +		struct page *page;
  +
  +		/* simplified version of get_user_pages(): already have vma,
  +		 * only need FOLL_TOUCH, and (for now) ignore fault stats */
  +
  +		cond_resched();
  +		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
  +			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
  +			if (ret & VM_FAULT_ERROR) {
  +				if (ret & VM_FAULT_OOM)
  +					ret = -ENOMEM;
  +				else if (ret & VM_FAULT_SIGBUS)
  +					ret = -EFAULT;
  +				else
  +					BUG();
  +				break;
  +			}
  +			cond_resched();
  +		}
 I guess that 'ret' should be checked somewhere after this loop.
 yes; this is where a break(2) construct in C would come handy :)
 
 Alternatively, putting the inner loop in a separate function often helps to
 handle errors in a cleaner way.

Also true. I opted to keep it that way to keep the code as similar as
possible to get_user_pages().

Note that the logic could be optimized: instead of traversing the page
table once for each page, we could aggregate a few pages in each round.
I wanted to keep the code simple.

 
  +
  +		if (IS_ERR(page)) {
  +			ret = PTR_ERR(page);
  +			break;
  +		}
  +
  +		if (page == ZERO_PAGE(0))
  +			page = NULL;	/* zero page: ignore */
  +		else if (cow && page_mapping(page) != NULL)
  +			page = NULL;	/* clean cow: ignore */
  +		else {
  +			get_page(page);
  +			*(addrp++) = addr;
  +			*(pagep++) = page;
  +			if (--nr == 0) {
  +				addr += PAGE_SIZE;
  +				break;
  +			}
  +		}
  +
  +		addr += PAGE_SIZE;
  +	}
  +
  +	if (unlikely(ret < 0)) {
  +		nr = pgarr->nleft - nr;
  +		while (nr--)
  +			page_cache_release(*(--pagep));
  +		return ret;
  +	}
  +
  +	*start = addr;
  +	return (pgarr->nleft - nr);
  +}
  +
 +
 
 
  +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
  +{
  +	struct cr_hdr h;
  +	struct cr_hdr_mm *hh = ctx->tbuf;
  +	struct mm_struct *mm;
  +	struct vm_area_struct *vma;
  +	int ret;
  +
  +	h.type = CR_HDR_MM;
  +	h.len = sizeof(*hh);
  +	h.id = ctx->pid;
  +
  +	mm = get_task_mm(t);
  +
  +	hh->tag = 1;	/* non-zero will mean first time encounter */
  +
  +	hh->start_code = mm->start_code;
  +	hh->end_code = mm->end_code;
  +	hh->start_data = mm->start_data;
  +	hh->end_data = mm->end_data;
  +	hh->start_brk = mm->start_brk;
  +	hh->brk = mm->brk;
  +	hh->start_stack = mm->start_stack;
  +	hh->arg_start = mm->arg_start;
  +	hh->arg_end = mm->arg_end;
  +	hh->env_start = mm->env_start;
  +	hh->env_end = mm->env_end;
  +
  +	hh->map_count = mm->map_count;
  Some fields above should also be protected with mmap_sem, like ->brk,
  ->map_count, and possibly others (I'm not a memory expert though).
 true; keep in mind, though, that the container will be frozen during
 this time, so nothing should change at all. The only exception would
 be if, for instance, someone is killing the container while we save
 its state.
 
 Sure. So you think that taking mm->mmap_sem below is useless? I tend to
 believe so, since no other task should share this mm_struct at this time, and
 we could state that ptrace should not interfere during restart. However, I'm
 never confident when ptrace considerations come in...

Not quite.

Probing the value of mm->brk is always safe, although it may turn out to
yield an incorrect value. Traversing the vma's isn't safe, because if, for
instance, the target task dies in the middle, it may alter the vma list.
So the mmap_sem protects against the latter.

Anyway, it won't hurt to be extra safe and take the semaphore earlier.

Ptrace, btw, cannot come in because the container is (supposedly) frozen.

Oren.

  +
  +	/* FIX: need also mm->flags */
  +
  +	ret = cr_write_obj(ctx, &h, hh);
  +	if (ret < 0)
  +		goto out;
  +
  +	/* write the vma's */
  +	down_read(&mm->mmap_sem);
  +	for (vma = 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +int ret;
 +
 +/* FIX: need to test whether container is checkpointable */
 +
 +ret = cr_write_hdr(ctx);
 +if (!ret)
 +ret = cr_write_task(ctx, current);
 +if (!ret)
 +ret = cr_write_tail(ctx);
 +
 +/* on success, return (unique) checkpoint identifier */
 +if (!ret)
 +ret = ctx-crid;
 Does this crid have a purpose?
 yes, at least three; both are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) then
 this will be the identifier with which the restart (or cleanup) would refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

...
crid = checkpoint(...);
switch (crid) {
case -1:
perror("checkpoint failed");
break;
default:
fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
/* proceed with execution after checkpoint */
...
break;
case 0:
fprintf(stderr, "returned after restart\n");
/* proceed with action required following a restart */
...
break;
}
...
 If I understand correctly, this crid can live for quite a long time. So 
 many of
 them could be generated while some container would accumulate incremental
 checkpoints on, say crid 5, and possibly crid 5 could be reused for another
 unrelated checkpoint during that time. This brings the issue of allocating 
 crids
 reliably (using something like a pidmap for instance). Moreover, if such 
 ids are
 exposed to userspace, we need to remember which ones are allocated across
 reboots and migrations.

 I'm afraid that this becomes too complex...
 And I'm afraid I didn't explain myself well. So let me rephrase:

 CRIDs are always _local_ to a specific node. The local CRID counter is
 bumped (atomically) with each checkpoint attempt. The main use case is
 for when the checkpoint is kept in memory, either shortly (until it is
 written back to disk) or for a longer time (use-cases that want to keep
 it there). It only remains valid as long as the checkpoint image is
 still in memory and has not been committed to storage/network. Think
 of it as a way to identify the operation instance.

 So they can live quite a long time, but only as long as the original
 node is still alive and the checkpoint is still kept in memory. They
 are meaningless across reboots and migrations. I don't think a wrap
 around is a concern, but we can use 64 bit if that is the case.

 Finally, the incremental checkpoint use-case: imagine a container that
 is checkpointed regularly every minute. The first checkpoint will be
 a full checkpoint, say CRID=1. The second will be incremental with
 respect to the first, with CRID=2, and so on for the third and the fourth.
 Userspace could use these CRIDs to name the image files (for example,
 app.img.CRID). Assume that we decide (big if) that the convention is
 that the last part of the filename must be the CRID, and if we decide
 (another big if) to save the CRID as part of the checkpoint image --
 the part that describes the incremental nature of a new checkpoint.
 (That part would specify where to get state that wasn't really saved
 in the new checkpoint but instead can be retrieved from older ones.)
 If that was the case, then the logic in the kernel would be fairly easy:
 find (and access) the actual files that hold the data. Note that
 in this case the CRIDs are guaranteed to be unique per series of
 incremental checkpoints, and incremental checkpointing is meaningless
 across reboots (and we can require that across migrations too).
 
 Letting the kernel guess where to find the missing data of an 
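
[Editorial note: a hedged userspace sketch of the CRID-based file naming
convention discussed above. The app.img.CRID pattern is only the proposed
convention and checkpoint() is only the proposed syscall; neither is an
existing API.]

```c
#include <stdio.h>

/* Build "app.img.<crid>" so a later incremental checkpoint (or a
 * restart tool) can locate the earlier images of the series by CRID.
 * The naming pattern is the convention proposed above, nothing more. */
static void image_name(char *buf, size_t len, const char *app, int crid)
{
	snprintf(buf, len, "%s.img.%d", app, crid);
}
```

A full checkpoint would then produce app.img.1, the following incrementals
app.img.2, app.img.3, and so on; a restart only needs the CRIDs recorded in
a later image to derive the earlier filenames.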

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 12:28:57PM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +  int ret;
 +
 +  /* FIX: need to test whether container is checkpointable */
 +
 +  ret = cr_write_hdr(ctx);
 +  if (!ret)
 +  ret = cr_write_task(ctx, current);
 +  if (!ret)
 +  ret = cr_write_tail(ctx);
 +
 +  /* on success, return (unique) checkpoint identifier */
 +  if (!ret)
 +  ret = ctx->crid;
 Does this crid have a purpose?
 yes, at least three; all are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) 
 then
 this will be the identifier with which the restart (or cleanup) would 
 refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data 
 can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

  ...
  crid = checkpoint(...);
  switch (crid) {
  case -1:
  perror("checkpoint failed");
  break;
  default:
  fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
  /* proceed with execution after checkpoint */
  ...
  break;
  case 0:
  fprintf(stderr, "returned after restart\n");
  /* proceed with action required following a restart */
  ...
  break;
  }
  ...
 If I understand correctly, this crid can live for quite a long time. So 
 many of
 them could be generated while some container would accumulate incremental
 checkpoints on, say crid 5, and possibly crid 5 could be reused for 
 another
 unrelated checkpoint during that time. This brings the issue of 
 allocating crids
 reliably (using something like a pidmap for instance). Moreover, if such 
 ids are
 exposed to userspace, we need to remember which ones are allocated across
 reboots and migrations.

 I'm afraid that this becomes too complex...
 And I'm afraid I didn't explain myself well. So let me rephrase:

 CRIDs are always _local_ to a specific node. The local CRID counter is
 bumped (atomically) with each checkpoint attempt. The main use case is
 for when the checkpoint is kept in memory, either shortly (until it is
 written back to disk) or for a longer time (use-cases that want to keep
 it there). It only remains valid as long as the checkpoint image is
 still in memory and has not been committed to storage/network. Think
 of it as a way to identify the operation instance.

 So they can live quite a long time, but only as long as the original
 node is still alive and the checkpoint is still kept in memory. They
 are meaningless across reboots and migrations. I don't think a wrap
 around is a concern, but we can use 64 bit if that is the case.

 Finally, the incremental checkpoint use-case: imagine a container that
 is checkpointed regularly every minute. The first checkpoint will be
 a full checkpoint, say CRID=1. The second will be incremental with
 respect to the first, with CRID=2, and so on for the third and the fourth.
 Userspace could use these CRIDs to name the image files (for example,
 app.img.CRID). Assume that we decide (big if) that the convention is
 that the last part of the filename must be the CRID, and if we decide
 (another big if) to save the CRID as part of the checkpoint image --
 the part that describes the incremental nature of a new checkpoint.
 (That part would specify where to get state that wasn't really saved
 in the new checkpoint but instead can be retrieved from older ones.)
 If that was the case, then the logic in the kernel would be fairly easy:
 find (and access) the actual files that hold the data. Note that
 in this case the CRIDs are guaranteed to be unique per series of
 incremental checkpoints, and incremental checkpointing is meaningless
 across reboots (and we can require that across migrations too).
 Letting the kernel guess 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):


 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +   int ret;
 +
 +   /* FIX: need to test whether container is checkpointable */
 +
 +   ret = cr_write_hdr(ctx);
 +   if (!ret)
 +   ret = cr_write_task(ctx, current);
 +   if (!ret)
 +   ret = cr_write_tail(ctx);
 +
 +   /* on success, return (unique) checkpoint identifier */
 +   if (!ret)
 +   ret = ctx->crid;

 Does this crid have a purpose?

 yes, at least three; all are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) then
 this will be the identifier with which the restart (or cleanup) would refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

   ...
   crid = checkpoint(...);
   switch (crid) {
   case -1:
   perror("checkpoint failed");
   break;
   default:
   fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
   /* proceed with execution after checkpoint */
   ...
   break;
   case 0:
   fprintf(stderr, "returned after restart\n");
   /* proceed with action required following a restart */
   ...
   break;
   }
   ...

Thanks - for this and the later explanations in replies to Louis.

Really I had no doubt it had a purpose :)  but wasn't sure what it was.
Quite clear now.  Thanks.

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Serge E. Hallyn
This list is getting on my nerves.  Louis, I'm sorry the threading
is going to get messed up.

- Forwarded message from [EMAIL PROTECTED] -

Subject: Content filtered message notification
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date: Wed, 30 Jul 2008 09:16:12 -0700

The attached message matched the containers mailing list's content
filtering rules and was prevented from being forwarded on to the list
membership.  You are receiving the only remaining copy of the
discarded message.


Date: Wed, 30 Jul 2008 18:15:35 +0200
From: Louis Rilling [EMAIL PROTECTED]
To: Oren Laadan [EMAIL PROTECTED]
Cc: Linux Containers [EMAIL PROTECTED]
Subject: Re: [RFC][PATCH 2/2] CR: handle a single task with private memory
maps
Reply-To: [EMAIL PROTECTED]

Hi Oren,

On Tue, Jul 29, 2008 at 11:27:17PM -0400, Oren Laadan wrote:
 
 Expand the template sys_checkpoint and sys_restart to be able to dump
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.
 
 This big patch adds a mechanism to transfer data between kernel or user
 space to and from the file given by the caller (sys.c), alloc/setup/free
 of the checkpoint/restart context (sys.c), output wrappers and basic
 checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
 wrappers and basic restart handling (restart.c), and finally the memory
 restore (rstr_mem.c).

This looks globally clean to me, but I'm sure that others will have stronger
arguments against or in favor of it.

Just a few comments inline, in case it helps.

[...]

 diff --git a/ckpt/checkpoint.c b/ckpt/checkpoint.c
 new file mode 100644
 index 000..1698a35
 --- /dev/null
 +++ b/ckpt/checkpoint.c

[...]

 +/* dump the task_struct of a given task */
 +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
 +{
 + struct cr_hdr h;
 + struct cr_hdr_task *hh = ctx->tbuf;
 +
 + h.type = CR_HDR_TASK;
 + h.len = sizeof(*hh);
 + h.id = ctx->pid;
 +
 + hh->state = t->state;
 + hh->exit_state = t->exit_state;
 + hh->exit_code = t->exit_code;
 + hh->exit_signal = t->exit_signal;
 +
 + hh->pid = t->pid;
 + hh->tgid = t->tgid;

IIRC, it is assumed that pid and tgid will be restored before actually calling
sys_restart(), e.g. by giving the proper pid and clone flags to a variant of
do_fork(). So, maybe these ids are useless here and should be put earlier in the
checkpoint header (see also the matching comment below in 
cr_read_task_struct()).

[...]

 diff --git a/ckpt/ckpt_mem.c b/ckpt/ckpt_mem.c
 new file mode 100644
 index 000..12caad0
 --- /dev/null
 +++ b/ckpt/ckpt_mem.c

[...]

 +/**
 + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
 + * @ctx - checkpoint context
 + * @pgarr - page-array to fill
 + * @vma - vma to scan
 + * @start - start address (updated)
 + */
 +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
 +  struct vm_area_struct *vma, unsigned long *start)
 +{
 + unsigned long end = vma->vm_end;
 + unsigned long addr = *start;
 + struct page **pagep;
 + unsigned long *addrp;
 + int cow, nr, ret = 0;
 +
 + nr = pgarr->nleft;
 + pagep = &pgarr->pages[pgarr->nused];
 + addrp = &pgarr->addrs[pgarr->nused];
 + cow = !!vma->vm_file;
 +
 + while (addr < end) {
 + struct page *page;
 +
 + /* simplified version of get_user_pages(): already have vma,
 + * only need FOLL_TOUCH, and (for now) ignore fault stats */
 +
 + cond_resched();
 + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
 + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
 + if (ret & VM_FAULT_ERROR) {
 + if (ret & VM_FAULT_OOM)
 + ret = -ENOMEM;
 + else if (ret & VM_FAULT_SIGBUS)
 + ret = -EFAULT;
 + else
 + BUG();
 + break;
 + }
 + cond_resched();
 + }

I guess that 'ret' should be checked somewhere after this loop.

 +
 + if (IS_ERR(page)) {
 + ret = PTR_ERR(page);
 + break;
 + }
 +
 + if (page == ZERO_PAGE(0))
 + page = NULL;/* zero page: ignore */
 + else if (cow && page_mapping(page) != NULL)
 + page = NULL;/* clean cow: ignore */
 + else {
 + get_page(page);
 + *(addrp++) = addr;
 + *(pagep++) = page;
 + if (--nr == 0) {
 + addr += PAGE_SIZE;
 + break;
 + }
 + }
 +
 + addr += PAGE_SIZE;
 + }
 +
 + 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Dave Hansen
On Wed, 2008-07-30 at 11:52 -0500, Serge E. Hallyn wrote:
 
 This list is getting on my nerves.  Louis, I'm sorry the threading
 is going to get messed up.

I think I just cleared out the mime type filtering.

-- Dave



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Oren Laadan


KOSAKI Motohiro wrote:
 Hi
 
 Expand the template sys_checkpoint and sys_restart to be able to dump
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.

 This big patch adds a mechanism to transfer data between kernel or user
 space to and from the file given by the caller (sys.c), alloc/setup/free
 of the checkpoint/restart context (sys.c), output wrappers and basic
 checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
 wrappers and basic restart handling (restart.c), and finally the memory
 restore (rstr_mem.c).

 Signed-off-by: Oren Laadan [EMAIL PROTECTED]
 
 please write a documentation of describe memory dump file format,
 and split save and restore to two patches.

While save and restore functionality is already split to different source
files, I can easily refine the patch.

Dump file format: as agreed during the OLS, the format will be nested (as
in depth-first as opposed to breadth-first). The rationale is to be
able to stream the entire checkpoint image without file seeks. The suggested
layout looks like this:

1. Image header: information about kernel version, CR version, kernel
configuration, CPU capabilities etc.

2. Container global section: state that is global to the container, e.g.
SysV IPC, network setup.

3. Task tree/forest state: number of tasks and their relationships

4. State of each task (one by one): including task_struct state, thread
state, cpu registers, followed by memory, files, signals etc.

5. Image trailer: marking the end of the image and providing checksum and
the like.

Since this patch is only a proof-of-concept, it has a very simple #1,
no #2 or #3, limited #4 and very simple #5.

This patch still doesn't handle shared objects, but they will be handled
as follows: the first time a shared object is accessed (to dump it) it is
given a unique identifier and dumped in full. The next time(s) the object
is found, only the identifier is saved instead.

A bit more specific about the format: it will be composed of records,
such that each record has a pre-header that identifies its contents and a
payload. (The idea here is to enable parallel checkpointing in the future
in which multiple threads interleave data from multiple processes into
a single stream).

The pre-header is:

struct cr_hdr {
__s16 type;
__s16 len;
__u32 id;
};

'type' identifies the type of the following payload, and 'len' gives its length.
The 'id' identifies the object instance to which it belongs (it is currently
unused). The meaning of the 'id' field may vary depending on the type. For
example, for type CR_HDR_MM, the 'id' will identify the task to which this
MM belongs. The payload varies depending on its type, for instance, the data
describing a task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK)
and so on.
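
[Editorial note: the record discipline just described (a pre-header, then
exactly 'len' payload bytes, streamed back-to-back with no seeks) can be
sketched as follows; the stand-in buffer and cr_write_record() helper are
illustrative, not the patch's actual code.]

```c
#include <stdint.h>
#include <string.h>

struct cr_hdr {
	int16_t type;
	int16_t len;
	uint32_t id;
};

/* Stand-in for the checkpoint file/stream used by the real patch. */
static unsigned char image[4096];
static size_t image_pos;

static int cr_write(const void *buf, size_t len)
{
	if (image_pos + len > sizeof(image))
		return -1;
	memcpy(image + image_pos, buf, len);
	image_pos += len;
	return 0;
}

/* One record: pre-header first, then exactly h.len payload bytes. */
static int cr_write_record(int16_t type, uint32_t id,
			   const void *payload, int16_t len)
{
	struct cr_hdr h = { .type = type, .len = len, .id = id };

	if (cr_write(&h, sizeof(h)) < 0)
		return -1;
	return cr_write(payload, len);
}
```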

The format of the memory dump is slightly different: for each vma, there is
a 'struct cr_vma'; if the vma is file-mapped, it will be followed by the file
name. The cr_vma->npages will tell how many pages were dumped for this vma.
Then it will be followed by the actual data: first a dump of the addresses of
all dumped pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next vma and so on.
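
[Editorial note: from that per-vma layout, the on-stream size of one vma
record follows directly; a small sketch, where the sizes are illustrative
and the real cr_vma struct is defined in the patch.]

```c
#include <stddef.h>

/* Bytes occupied by one vma record in the dump stream: the cr_vma
 * header, the optional mapped-file name, then npages dumped addresses
 * followed by npages pages of contents. */
static size_t vma_record_size(size_t hdr_size, size_t name_len,
			      size_t npages, size_t page_size)
{
	return hdr_size + name_len
	     + npages * sizeof(unsigned long)	/* address list */
	     + npages * page_size;		/* page contents */
}
```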

For a single simple task, the format of the resulting checkpoint image would
look like this (assume 2 vma's, one file mapped with 2 dumped pages and the
other anonymous with 3 dumped pages):

cr_hdr + cr_hdr_head
cr_hdr + cr_hdr_task
cr_hdr + cr_hdr_mm
cr_hdr + cr_hdr_vma + cr_hdr + string
addr1, addr2
page1, page2
cr_hdr + cr_hdr_vma
addr3, addr4, addr5
page3, page4, page5
cr_hdr + cr_mm_context
cr_hdr + cr_hdr_thread
cr_hdr + cr_hdr_cpu
cr_hdr + cr_hdr_tail

Will add this documentation to the next version of the patch.

Oren.



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Oren Laadan


Louis Rilling wrote:
 Hi Oren,
 
 On Tue, Jul 29, 2008 at 11:27:17PM -0400, Oren Laadan wrote:
 Expand the template sys_checkpoint and sys_restart to be able to dump
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.

 This big patch adds a mechanism to transfer data between kernel or user
 space to and from the file given by the caller (sys.c), alloc/setup/free
 of the checkpoint/restart context (sys.c), output wrappers and basic
 checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
 wrappers and basic restart handling (restart.c), and finally the memory
 restore (rstr_mem.c).
 
 This looks globally clean to me, but I'm sure that others will have stronger
 arguments against or in favor of it.
 
 Just a few comments inline, in case it helps.
 
 [...]
 
 diff --git a/ckpt/checkpoint.c b/ckpt/checkpoint.c
 new file mode 100644
 index 000..1698a35
 --- /dev/null
 +++ b/ckpt/checkpoint.c
 
 [...]
 
 +/* dump the task_struct of a given task */
 +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
 +{
 +struct cr_hdr h;
 +struct cr_hdr_task *hh = ctx->tbuf;
 +
 +h.type = CR_HDR_TASK;
 +h.len = sizeof(*hh);
 +h.id = ctx->pid;
 +
 +hh->state = t->state;
 +hh->exit_state = t->exit_state;
 +hh->exit_code = t->exit_code;
 +hh->exit_signal = t->exit_signal;
 +
 +hh->pid = t->pid;
 +hh->tgid = t->tgid;
 
 IIRC, it is assumed that pid and tgid will be restored before actually calling
 sys_restart(), e.g. by giving the proper pid and clone flags to a variant of
 do_fork(). So, maybe these ids are useless here and should be put earlier in 
 the
 checkpoint header (see also the matching comment below in 
 cr_read_task_struct()).

oops .. left-overs -- definitely don't belong there anymore.

 
 [...]
 
 diff --git a/ckpt/ckpt_mem.c b/ckpt/ckpt_mem.c
 new file mode 100644
 index 000..12caad0
 --- /dev/null
 +++ b/ckpt/ckpt_mem.c
 
 [...]
 
 +/**
 + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
 + * @ctx - checkpoint context
 + * @pgarr - page-array to fill
 + * @vma - vma to scan
 + * @start - start address (updated)
 + */
 +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
 + struct vm_area_struct *vma, unsigned long *start)
 +{
 +unsigned long end = vma->vm_end;
 +unsigned long addr = *start;
 +struct page **pagep;
 +unsigned long *addrp;
 +int cow, nr, ret = 0;
 +
 +nr = pgarr->nleft;
 +pagep = &pgarr->pages[pgarr->nused];
 +addrp = &pgarr->addrs[pgarr->nused];
 +cow = !!vma->vm_file;
 +
 +while (addr < end) {
 +struct page *page;
 +
 +/* simplified version of get_user_pages(): already have vma,
 +* only need FOLL_TOUCH, and (for now) ignore fault stats */
 +
 +cond_resched();
 +while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
 +ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
 +if (ret & VM_FAULT_ERROR) {
 +if (ret & VM_FAULT_OOM)
 +ret = -ENOMEM;
 +else if (ret & VM_FAULT_SIGBUS)
 +ret = -EFAULT;
 +else
 +BUG();
 +break;
 +}
 +cond_resched();
 +}
 
 I guess that 'ret' should be checked somewhere after this loop.

yes; this is where a break(2) construct in C would come in handy :)

 
 +
 +if (IS_ERR(page)) {
 +ret = PTR_ERR(page);
 +break;
 +}
 +
 +if (page == ZERO_PAGE(0))
 +page = NULL;/* zero page: ignore */
 +else if (cow && page_mapping(page) != NULL)
 +page = NULL;/* clean cow: ignore */
 +else {
 +get_page(page);
 +*(addrp++) = addr;
 +*(pagep++) = page;
 +if (--nr == 0) {
 +addr += PAGE_SIZE;
 +break;
 +}
 +}
 +
 +addr += PAGE_SIZE;
 +}
 +
 +if (unlikely(ret < 0)) {
 +nr = pgarr->nleft - nr;
 +while (nr--)
 +page_cache_release(*(--pagep));
 +return ret;
 +}
 +
 +*start = addr;
 +return (pgarr->nleft - nr);
 +}
 +
 
 [...]
 
 +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
 +{
 +struct cr_hdr h;
 +struct cr_hdr_mm *hh = ctx->tbuf;
 +struct mm_struct *mm;
 +struct vm_area_struct *vma;
 +int ret;
 +
 +h.type = CR_HDR_MM;
 +h.len = sizeof(*hh);
 +h.id = ctx->pid;
 +
 +mm = get_task_mm(t);
 +
 +hh->tag = 1;/* non-zero will mean first time encounter */
 +
 +

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 + int ret;
 +
 + /* FIX: need to test whether container is checkpointable */
 +
 + ret = cr_write_hdr(ctx);
 + if (!ret)
 + ret = cr_write_task(ctx, current);
 + if (!ret)
 + ret = cr_write_tail(ctx);
 +
 + /* on success, return (unique) checkpoint identifier */
 + if (!ret)
 + ret = ctx->crid;

Does this crid have a purpose?

 +
 + return ret;
 +}


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Oren Laadan


Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +int ret;
 +
 +/* FIX: need to test whether container is checkpointable */
 +
 +ret = cr_write_hdr(ctx);
 +if (!ret)
 +ret = cr_write_task(ctx, current);
 +if (!ret)
 +ret = cr_write_tail(ctx);
 +
 +/* on success, return (unique) checkpoint identifier */
 +if (!ret)
 +ret = ctx->crid;
 
 Does this crid have a purpose?

yes, at least three; all are for the future, but important to set the
meaning of the return value of the syscall already now. The crid is
the CR-identifier that identifies the checkpoint. Every checkpoint is
assigned a unique number (using an atomic counter).

1) if a checkpoint is taken and kept in memory (instead of to a file) then
this will be the identifier with which the restart (or cleanup) would refer
to the (in memory) checkpoint image

2) to reduce downtime of the checkpoint, data will be aggregated on the
checkpoint context, as well as referenced to (cow-ed) pages. This data can
persist between calls to sys_checkpoint(), and the 'crid', again, will be
used to identify the (in-memory-to-be-dumped-to-storage) context.

3) for incremental checkpoint (where a successive checkpoint will only
save what has changed since the previous checkpoint) there will be a need
to identify the previous checkpoints (to be able to know where to take
data from during restart). Again, a 'crid' is handy.

[in fact, for the 3rd use, it will make sense to write that number as
part of the checkpoint image header]

Note that by doing so, a process that checkpoints itself (in its own
context), can use code that is similar to the logic of fork():

...
crid = checkpoint(...);
switch (crid) {
case -1:
perror("checkpoint failed");
break;
default:
fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
/* proceed with execution after checkpoint */
...
break;
case 0:
fprintf(stderr, "returned after restart\n");
/* proceed with action required following a restart */
...
break;
}
...

Oren.



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-30 Thread Dave Hansen
On Tue, 2008-07-29 at 23:27 -0400, Oren Laadan wrote: 
 Expand the template sys_checkpoint and sys_restart to be able to dump 
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.

So, can we all agree that this is a good example of the in-kernel
checkpoint/restart approach?  It may not be the smallest possible
example, but it certainly demonstrates the approach for me.

-- Dave



[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-29 Thread KOSAKI Motohiro
Hi

 Expand the template sys_checkpoint and sys_restart to be able to dump
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.
 
 This big patch adds a mechanism to transfer data between kernel or user
 space to and from the file given by the caller (sys.c), alloc/setup/free
 of the checkpoint/restart context (sys.c), output wrappers and basic
 checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
 wrappers and basic restart handling (restart.c), and finally the memory
 restore (rstr_mem.c).
 
 Signed-off-by: Oren Laadan [EMAIL PROTECTED]

please write a documentation of describe memory dump file format,
and split save and restore to two patches.

