[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-20 Thread Oren Laadan


Pavel Machek wrote:
 Hi!
 
 I have to wonder if this is just a symptom of us trying to do this the
 wrong way.  We're trying to talk the kernel into writing internal gunk
 into a FD.  You're right, it is like a splice where one end of the pipe
 is in the kernel.

 Any thoughts on a better way to do this?  
 Maybe you can invert the logic and let the new syscalls create a file
 descriptor, and then have user space read or splice the checkpoint
 data from it, and restore it by writing to the file descriptor.
 It's probably easy to do using anon_inode_getfd() and would solve this
 problem, but at the same time make checkpointing the current thread
 hard if not impossible.
 Yeah, it does seem kinda backwards.  But, instead of even having to
 worry about the anon_inode stuff, why don't we just put it in a fs like
 everything else?  checkpointfs!
 One reason is that I suspect that stops us from being able to send that
 data straight to a pipe to compress and/or send on the network, without
 hitting local disk.  Though if the checkpointfs was ram-based maybe not?

 As Oren has pointed out before, passing in an fd means we can pass a
 socket into the syscall.
 
 If you do pass a socket, will it handle blocking correctly? Getting
 deadlocked task would be bad. What happens if I try to snapshot into
 /proc/self/fd/0 ? Or maybe restore from /proc/cmdline?

Hmmm... these are good points.

Keep in mind that our principal goal is to checkpoint a whole container,
rather then a task to checkpoint itself (which is a by-product). Of course
your comments apply to a whole container as well.

In both cases, I don't think that blocking on a socket is a problem; the
checkpointer will enter a TASK_INTERRUPTIBLE state. Where is the deadlock ?
Writing or reading to/from /proc/self/... likewise - the programmer must
understand the implications, or the program won't work as expected. I don't
see a possible deadlock here, though.

For example - writing to /proc/self/fd/0 is ok; the state of fd[0] of that
task will be captured at some point in the middle of the checkpoint, so
after restart one cannot assume anything about the file position; the rest
should work.

Oren.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-14 Thread Pavel Emelyanov
Oren Laadan wrote:
 
 Dave Hansen wrote:
 On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
 On Friday 08 August 2008, Dave Hansen wrote:
 +  hh-magic = 0x00a2d200;
 +  hh-major = (LINUX_VERSION_CODE  16)  0xff;
 +  hh-minor = (LINUX_VERSION_CODE  8)  0xff;
 +  hh-patch = (LINUX_VERSION_CODE)  0xff;
 ...
 +}
 Do you rely on the kernel version in order to determine the format
 of the binary data, or is it just informational?

 If you think the format can change in incompatible ways, you
 probably need something more specific than the version number
 these days, because there are just so many different trees with
 the same numbers.
 Yeah, this is very true.  My guess is that we'll need something like
 what we do with modversions. 
 
 Exactly. The header should eventually contain sufficient information
 to describe the kernel version, configuration, compiler, cpu (arch and
 capabilities), and checkpoint code version.

Sorry for late response - I was on vacation till Wednesday.

I am opposed against having *explicit* information about the kernel
configuration inside the image.

I see this like we store object images in a file, and these images do 
*not* change depending on the .config. But instead of this, at the time 
of restore we may *only* detect whether we can restore this type of
object or not.

E.g. consider we are saving a container image on ipv6 node and trying 
to restore from it on the one without the ipv6. In that case we *may*
have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO
and fail the restoration process when finding such in an input file. But 
what we should *not* do is to write any information about whether we had
the CONFIG_IPV6 turned on on the dumping side and check for this on the
restoring side.

(Sorry, if this question is already settled, but the discussion thread
 built by my mailer is a bit messy, so I suspect I could miss some part
 of the threads)

 How would you suggest to identify the origin tree with an identifier
 (or a text field) in the header ?
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-14 Thread Pavel Machek
Hi!

I have to wonder if this is just a symptom of us trying to do this the
wrong way.  We're trying to talk the kernel into writing internal gunk
into a FD.  You're right, it is like a splice where one end of the pipe
is in the kernel.

Any thoughts on a better way to do this?  
   
   Maybe you can invert the logic and let the new syscalls create a file
   descriptor, and then have user space read or splice the checkpoint
   data from it, and restore it by writing to the file descriptor.
   It's probably easy to do using anon_inode_getfd() and would solve this
   problem, but at the same time make checkpointing the current thread
   hard if not impossible.
  
  Yeah, it does seem kinda backwards.  But, instead of even having to
  worry about the anon_inode stuff, why don't we just put it in a fs like
  everything else?  checkpointfs!
 
 One reason is that I suspect that stops us from being able to send that
 data straight to a pipe to compress and/or send on the network, without
 hitting local disk.  Though if the checkpointfs was ram-based maybe not?
 
 As Oren has pointed out before, passing in an fd means we can pass a
 socket into the syscall.

If you do pass a socket, will it handle blocking correctly? Getting
deadlocked task would be bad. What happens if I try to snapshot into
/proc/self/fd/0 ? Or maybe restore from /proc/cmdline?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-14 Thread Dave Hansen
On Thu, 2008-08-14 at 07:53 +0200, Pavel Machek wrote:
  As Oren has pointed out before, passing in an fd means we can pass a
  socket into the syscall.
 
 If you do pass a socket, will it handle blocking correctly? Getting
 deadlocked task would be bad. What happens if I try to snapshot into
 /proc/self/fd/0 ? Or maybe restore from /proc/cmdline?

Heh, that's a good point.  What was the other code where we kept coming
up with deadlocks like that?  Anyone remember?

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-14 Thread Dave Hansen
On Thu, 2008-08-14 at 12:09 +0400, Pavel Emelyanov wrote:
 E.g. consider we are saving a container image on ipv6 node and trying 
 to restore from it on the one without the ipv6. In that case we *may*
 have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO
 and fail the restoration process when finding such in an input file. But 
 what we should *not* do is to write any information about whether we had
 the CONFIG_IPV6 turned on on the dumping side and check for this on the
 restoring side.

The only problem I can see with this is that you lose efficiency,
especially when you have to build your checkpoint image with lots of
things that are config-specific.

The approach sounds like a good one in theory, but I'm a bit skeptical
that we could stick to it in practice, in a mainline kernel where there
are billions of config options.  It is definitely something to strive
for, though.  Good point!

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Serge E. Hallyn
Quoting Arnd Bergmann ([EMAIL PROTECTED]):
 On Friday 08 August 2008, Dave Hansen wrote:
  On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
 
+struct cr_hdr_tail {
+   __u32 magic;
+   __u32 cksum[2];
+};
   
   This structure has an odd multiple of 32-bit members, which means
   that if you put it into a larger structure that also contains
   64-bit members, the larger structure may get different alignment
   on x86-32 and x86-64, which you might want to avoid.
   I can't tell if this is an actual problem here.
  
  Can't we just declare all these things __packed__ and stop worrying
  about aligning them all manually?
 
 I personally dislike __packed__ because it makes it very easy to get
 suboptimal object code. If you either pad every structure to a multiple
 of 64 bits or avoid __u64 members, you don't have a problem. Also,
 I think avoiding implicit padding inside of data structures is very
 helpful for user interfaces, if necessary you can always add explicit
 padding.
 
   get_fs()/set_fs() always feels a bit ouch, and this way you have
   to use __force to avoid the warnings about __user pointer casts
   in sparse.
   I wonder if you can use splice_read/splice_write to get around
   this problem.
  
  I have to wonder if this is just a symptom of us trying to do this the
  wrong way.  We're trying to talk the kernel into writing internal gunk
  into a FD.  You're right, it is like a splice where one end of the pipe
  is in the kernel.
  
  Any thoughts on a better way to do this?  
 
 Maybe you can invert the logic and let the new syscalls create a file
 descriptor, and then have user space read or splice the checkpoint
 data from it, and restore it by writing to the file descriptor.
 It's probably easy to do using anon_inode_getfd() and would solve this
 problem, but at the same time make checkpointing the current thread
 hard if not impossible.
 
  Yes, eventually.  I think one good point is that we should probably
  remove this now so that we *have* to think about security implications
  as we add each individual patch.  For instance, what kind of checking do
  we do when we restore an mlock()'d VMA?
 
 I think the question can be generalized further: How do you deal with
 saved tasks that have more priviledges than the task doing the restore?
 
 There are probably more, but what I can think of right now includes:
 * anything you can set using ulimit
 * capabilities
 * threads running as another user/group
 * open files that have had their permissions changed after the open

At the checkpoint end, the ptrace checks seem apporpriate:  If you're
allowed to stop and manipulate the process, then you may as well be
allowed to checkpoint and see/tweak its memory that way.

At the restart end, every resource which was checkpointed will have to
be re-created, and permissions checked against the privilege of the
task which did the restart.  We may end up having to make use of the new
credentials for this.

This could become unpleasant: if an unprivileged task asked a privileged
helper to create something for the unprivileged task to use (i.e. a
raw socket), then the user needs to be privileged to re-created the
resource.  But it's necessary.

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Arnd Bergmann
On Monday 11 August 2008, Serge E. Hallyn wrote:
 One reason is that I suspect that stops us from being able to send that
 data straight to a pipe to compress and/or send on the network, without
 hitting local disk.  Though if the checkpointfs was ram-based maybe not?
 
 As Oren has pointed out before, passing in an fd means we can pass a
 socket into the syscall.
 
 Using the anon_inodes would also prevent that, but if it makes for a
 cleaner overall solution then I'm not against considering either one
 of course.
 

With anon_inodes, you can still implement splice_read/splice_write,
so you can splice it into a socket.

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Serge E. Hallyn
Quoting Dave Hansen ([EMAIL PROTECTED]):
 On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:
   I have to wonder if this is just a symptom of us trying to do this the
   wrong way.  We're trying to talk the kernel into writing internal gunk
   into a FD.  You're right, it is like a splice where one end of the pipe
   is in the kernel.
   
   Any thoughts on a better way to do this?  
  
  Maybe you can invert the logic and let the new syscalls create a file
  descriptor, and then have user space read or splice the checkpoint
  data from it, and restore it by writing to the file descriptor.
  It's probably easy to do using anon_inode_getfd() and would solve this
  problem, but at the same time make checkpointing the current thread
  hard if not impossible.
 
 Yeah, it does seem kinda backwards.  But, instead of even having to
 worry about the anon_inode stuff, why don't we just put it in a fs like
 everything else?  checkpointfs!

One reason is that I suspect that stops us from being able to send that
data straight to a pipe to compress and/or send on the network, without
hitting local disk.  Though if the checkpointfs was ram-based maybe not?

As Oren has pointed out before, passing in an fd means we can pass a
socket into the syscall.

Using the anon_inodes would also prevent that, but if it makes for a
cleaner overall solution then I'm not against considering either one
of course.

 I'm also really not convinced that putting the entire checkpoint in one
 glob is really the solution, either.  I mean, is system call overhead
 really a problem here?
 
 -- Dave
 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Arnd Bergmann
On Monday 11 August 2008, Serge E. Hallyn wrote:
 At the restart end, every resource which was checkpointed will have to
 be re-created, and permissions checked against the privilege of the
 task which did the restart.  We may end up having to make use of the new
 credentials for this.
 
 This could become unpleasant: if an unprivileged task asked a privileged
 helper to create something for the unprivileged task to use (i.e. a
 raw socket), then the user needs to be privileged to re-created the
 resource.  But it's necessary.

Right. Of course, the hard part here will be to make it obvious to
be safe. Having to check all sorts of permissions means there will
be many opportunities for exploitable bugs.

The best way I can think of for this would be to use existing syscalls
(e.g. sched_setscheduler, setfsuid, ...) from user space whereever
possible and do only the bare minimum for the restart part in the kernel.

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Dave Hansen
On Mon, 2008-08-11 at 18:53 +0200, Arnd Bergmann wrote:
 The best way I can think of for this would be to use existing syscalls
 (e.g. sched_setscheduler, setfsuid, ...) from user space whereever
 possible and do only the bare minimum for the restart part in the
 kernel.

Well, the current direction is about as far away from that as you can
get, unless we basically call those system calls from inside our new
sys_restart() one.  As of now, we're as much work in the kernel as
possible, and doing the bare minimum in userspace.  That's what both
Oren and our OpenVZ colleagues have advocated.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Jonathan Corbet
I'm trying to figure out this patch set...here's a few things which
have caught my eye in passing.

 +/**
 + * cr_get_fname - return pathname of a given file
 + * @file: file pointer
 + * @buf: buffer for pathname
 + * @n: buffer length (in) and pathname length (out)
 + *
 + * if the buffer provivded by the caller is too small, allocate a new
 + * buffer; caller should call cr_put_pathname() for cleanup
 + */
 +char *cr_get_fname(struct path *path, struct path *root, char *buf,
 int *n) +{
 + char *fname;
 +
 + fname = __d_path(path, root, buf, *n);
 +
 + if (IS_ERR(fname)  PTR_ERR(fname) == -ENAMETOOLONG) {
 +  if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))

This seems like a clunky and error-prone interface - why not just have it
allocate the memory always?  But, in this case, cr_get_fname() always seems
to be called with ctx-tbuf, which, in turn, is an order-1 allocation.
Here you're saying that if it's too small, you'll try replacing it with an
order-0 allocation instead.  I rather suspect that's not going to help.

 +/* write the checkpoint header */
 +static int cr_write_hdr(struct cr_ctx *ctx)
 +{
 + struct cr_hdr h;
 + struct cr_hdr_head *hh = ctx-tbuf;
 + struct timeval ktv;
 +
 + h.type = CR_HDR_HEAD;
 + h.len = sizeof(*hh);
 + h.id = 0;
 +
 + do_gettimeofday(ktv);
 +
 + hh-magic = 0x00a2d200;

This magic number is hard-coded in a number of places.  Could it maybe
benefit from a macro, which, in turn, could maybe end up in linux/magic.h?

 +/* dump the task_struct of a given task */
 +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)

This function is going to break every time somebody changes struct
task_struct.  I'm not quite sure how to prevent that.  I wonder if the
modversions stuff could somehow be employed to detect changes and make the
build fail?

 +/**
 + * sys_checkpoint - checkpoint a container
 + * @pid: pid of the container init(1) process
 + * @fd: file to which dump the checkpoint image
 + * @flags: checkpoint operation flags
 + */
 +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
 flags) +{
 + struct cr_ctx *ctx;
 + struct file *file;
 + int fput_needed;
 + int ret;
 +
 + if (!capable(CAP_SYS_ADMIN))
 + return -EPERM;

Like others, I wondered why CAP_SYS_ADMIN was required here.  I *still*
wonder, though, how you'll ever be able to do restart without a privilege
check.  There must be a thousand ways to compromise a system by messing
with the checkpoint file.

 + file = fget_light(fd, fput_needed);
 + if (!file)
 + return -EBADF;

Should you maybe check for write access?  An attempt to overwrite a
read-only file won't succeed, but you could save a lot of work by just
failing it with a clear code here.  

What about the file position?  Perhaps there could be a good reason to
checkpoint a process into the middle of a file, don't know.

In general, I don't see a whole lot of locking going on.  Is it really
possible to save and restore memory without ever holding mmap_sem?

jon
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Dave Hansen
On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote:
 I'm trying to figure out this patch set...here's a few things which
 have caught my eye in passing.
 
  +/**
  + * cr_get_fname - return pathname of a given file
  + * @file: file pointer
  + * @buf: buffer for pathname
  + * @n: buffer length (in) and pathname length (out)
  + *
  + * if the buffer provivded by the caller is too small, allocate a new
  + * buffer; caller should call cr_put_pathname() for cleanup
  + */
  +char *cr_get_fname(struct path *path, struct path *root, char *buf,
  int *n) +{
  +   char *fname;
  +
  +   fname = __d_path(path, root, buf, *n);
  +
  +   if (IS_ERR(fname)  PTR_ERR(fname) == -ENAMETOOLONG) {
  +if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
 
 This seems like a clunky and error-prone interface - why not just have it
 allocate the memory always?  But, in this case, cr_get_fname() always seems
 to be called with ctx-tbuf, which, in turn, is an order-1 allocation.
 Here you're saying that if it's too small, you'll try replacing it with an
 order-0 allocation instead.  I rather suspect that's not going to help.

Yeah, it doesn't make much sense on the surface.  I would imagine that
this has some use for when we're stacking things up in the ctx-hbuf
rather than just using it as a completely temporary buffer.  But, in any
case, it doesn't make sense as it stands now, so I think it needs to be
reworked.

  +/* write the checkpoint header */
  +static int cr_write_hdr(struct cr_ctx *ctx)
  +{
  +   struct cr_hdr h;
  +   struct cr_hdr_head *hh = ctx-tbuf;
  +   struct timeval ktv;
  +
  +   h.type = CR_HDR_HEAD;
  +   h.len = sizeof(*hh);
  +   h.id = 0;
  +
  +   do_gettimeofday(ktv);
  +
  +   hh-magic = 0x00a2d200;
 
 This magic number is hard-coded in a number of places.  Could it maybe
 benefit from a macro, which, in turn, could maybe end up in linux/magic.h?
 
  +/* dump the task_struct of a given task */
  +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
 
 This function is going to break every time somebody changes struct
 task_struct.  I'm not quite sure how to prevent that.  I wonder if the
 modversions stuff could somehow be employed to detect changes and make the
 build fail?

In general, I think any time that we are checkpointing $THING and $THING
changes, the checkpoint will break.  It just so happens that all we're
checkpointing here is the task_struct, so $THING == task_struct for
now. :)

The things that *really* worry me are things like when flags change
semantics subtly.  Or, let's say a flag is used for two different things
in 2.6.26.4 vs 2.6.27.  I'm not sure we're ever going to be in a
position to find and fix up stuff like that.

That's one reason I have been advocating doing checkpoint/restart in
much tinier bits so that we can understand each of them as we go along.
I also think that the *less* we expose to userspace, the better.

  +/**
  + * sys_checkpoint - checkpoint a container
  + * @pid: pid of the container init(1) process
  + * @fd: file to which dump the checkpoint image
  + * @flags: checkpoint operation flags
  + */
  +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
  flags) +{
  +   struct cr_ctx *ctx;
  +   struct file *file;
  +   int fput_needed;
  +   int ret;
  +
  +   if (!capable(CAP_SYS_ADMIN))
  +   return -EPERM;
 
 Like others, I wondered why CAP_SYS_ADMIN was required here.  I *still*
 wonder, though, how you'll ever be able to do restart without a privilege
 check.  There must be a thousand ways to compromise a system by messing
 with the checkpoint file.

As with everything else coming from userspace, the checkpoint file
should be completely untrusted.  I do think, though, that the ptrace
analogy that Serge?? made is a good one.

  +   file = fget_light(fd, fput_needed);
  +   if (!file)
  +   return -EBADF;
 
 Should you maybe check for write access?  An attempt to overwrite a
 read-only file won't succeed, but you could save a lot of work by just
 failing it with a clear code here.  

That's true.  I'll take a look and see.

This patch does reach down and use vfs_write() at some point.  There
really aren't any other in-kernel users that do this (short of ecryptfs
and plan9fs).  That makes me doubt that we're even using a good approach
here.

 What about the file position?  Perhaps there could be a good reason to
 checkpoint a process into the middle of a file, don't know.

I think this is a good example of a place where the kernel can let
userspace shoot itself in its foot if it wants.  We might also want to
allow things to be sent over fds that don't necessarily have positions,
like sockets or pipes.

 In general, I don't see a whole lot of locking going on.  Is it really
 possible to save and restore memory without ever holding mmap_sem?

I personally haven't audited the locking, yet.  It is going to be fun!

But, take a look in patch 3/4:

+   /* write the vma's */
+   

[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-11 Thread Oren Laadan


Dave Hansen wrote:
 On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote:
 I'm trying to figure out this patch set...here's a few things which
 have caught my eye in passing.

 +/**
 + * cr_get_fname - return pathname of a given file
 + * @file: file pointer
 + * @buf: buffer for pathname
 + * @n: buffer length (in) and pathname length (out)
 + *
 + * if the buffer provivded by the caller is too small, allocate a new
 + * buffer; caller should call cr_put_pathname() for cleanup
 + */
 +char *cr_get_fname(struct path *path, struct path *root, char *buf,
 int *n) +{
 +   char *fname;
 +
 +   fname = __d_path(path, root, buf, *n);
 +
 +   if (IS_ERR(fname)  PTR_ERR(fname) == -ENAMETOOLONG) {
 +if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
 This seems like a clunky and error-prone interface - why not just have it
 allocate the memory always?  But, in this case, cr_get_fname() always seems
 to be called with ctx-tbuf, which, in turn, is an order-1 allocation.
 Here you're saying that if it's too small, you'll try replacing it with an
 order-0 allocation instead.  I rather suspect that's not going to help.
 
 Yeah, it doesn't make much sense on the surface.  I would imagine that
 this has some use for when we're stacking things up in the ctx-hbuf
 rather than just using it as a completely temporary buffer.  But, in any
 case, it doesn't make sense as it stands now, so I think it needs to be
 reworked.

Dave is right on the money: in Zap (the equivalent of) cr_get_fname()
may be called with a buffer smaller than PATH_MAX (one page) and hence
the need to allocate ad-hoc. Indeed in the current code this is not the
case (yet?) so I'll go ahead and simplify it.

 
 +/* write the checkpoint header */
 +static int cr_write_hdr(struct cr_ctx *ctx)
 +{
 +   struct cr_hdr h;
 +   struct cr_hdr_head *hh = ctx-tbuf;
 +   struct timeval ktv;
 +
 +   h.type = CR_HDR_HEAD;
 +   h.len = sizeof(*hh);
 +   h.id = 0;
 +
 +   do_gettimeofday(ktv);
 +
 +   hh-magic = 0x00a2d200;
 This magic number is hard-coded in a number of places.  Could it maybe
 benefit from a macro, which, in turn, could maybe end up in linux/magic.h?

 +/* dump the task_struct of a given task */
 +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
 This function is going to break every time somebody changes struct
 task_struct.  I'm not quite sure how to prevent that.  I wonder if the
 modversions stuff could somehow be employed to detect changes and make the
 build fail?
 
 In general, I think any time that we are checkpointing $THING and $THING
 changes, the checkpoint will break.  It just so happens that all we're
 checkpointing here is the task_struct, so $THING == task_struct for
 now. :)
 
 The things that *really* worry me are things like when flags change
 semantics subtly.  Or, let's say a flag is used for two different things
 in 2.6.26.4 vs 2.6.27.  I'm not sure we're ever going to be in a
 position to find and fix up stuff like that.

One way to reduce the risk is to use an intermediate representation to
kernel native data and properties (e.g. classify VMAs during checkpoint
instead of relying blindly on the flags).

The problem is not so much in restarting a checkpoint image from old
kernel on a new kernel - that can be handled by conversion in user space.

Tracking changes affecting the checkpoint/restart logic - well, if
eventually checkpoint/restart gets to becomes main-stream enough that
developers will be aware of it.

 
 That's one reason I have been advocating doing checkpoint/restart in
 much tinier bits so that we can understand each of them as we go along.
 I also think that the *less* we expose to userspace, the better.
 
 +/**
 + * sys_checkpoint - checkpoint a container
 + * @pid: pid of the container init(1) process
 + * @fd: file to which dump the checkpoint image
 + * @flags: checkpoint operation flags
 + */
 +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
 flags) +{
 +   struct cr_ctx *ctx;
 +   struct file *file;
 +   int fput_needed;
 +   int ret;
 +
 +   if (!capable(CAP_SYS_ADMIN))
 +   return -EPERM;
 Like others, I wondered why CAP_SYS_ADMIN was required here.  I *still*
 wonder, though, how you'll ever be able to do restart without a privilege
 check.  There must be a thousand ways to compromise a system by messing
 with the checkpoint file.
 
 As with everything else coming from userspace, the checkpoint file
 should be completely untrusted.  I do think, though, that the ptrace
 analogy that Serge?? made is a good one.

The only reason I made the analogy without actually implementing it is lack
of time and familiarity with the ptrace permission world.

 
 +   file = fget_light(fd, fput_needed);
 +   if (!file)
 +   return -EBADF;
 Should you maybe check for write access?  An attempt to overwrite a
 read-only file won't succeed, but you could save a lot of work by just
 failing it with a clear code here.  
 
 That's true.  I'll take a 

[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-09 Thread Arnd Bergmann
On Saturday 09 August 2008, Dave Hansen wrote:
 On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote:
  The main problem I see with that would be atomicity: If you want multiple
  processes to keep interacting with each other, you need to save them at
  the same point in time, which gets harder as you split your interface into
  more than a single file descriptor.
 
 It could take ages to write out a checkpoint even to a single fd, so I
 suspect we'd have the exact same kinds of issues either way.

I guess either way, you have to SIGSTOP (or similar) all the tasks you want
to checkpoint atomically before you start saving the contents.
If you use a single fd, you can do that under the covers, when using a 
more complex file system, it seems more logical to require an explicit
interface for this.

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-09 Thread Dave Hansen
On Sat, 2008-08-09 at 08:37 +0200, Arnd Bergmann wrote:
 On Saturday 09 August 2008, Dave Hansen wrote:
  On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote:
   The main problem I see with that would be atomicity: If you want multiple
   processes to keep interacting with each other, you need to save them at
   the same point in time, which gets harder as you split your interface into
   more than a single file descriptor.
  
  It could take ages to write out a checkpoint even to a single fd, so I
  suspect we'd have the exact same kinds of issues either way.
 
 I guess either way, you have to SIGSTOP (or similar) all the tasks you want
 to checkpoint atomically before you start saving the contents.
 If you use a single fd, you can do that under the covers, when using a 
 more complex file system, it seems more logical to require an explicit
 interface for this.

Oh, we're already working on patches to the freezer code to do this for
us.  There's a branch in here from Matt H. that's doing just that:

http://git.kernel.org/?p=linux/kernel/git/daveh/linux-2.6-next-lxc.git;a=shortlog

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Arnd Bergmann
On Friday 08 August 2008, Dave Hansen wrote:
 +/* write the checkpoint header */
 +static int cr_write_hdr(struct cr_ctx *ctx)
 +{
 + struct cr_hdr h;
 + struct cr_hdr_head *hh = ctx-tbuf;
 + struct timeval ktv;
 +
 + h.type = CR_HDR_HEAD;
 + h.len = sizeof(*hh);
 + h.id = 0;
 +
 + do_gettimeofday(ktv);
 +
 + hh-magic = 0x00a2d200;
 + hh-major = (LINUX_VERSION_CODE  16)  0xff;
 + hh-minor = (LINUX_VERSION_CODE  8)  0xff;
 + hh-patch = (LINUX_VERSION_CODE)  0xff;
 +
 + hh-version = 1;
 +
 + hh-flags = ctx-flags;
 + hh-time = ktv.tv_sec;
 +
 + return cr_write_obj(ctx, h, hh);
 +}

Do you rely on the kernel version in order to determine the format
of the binary data, or is it just informational?

If you think the format can change in incompatible ways, you
probably need something more specific than the version number
these days, because there are just so many different trees with
the same numbers.


 +
 +/* debugging */
 +#if 0
 +#define CR_PRINTK(str, args...)  \
 + printk(KERN_ERR [EMAIL PROTECTED]:  str, __func__, __LINE__, ##args)
 +#else
 +#define CR_PRINTK(...)   do {} while (0)
 +#endif
 +

Please use the existing pr_debug and dev_debug here, instead of creating
yet another version.

 +struct cr_hdr_tail {
 + __u32 magic;
 + __u32 cksum[2];
 +};

This structure has an odd multiple of 32-bit members, which means
that if you put it into a larger structure that also contains
64-bit members, the larger structure may get different alignment
on x86-32 and x86-64, which you might want to avoid.
I can't tell if this is an actual problem here.

 +
 +struct cr_hdr_task {
 + __u64 state;
 + __u32 exit_state;
 + __u32 exit_code, exit_signal;
 +
 + __u16 pid;
 + __u16 tgid;
 +
 + __u64 utime, stime, utimescaled, stimescaled;
 + __u64 gtime;
 + __u64 prev_utime, prev_stime;
 + __u64 nvcsw, nivcsw;
 + __u64 start_time_sec, start_time_nsec;
 + __u64 real_start_time_sec, real_start_time_nsec;
 + __u64 min_flt, maj_flt;
 +
 + __s16 task_comm_len;
 + char comm[TASK_COMM_LEN];
 +};

In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is
different from x86-64, since it will be 32-bit aligned on x86-32.

 +
 +/*
 + * During restart the code reads in data from the chekcpoint image into a
 + * temporary buffer (ctx-hbuf). Because operations can be nested, one
 + * should call cr_hbuf_get() to reserve space in the buffer, and then
 + * cr_hbuf_put() when it no longer needs that space
 + */
 +
 +#include linux/version.h
 +#include linux/sched.h
 +#include linux/file.h
 +
 +#include ckpt.h
 +#include ckpt_hdr.h
 +
 +/**
 + * cr_hbuf_get - reserve space on the hbuf
 + * @ctx: checkpoint context
 + * @n: number of bytes to reserve
 + */
 +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
 +{
 + void *ptr;
 +
 + BUG_ON(ctx-hpos + n  CR_HBUF_TOTAL);
 + ptr = (void *) (((char *) ctx-hbuf) + ctx-hpos);
 + ctx-hpos += n;
 + return ptr;
 +}

Can (ctx-hpos + n  CR_HBUF_TOTAL) be controlled by the input
data? If so, this is a denial-of-service attack.

 +
 +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
 +{
 + mm_segment_t oldfs;
 + int ret;
 +
 + oldfs = get_fs();
 + set_fs(KERNEL_DS);
 + ret = cr_uwrite(ctx, buf, count);
 + set_fs(oldfs);
 +
 + return ret;
 +}

get_fs()/set_fs() always feels a bit ouch, and this way you have
to use __force to avoid the warnings about __user pointer casts
in sparse.
I wonder if you can use splice_read/splice_write to get around
this problem.

 + struct cr_ctx *ctx;
 + struct file *file;
 + int fput_needed;
 + int ret;
 +
 + if (!capable(CAP_SYS_ADMIN))
 + return -EPERM;
 +

Why do you need CAP_SYS_ADMIN for this? Can't regular users
be allowed to checkpoint/restart their own tasks?

 --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps  
 2008-08-05 09:04:27.0 -0700
 +++ linux-2.6.git-dave/Makefile   2008-08-05 09:07:53.0 -0700
 @@ -611,7 +611,7 @@ export mod_strip_cmd
  
  
  ifeq ($(KBUILD_EXTMOD),)
 -core-y   += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
 +core-y   += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
  

The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Dave Hansen
On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
 On Friday 08 August 2008, Dave Hansen wrote:
  +   hh-magic = 0x00a2d200;
  +   hh-major = (LINUX_VERSION_CODE  16)  0xff;
  +   hh-minor = (LINUX_VERSION_CODE  8)  0xff;
  +   hh-patch = (LINUX_VERSION_CODE)  0xff;
...
  +}
 
 Do you rely on the kernel version in order to determine the format
 of the binary data, or is it just informational?
 
 If you think the format can change in incompatible ways, you
 probably need something more specific than the version number
 these days, because there are just so many different trees with
 the same numbers.

Yeah, this is very true.  My guess is that we'll need something like
what we do with modversions. 

  +/* debugging */
  +#if 0
  +#define CR_PRINTK(str, args...)  \
  +   printk(KERN_ERR [EMAIL PROTECTED]:  str, __func__, __LINE__, ##args)
  +#else
  +#define CR_PRINTK(...) do {} while (0)
  +#endif
  +
 
 Please use the existing pr_debug and dev_debug here, instead of creating
 yet another version.

Sure thing.  Will do.

  +struct cr_hdr_tail {
  +   __u32 magic;
  +   __u32 cksum[2];
  +};
 
 This structure has an odd multiple of 32-bit members, which means
 that if you put it into a larger structure that also contains
 64-bit members, the larger structure may get different alignment
 on x86-32 and x86-64, which you might want to avoid.
 I can't tell if this is an actual problem here.

Can't we just declare all these things __packed__ and stop worrying
about aligning them all manually?

  +
  +/*
  + * During restart the code reads in data from the chekcpoint image into a
  + * temporary buffer (ctx-hbuf). Because operations can be nested, one
  + * should call cr_hbuf_get() to reserve space in the buffer, and then
  + * cr_hbuf_put() when it no longer needs that space
  + */
  +
  +#include linux/version.h
  +#include linux/sched.h
  +#include linux/file.h
  +
  +#include ckpt.h
  +#include ckpt_hdr.h
  +
  +/**
  + * cr_hbuf_get - reserve space on the hbuf
  + * @ctx: checkpoint context
  + * @n: number of bytes to reserve
  + */
  +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
  +{
  +   void *ptr;
  +
  +   BUG_ON(ctx-hpos + n  CR_HBUF_TOTAL);
  +   ptr = (void *) (((char *) ctx-hbuf) + ctx-hpos);
  +   ctx-hpos += n;
  +   return ptr;
  +}
 
 Can (ctx-hpos + n  CR_HBUF_TOTAL) be controlled by the input
 data? If so, this is a denial-of-service attack.

Ugh, this is crappy code anyway.  It needs to return an error and have
someone else handle it.

  +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
  +{
  +   mm_segment_t oldfs;
  +   int ret;
  +
  +   oldfs = get_fs();
  +   set_fs(KERNEL_DS);
  +   ret = cr_uwrite(ctx, buf, count);
  +   set_fs(oldfs);
  +
  +   return ret;
  +}
 
 get_fs()/set_fs() always feels a bit ouch, and this way you have
 to use __force to avoid the warnings about __user pointer casts
 in sparse.
 I wonder if you can use splice_read/splice_write to get around
 this problem.

I have to wonder if this is just a symptom of us trying to do this the
wrong way.  We're trying to talk the kernel into writing internal gunk
into a FD.  You're right, it is like a splice where one end of the pipe
is in the kernel.

Any thoughts on a better way to do this?  

  +   struct cr_ctx *ctx;
  +   struct file *file;
  +   int fput_needed;
  +   int ret;
  +
  +   if (!capable(CAP_SYS_ADMIN))
  +   return -EPERM;
  +
 
 Why do you need CAP_SYS_ADMIN for this? Can't regular users
 be allowed to checkpoint/restart their own tasks?

Yes, eventually.  I think one good point is that we should probably
remove this now so that we *have* to think about security implications
as we add each individual patch.  For instance, what kind of checking do
we do when we restore an mlock()'d VMA?

I'll pull this check out so it causes pain. (the good kind)

  --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps
  2008-08-05 09:04:27.0 -0700
  +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.0 -0700
  @@ -611,7 +611,7 @@ export mod_strip_cmd
   
   
   ifeq ($(KBUILD_EXTMOD),)
  -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
  +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
   
 
 The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?

Fine with me.  Renamed in new patches, hopefully.

I'll send new patches out later today.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Oren Laadan


Dave Hansen wrote:
 On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
 On Friday 08 August 2008, Dave Hansen wrote:
 +   hh-magic = 0x00a2d200;
 +   hh-major = (LINUX_VERSION_CODE  16)  0xff;
 +   hh-minor = (LINUX_VERSION_CODE  8)  0xff;
 +   hh-patch = (LINUX_VERSION_CODE)  0xff;
 ...
 +}
 Do you rely on the kernel version in order to determine the format
 of the binary data, or is it just informational?

 If you think the format can change in incompatible ways, you
 probably need something more specific than the version number
 these days, because there are just so many different trees with
 the same numbers.
 
 Yeah, this is very true.  My guess is that we'll need something like
 what we do with modversions. 

Exactly. The header should eventually contain sufficient information
to describe the kernel version, configuration, compiler, cpu (arch and
capabilities), and checkpoint code version.

How would you suggest to identify the origin tree with an identifier
(or a text field) in the header ?

 
 +/* debugging */
 +#if 0
 +#define CR_PRINTK(str, args...)  \
 +   printk(KERN_ERR [EMAIL PROTECTED]:  str, __func__, __LINE__, ##args)
 +#else
 +#define CR_PRINTK(...) do {} while (0)
 +#endif
 +
 Please use the existing pr_debug and dev_debug here, instead of creating
 yet another version.
 
 Sure thing.  Will do.
 
 +struct cr_hdr_tail {
 +   __u32 magic;
 +   __u32 cksum[2];
 +};
 This structure has an odd multiple of 32-bit members, which means
 that if you put it into a larger structure that also contains
 64-bit members, the larger structure may get different alignment
 on x86-32 and x86-64, which you might want to avoid.
 I can't tell if this is an actual problem here.
 
 Can't we just declare all these things __packed__ and stop worrying
 about aligning them all manually?
 
 +
 +/*
 + * During restart the code reads in data from the chekcpoint image into a
 + * temporary buffer (ctx-hbuf). Because operations can be nested, one
 + * should call cr_hbuf_get() to reserve space in the buffer, and then
 + * cr_hbuf_put() when it no longer needs that space
 + */
 +
 +#include linux/version.h
 +#include linux/sched.h
 +#include linux/file.h
 +
 +#include ckpt.h
 +#include ckpt_hdr.h
 +
 +/**
 + * cr_hbuf_get - reserve space on the hbuf
 + * @ctx: checkpoint context
 + * @n: number of bytes to reserve
 + */
 +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
 +{
 +   void *ptr;
 +
 +   BUG_ON(ctx-hpos + n  CR_HBUF_TOTAL);
 +   ptr = (void *) (((char *) ctx-hbuf) + ctx-hpos);
 +   ctx-hpos += n;
 +   return ptr;
 +}
 Can (ctx-hpos + n  CR_HBUF_TOTAL) be controlled by the input
 data? If so, this is a denial-of-service attack.
 
 Ugh, this is crappy code anyway.  It needs to return an error and have
 someone else handle it.

Actually not quite. 'n' is _not_ controlled by the input data, and at
the same time ctx-hpos should always carry enough room by design. If
that is not the case, then it's a logical bug, not DoS attack.

To avoid repetitive malloc/free, ctx-hbuf is a buffer to host headers
as they are read; since headers can be read in a nested manner, ctx-hpos
points to the next free position in that buffer. So 'n' is the size of
the header that we are about to read - decided at compile time, not the
user input. The BUG_ON() statement asserts that by design we have enough
buffer (like you'd check that you didn't run out of kernel stack...)

If it is preferred, we can change this to write a kernel message and
return a special error telling that a logical error has occurred.

 
 +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
 +{
 +   mm_segment_t oldfs;
 +   int ret;
 +
 +   oldfs = get_fs();
 +   set_fs(KERNEL_DS);
 +   ret = cr_uwrite(ctx, buf, count);
 +   set_fs(oldfs);
 +
 +   return ret;
 +}
 get_fs()/set_fs() always feels a bit ouch, and this way you have
 to use __force to avoid the warnings about __user pointer casts
 in sparse.
 I wonder if you can use splice_read/splice_write to get around
 this problem.
 
 I have to wonder if this is just a symptom of us trying to do this the
 wrong way.  We're trying to talk the kernel into writing internal gunk
 into a FD.  You're right, it is like a splice where one end of the pipe
 is in the kernel.
 
 Any thoughts on a better way to do this?  
 
 +   struct cr_ctx *ctx;
 +   struct file *file;
 +   int fput_needed;
 +   int ret;
 +
 +   if (!capable(CAP_SYS_ADMIN))
 +   return -EPERM;
 +
 Why do you need CAP_SYS_ADMIN for this? Can't regular users
 be allowed to checkpoint/restart their own tasks?
 
 Yes, eventually.  I think one good point is that we should probably
 remove this now so that we *have* to think about security implications
 as we add each individual patch.  For instance, what kind of checking do
 we do when we restore an mlock()'d VMA?
 
 I'll pull this check out so it causes pain. (the good kind)

Hmmm... even if not strictly now, we *will* need admin privileges for
the CR operations, for the following 

[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Arnd Bergmann
On Friday 08 August 2008, Dave Hansen wrote:
 On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:

   +struct cr_hdr_tail {
   + __u32 magic;
   + __u32 cksum[2];
   +};
  
  This structure has an odd multiple of 32-bit members, which means
  that if you put it into a larger structure that also contains
  64-bit members, the larger structure may get different alignment
  on x86-32 and x86-64, which you might want to avoid.
  I can't tell if this is an actual problem here.
 
 Can't we just declare all these things __packed__ and stop worrying
 about aligning them all manually?

I personally dislike __packed__ because it makes it very easy to get
suboptimal object code. If you either pad every structure to a multiple
of 64 bits or avoid __u64 members, you don't have a problem. Also,
I think avoiding implicit padding inside of data structures is very
helpful for user interfaces, if necessary you can always add explicit
padding.

  get_fs()/set_fs() always feels a bit ouch, and this way you have
  to use __force to avoid the warnings about __user pointer casts
  in sparse.
  I wonder if you can use splice_read/splice_write to get around
  this problem.
 
 I have to wonder if this is just a symptom of us trying to do this the
 wrong way.  We're trying to talk the kernel into writing internal gunk
 into a FD.  You're right, it is like a splice where one end of the pipe
 is in the kernel.
 
 Any thoughts on a better way to do this?  

Maybe you can invert the logic and let the new syscalls create a file
descriptor, and then have user space read or splice the checkpoint
data from it, and restore it by writing to the file descriptor.
It's probably easy to do using anon_inode_getfd() and would solve this
problem, but at the same time make checkpointing the current thread
hard if not impossible.

 Yes, eventually.  I think one good point is that we should probably
 remove this now so that we *have* to think about security implications
 as we add each individual patch.  For instance, what kind of checking do
 we do when we restore an mlock()'d VMA?

I think the question can be generalized further: How do you deal with
saved tasks that have more priviledges than the task doing the restore?

There are probably more, but what I can think of right now includes:
* anything you can set using ulimit
* capabilities
* threads running as another user/group
* open files that have had their permissions changed after the open

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Dave Hansen
On Fri, 2008-08-08 at 16:59 -0400, Oren Laadan wrote:
 To avoid repetitive malloc/free, ctx-hbuf is a buffer to host headers
 as they are read;

kmalloc/kfree() area really, really fast.  I wonder if the code gets
easier or harder to read if we just alloc/free as we need to.

How large are these allocations, usually?  Will stack allocation work in
most cases?

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Arnd Bergmann
On Friday 08 August 2008, Oren Laadan wrote:

  Yeah, this is very true.  My guess is that we'll need something like
  what we do with modversions. 
 
 Exactly. The header should eventually contain sufficient information
 to describe the kernel version, configuration, compiler, cpu (arch and
 capabilities), and checkpoint code version.
 
 How would you suggest to identify the origin tree with an identifier
 (or a text field) in the header ?

Including struct utsname in the header covers most of this. I supposed
you can't do it entirely safe, and you always need to be prepared for
malicious input data, so there probably isn't much point in getting
too fancy beyond what you need to do to prevent user errors.

 If it is preferred, we can change this to write a kernel message and
 return a special error telling that a logical error has occurred.

My recommendation in general is to make kernel code crash loudly if
there is a bug in the kernel itself. Returning error codes makes most
sense if they get sent back to the user, which then can make sense of
it, as documented in the man page of the syscall.

  Yes, eventually.  I think one good point is that we should probably
  remove this now so that we *have* to think about security implications
  as we add each individual patch.  For instance, what kind of checking do
  we do when we restore an mlock()'d VMA?
  
  I'll pull this check out so it causes pain. (the good kind)
 
 Hmmm... even if not strictly now, we *will* need admin privileges for
 the CR operations, for the following reasons:
 
 checkpoint: we save the entire state of a set of processes to a file - so
 we must have privileges to do so, at least within (or with respect to) the
 said container. Even if we are the user who owns the container, we'll need
 root access within that container.
 
 restart: we restore the entire set of a set of processes, which may require
 some privileged operations (again, at least within or with respect to the
 said container). Otherwise any user could inject any restart data into the
 kernel and create any set of processes with arbitrary permissions.
 
 In a sense, we need something similar to granting ptrace access.

Exactly. There was a project that implemented checkpoint/restart through
ptrace (don't remember what it was called), so with certain limitations
it should also be possible to implement the syscalls so that any user that
can ptrace the tasks can also checkpoint them.

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Dave Hansen
On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:
  I have to wonder if this is just a symptom of us trying to do this the
  wrong way.  We're trying to talk the kernel into writing internal gunk
  into a FD.  You're right, it is like a splice where one end of the pipe
  is in the kernel.
  
  Any thoughts on a better way to do this?  
 
 Maybe you can invert the logic and let the new syscalls create a file
 descriptor, and then have user space read or splice the checkpoint
 data from it, and restore it by writing to the file descriptor.
 It's probably easy to do using anon_inode_getfd() and would solve this
 problem, but at the same time make checkpointing the current thread
 hard if not impossible.

Yeah, it does seem kinda backwards.  But, instead of even having to
worry about the anon_inode stuff, why don't we just put it in a fs like
everything else?  checkpointfs!

I'm also really not convinced that putting the entire checkpoint in one
glob is really the solution, either.  I mean, is system call overhead
really a problem here?

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure

2008-08-08 Thread Arnd Bergmann
On Saturday 09 August 2008, Dave Hansen wrote:
 On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:

  Maybe you can invert the logic and let the new syscalls create a file
  descriptor, and then have user space read or splice the checkpoint
  data from it, and restore it by writing to the file descriptor.
  It's probably easy to do using anon_inode_getfd() and would solve this
  problem, but at the same time make checkpointing the current thread
  hard if not impossible.
 
 Yeah, it does seem kinda backwards.  But, instead of even having to
 worry about the anon_inode stuff, why don't we just put it in a fs like
 everything else?  checkpointfs!

Well, anon_inodes are really easy to use and have replaced some of the
simple non-mountable file systems in the kernel.

checkpointfs sounds interesting and I guess in a plan9 world of fairies
and fantasy, you should be able to create a checkpoint of your system using
'tar czf - /proc/', but I'm not sure it helps here.

The main problem I see with that would be atomicity: If you want multiple
processes to keep interacting with each other, you need to save them at
the same point in time, which gets harder as you split your interface into
more than a single file descriptor.

Arnd 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel