Re: [PATCH v1 1/2] open: add close_range()

2019-05-24 Thread Christian Brauner
On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:
> > This adds the close_range() syscall. It allows one to efficiently close
> > a range of file descriptors, up to all file descriptors of the calling
> > task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >  /* that exec is sensitive */
> >  unshare(CLONE_FILES);
> >  /* we don't want anything past stderr here */
> >  close_range(3, ~0U);
> >  execve();
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. daemons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> >   task is multi-threaded and shares the file descriptor table with another
> >   thread in which case two threads could race with one thread allocating
> >   file descriptors and the other one closing them via close_range(). For the
> >   general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >- Python (cf. [2])
> >- Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX, or to resort
> > to RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >         int dir_fd;
> >         DIR *dir;
> >         struct dirent *direntp;
> >
> >         dir = opendir("/proc/self/fd");
> >         if (!dir)
> >                 return -1;
> >         dir_fd = dirfd(dir);
> >         while ((direntp = readdir(dir))) {
> >                 int fd;
> >
> >                 if (strcmp(direntp->d_name, ".") == 0)
> >                         continue;
> >                 if (strcmp(direntp->d_name, "..") == 0)
> >                         continue;
> >                 fd = atoi(direntp->d_name);
> >                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >                         continue;
> >                 close(fd);
> >         }
> >         closedir(dir);
> >         return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> > - close_all_fds(): ~280 us
> > - close_range():    ~24 us
> >
> > 2. closing 1000 open files:
> > - close_all_fds(): ~5000 us
> > - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can
> > e.g. be used to implement extensions. One can imagine userspace wanting
> > to stop at the first error instead of ignoring errors under certain
> > circumstances.
> 
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> 
> Here is another strange but real-life scenario: a crash handler for
> dumping core.
> 
> If an application has network connections it would be better to close them
> all, otherwise clients will wait until the dump finishes or they time out.
> Closing normal files might also be a good idea, to release locks.
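
As an aside, the interface is easy to try from userspace even before libc
grows a wrapper. The following is an editorial sketch built on syscall(2);
the syscall number below is a placeholder, not one assigned by this series:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Placeholder for illustration only; use the number the merged patches
 * actually assign for your architecture. */
#ifndef __NR_close_range
#define __NR_close_range 436
#endif

static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
                                  unsigned int flags)
{
        return syscall(__NR_close_range, fd, max_fd, flags);
}

/* Al's exec pattern from the commit message, spelled out: */
static int exec_clean(char *const argv[], char *const envp[])
{
        unshare(CLONE_FILES);       /* detach a shared fd table, if any */
        sys_close_range(3, ~0U, 0); /* keep only stdin/stdout/stderr */
        return execve(argv[0], argv, envp);
}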

Re: [PATCH v1 1/2] open: add close_range()

2019-05-24 Thread Christian Brauner
On Fri, May 24, 2019 at 09:43:53AM +0200, Arnd Bergmann wrote:
> On Thu, May 23, 2019 at 6:33 PM Christian Brauner wrote:
> > On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> > > On 22.05.2019 18:52, Christian Brauner wrote:
> > > > This adds the close_range() syscall. It allows one to efficiently
> > > > close a range of file descriptors.
> > > >   22 files changed, 100 insertions(+), 9 deletions(-)
> > > >
> > >
> > > It would be better to split the arch/ wiring into a separate patch for
> > > better readability.
> >
> > Ok. You mean only do x86 - that seems to be the standard - and then move
> > the others into a separate patch? Doesn't seem worth having a patch per
> > arch, I'd think.
> 
> I think I would prefer the first patch to just add the call without wiring
> it up anywhere, and a second patch to add it on all architectures,
> including x86.

I've split this into two patches and also bumped arm64's
__NR_compat_syscalls, which I'd missed before, as you mentioned!

Thanks!
Christian


Re: [PATCH v1 1/2] open: add close_range()

2019-05-24 Thread Arnd Bergmann
On Thu, May 23, 2019 at 6:33 PM Christian Brauner wrote:
> On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> > On 22.05.2019 18:52, Christian Brauner wrote:
> > > This adds the close_range() syscall. It allows one to efficiently
> > > close a range of file descriptors.
> > >   22 files changed, 100 insertions(+), 9 deletions(-)
> > >
> >
> > It would be better to split the arch/ wiring into a separate patch for
> > better readability.
>
> Ok. You mean only do x86 - that seems to be the standard - and then move
> the others into a separate patch? Doesn't seem worth having a patch per
> arch, I'd think.

I think I would prefer the first patch to just add the call without wiring
it up anywhere, and a second patch to add it on all architectures,
including x86.

 Arnd


RE: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread David Laight
From: Konstantin Khlebnikov
> Sent: 23 May 2019 17:22

>  > In addition, the syscall will also work for tasks that do not have procfs
>  > mounted and on kernels that do not have procfs support compiled in. In such
>  > situations the only way to make sure that all file descriptors are closed
>  > is to call close() on each file descriptor up to UINT_MAX, or to resort
>  > to RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).

Code using RLIMIT_NOFILE is broken.
It is easy to reduce the hard limit below that of an open fd.

David
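
David's point is easy to demonstrate. In this editorial sketch, a close()
loop bounded by RLIMIT_NOFILE would stop at 64 and miss fd 100, which is
still open:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl = { 64, 64 };
        int fd = open("/dev/null", O_RDONLY);

        dup2(fd, 100);                 /* an open fd well above 64 */
        setrlimit(RLIMIT_NOFILE, &rl); /* drop soft and hard limits below it */

        /* anything iterating only up to RLIMIT_NOFILE (now 64) never
         * reaches fd 100, yet it is still open: */
        if (fcntl(100, F_GETFD) != -1)
                printf("fd 100 survived the limit drop\n");
        return 0;
}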



Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Konstantin Khlebnikov

On 22.05.2019 18:52, Christian Brauner wrote:
> This adds the close_range() syscall. It allows one to efficiently close a
> range of file descriptors, up to all file descriptors of the calling task.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
>  /* that exec is sensitive */
>  unshare(CLONE_FILES);
>  /* we don't want anything past stderr here */
>  close_range(3, ~0U);
>  execve();
>
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
>   task is multi-threaded and shares the file descriptor table with another
>   thread in which case two threads could race with one thread allocating
>   file descriptors and the other one closing them via close_range(). For the
>   general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>- Python (cf. [2])
>- Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX, or to resort to
> RLIMIT_NOFILE/OPEN_MAX trickery (cf. comment [8] on Rust).
>
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
>         int dir_fd;
>         DIR *dir;
>         struct dirent *direntp;
>
>         dir = opendir("/proc/self/fd");
>         if (!dir)
>                 return -1;
>         dir_fd = dirfd(dir);
>         while ((direntp = readdir(dir))) {
>                 int fd;
>
>                 if (strcmp(direntp->d_name, ".") == 0)
>                         continue;
>                 if (strcmp(direntp->d_name, "..") == 0)
>                         continue;
>                 fd = atoi(direntp->d_name);
>                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
>                         continue;
>                 close(fd);
>         }
>         closedir(dir);
>         return 0;
> }
>
> to close_range() yields:
> 1. closing 4 open files:
> - close_all_fds(): ~280 us
> - close_range():    ~24 us
>
> 2. closing 1000 open files:
> - close_all_fds(): ~5000 us
> - close_range():   ~800 us
>
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can
> e.g. be used to implement extensions. One can imagine userspace wanting to
> stop at the first error instead of ignoring errors under certain
> circumstances.

> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.

Here is another strange but real-life scenario: a crash handler for dumping
core.

If an application has network connections it would be better to close them
all, otherwise clients will wait until the dump finishes or they time out.
Closing normal files might also be a good idea, to release locks.

But simple closing might race with other threads - a closed fd would be
reused while some code still thinks it refers to the original file.

Our solution closes files without freeing their fd numbers: it opens
/dev/null and replaces all open descriptors with it using dup2().

So, a special flag for this kind of replacement might be useful.
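
For reference, the replace-instead-of-close approach described above might
look roughly like this editorial sketch (not code from the thread): walk
/proc/self/fd and point every open descriptor at /dev/null, so the fd
numbers stay allocated and racing threads never see them reused.

#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int neuter_open_fds(void)
{
        struct dirent *d;
        int nullfd = open("/dev/null", O_RDWR);
        DIR *dir = nullfd >= 0 ? opendir("/proc/self/fd") : NULL;

        if (!dir) {
                if (nullfd >= 0)
                        close(nullfd);
                return -1;
        }
        while ((d = readdir(dir))) {
                int fd = atoi(d->d_name);

                if (d->d_name[0] == '.' || fd <= 2 ||
                    fd == nullfd || fd == dirfd(dir))
                        continue;
                dup2(nullfd, fd); /* fd now refers to /dev/null, not reusable */
        }
        closedir(dir);
        close(nullfd);
        return 0;
}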

Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Oleg Nesterov
On 05/23, Christian Brauner wrote:
>
> So, given that we would really need another find_next_open_fd(), I think
> sticking to the simple cond_resched() version I sent before is better
> for now, until we see real-world performance issues.

OK, agreed.

Oleg.



Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 04:32:14PM +0200, Jann Horn wrote:
> On Thu, May 23, 2019 at 1:51 PM Christian Brauner wrote:
> [...]
> > I kept it dumb and was about to reply that your solution introduces more
> > code, when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about this?:
> [...]
> > static int __close_next_open_fd(struct files_struct *files,
> >                                 unsigned *curfd, unsigned maxfd)
> > {
> >         struct file *file = NULL;
> >         unsigned fd;
> >         struct fdtable *fdt;
> >
> >         spin_lock(&files->file_lock);
> >         fdt = files_fdtable(files);
> >         fd = find_next_fd(fdt, *curfd);
> 
> find_next_fd() finds free fds, not used ones.
> 
> >         if (fd >= fdt->max_fds || fd > maxfd)
> >                 goto out_unlock;
> >
> >         file = fdt->fd[fd];
> >         rcu_assign_pointer(fdt->fd[fd], NULL);
> >         __put_unused_fd(files, fd);
> 
> You can't do __put_unused_fd() if the old pointer in fdt->fd[fd] was
> NULL - because that means that the fd has been reserved by another
> thread that is about to put a file pointer in there, and if you
> release the fd here, that messes up the refcounting (or hits the
> BUG_ON() in __fd_install()).
> 
> > out_unlock:
> >         spin_unlock(&files->file_lock);
> >
> >         if (!file)
> >                 return -EBADF;
> >
> >         *curfd = fd;
> >         filp_close(file, files);
> >         return 0;
> > }
> >
> > int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > {
> >         if (fd > max_fd)
> >                 return -EINVAL;
> >
> >         while (fd <= max_fd) {
> 
> Note that with a pattern like this, you have to be careful about what
> happens if someone gives you max_fd==0xffffffff - then this condition
> is always true and the loop cannot terminate this way.
> 
> >                 if (__close_next_open_fd(files, &fd, max_fd))
> >                         break;
> 
> (obviously it can still terminate this way)

Yup, this was only a quick draft.
I think the dumb, simple thing that I did before was the best way to do
it for now.
I first thought that the find_next_open_fd() function already existed, but
when I went to write a PoC for testing I realized it doesn't anyway.


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Jann Horn
On Thu, May 23, 2019 at 1:51 PM Christian Brauner wrote:
[...]
> I kept it dumb and was about to reply that your solution introduces more
> code, when it seemed we wanted to keep this very simple for now.
> But then I saw that find_next_opened_fd() already exists as
> find_next_fd(). So it's actually not bad compared to what I sent in v1.
> So - with some small tweaks (need to test it and all now) - how do we
> feel about this?:
[...]
> static int __close_next_open_fd(struct files_struct *files,
>                                 unsigned *curfd, unsigned maxfd)
> {
>         struct file *file = NULL;
>         unsigned fd;
>         struct fdtable *fdt;
>
>         spin_lock(&files->file_lock);
>         fdt = files_fdtable(files);
>         fd = find_next_fd(fdt, *curfd);

find_next_fd() finds free fds, not used ones.

>         if (fd >= fdt->max_fds || fd > maxfd)
>                 goto out_unlock;
>
>         file = fdt->fd[fd];
>         rcu_assign_pointer(fdt->fd[fd], NULL);
>         __put_unused_fd(files, fd);

You can't do __put_unused_fd() if the old pointer in fdt->fd[fd] was
NULL - because that means that the fd has been reserved by another
thread that is about to put a file pointer in there, and if you
release the fd here, that messes up the refcounting (or hits the
BUG_ON() in __fd_install()).

> out_unlock:
>         spin_unlock(&files->file_lock);
>
>         if (!file)
>                 return -EBADF;
>
>         *curfd = fd;
>         filp_close(file, files);
>         return 0;
> }
>
> int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> {
>         if (fd > max_fd)
>                 return -EINVAL;
>
>         while (fd <= max_fd) {

Note that with a pattern like this, you have to be careful about what
happens if someone gives you max_fd==0xffffffff - then this condition
is always true and the loop cannot terminate this way.

>                 if (__close_next_open_fd(files, &fd, max_fd))
>                         break;

(obviously it can still terminate this way)

>                 cond_resched();
>                 fd++;
>         }
>
>         return 0;
> }
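
For the record, one loop shape that stays safe even for max_fd == UINT_MAX
is to test the bound before incrementing. An editorial sketch, with
close_one() standing in for whatever closes a single descriptor:

/* Terminates even when max_fd is UINT_MAX, because the bound is
 * checked before the increment, so fd can never wrap around to 0. */
static void close_in_range(unsigned fd, unsigned max_fd,
                           int (*close_one)(unsigned))
{
        if (fd > max_fd)
                return;
        for (;;) {
                close_one(fd);
                if (fd == max_fd)
                        break;
                fd++;
        }
}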


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 04:14:47PM +0200, Christian Brauner wrote:
> On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> > On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > > On 05/22, Christian Brauner wrote:
> > > >
> > > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > > >  {
> > > > -        struct file *file;
> > > > +        struct file *file = NULL;
> > > >          struct fdtable *fdt;
> > > >  
> > > >          spin_lock(&files->file_lock);
> > > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
> > > >          goto out_unlock;
> > > >          rcu_assign_pointer(fdt->fd[fd], NULL);
> > > >          __put_unused_fd(files, fd);
> > > > -        spin_unlock(&files->file_lock);
> > > > -        return filp_close(file, files);
> > > >  
> > > >  out_unlock:
> > > >          spin_unlock(&files->file_lock);
> > > > -        return -EBADF;
> > > > +        return file;
> > > 
> > > ...
> > > 
> > > > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > > > +{
> > > > +   unsigned int cur_max;
> > > > +
> > > > +   if (fd > max_fd)
> > > > +   return -EINVAL;
> > > > +
> > > > +   rcu_read_lock();
> > > > +   cur_max = files_fdtable(files)->max_fds;
> > > > +   rcu_read_unlock();
> > > > +
> > > > +   /* cap to last valid index into fdtable */
> > > > +   if (max_fd >= cur_max)
> > > > +   max_fd = cur_max - 1;
> > > > +
> > > > +   while (fd <= max_fd) {
> > > > +   struct file *file;
> > > > +
> > > > +   file = pick_file(files, fd++);
> > > 
> > > Well, how about something like
> > > 
> > >   static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
> > >   {
> > >           unsigned int maxfd = fdt->max_fds;
> > >           unsigned int maxbit = maxfd / BITS_PER_LONG;
> > >           unsigned int bitbit = start / BITS_PER_LONG;
> > > 
> > >           bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > >           if (bitbit > maxfd)
> > >                   return maxfd;
> > >           if (bitbit > start)
> > >                   start = bitbit;
> > >           return find_next_bit(fdt->open_fds, maxfd, start);
> > >   }
> > 
> > > 
> > >   unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
> > >   {
> > >           unsigned fd;
> > >           struct file *file;
> > >           struct fdtable *fdt;
> > >   
> > >           spin_lock(&files->file_lock);
> > >           fdt = files_fdtable(files);
> > >           fd = find_next_opened_fd(fdt, start);
> > >           if (fd >= fdt->max_fds || fd > maxfd) {
> > >                   fd = -1;
> > >                   goto out;
> > >           }
> > > 
> > >           file = fdt->fd[fd];
> > >           rcu_assign_pointer(fdt->fd[fd], NULL);
> > >           __put_unused_fd(files, fd);
> > >   out:
> > >           spin_unlock(&files->file_lock);
> > > 
> > >           if (fd == -1u)
> > >                   return fd;
> > > 
> > >           filp_close(file, files);
> > >           return fd + 1;
> > >   }
> > 
> > Thanks, Oleg!
> > 
> > I kept it dumb and was about to reply that your solution introduces more
> > code, when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about this?:
> 
> That's obviously not correct atm but I'll send out a tweaked version in
> a bit.

So, given that we would really need another find_next_open_fd(), I think
sticking to the simple cond_resched() version I sent before is better
for now, until we see real-world performance issues.
I was, however, missing a test for close_range(fd, fd, 0), so I'll
need to send a v2 with this test added.

Christian
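
A single-fd test of that shape might look like the following editorial
sketch, assuming __NR_close_range is defined as in the wrapper sketch
earlier in the thread:

#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
        int fd = open("/dev/null", O_RDONLY);

        assert(fd >= 0);
        /* close exactly the range [fd, fd], i.e. a single descriptor */
        assert(syscall(__NR_close_range, fd, fd, 0) == 0);
        /* the descriptor must really be gone */
        assert(close(fd) == -1 && errno == EBADF);
        return 0;
}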


Re: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread Christian Brauner
On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > On 05/22, Christian Brauner wrote:
> > >
> > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > >  {
> > > -        struct file *file;
> > > +        struct file *file = NULL;
> > >          struct fdtable *fdt;
> > >  
> > >          spin_lock(&files->file_lock);
> > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
> > >          goto out_unlock;
> > >          rcu_assign_pointer(fdt->fd[fd], NULL);
> > >          __put_unused_fd(files, fd);
> > > -        spin_unlock(&files->file_lock);
> > > -        return filp_close(file, files);
> > >  
> > >  out_unlock:
> > >          spin_unlock(&files->file_lock);
> > > -        return -EBADF;
> > > +        return file;
> > 
> > ...
> > 
> > > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > > +{
> > > + unsigned int cur_max;
> > > +
> > > + if (fd > max_fd)
> > > + return -EINVAL;
> > > +
> > > + rcu_read_lock();
> > > + cur_max = files_fdtable(files)->max_fds;
> > > + rcu_read_unlock();
> > > +
> > > + /* cap to last valid index into fdtable */
> > > + if (max_fd >= cur_max)
> > > + max_fd = cur_max - 1;
> > > +
> > > + while (fd <= max_fd) {
> > > + struct file *file;
> > > +
> > > + file = pick_file(files, fd++);
> > 
> > Well, how about something like
> > 
> > static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
> > {
> > unsigned int maxfd = fdt->max_fds;
> > unsigned int maxbit = maxfd / BITS_PER_LONG;
> > unsigned int bitbit = start / BITS_PER_LONG;
> > 
> > bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > if (bitbit > maxfd)
> > return maxfd;
> > if (bitbit > start)
> > start = bitbit;
> > return find_next_bit(fdt->open_fds, maxfd, start);
> > }
> 
> > 
> > unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
> > {
> > unsigned fd;
> > struct file *file;
> > struct fdtable *fdt;
> > 
> > spin_lock(&files->file_lock);
> > fdt = files_fdtable(files);
> > fd = find_next_opened_fd(fdt, start);
> > if (fd >= fdt->max_fds || fd > maxfd) {
> > fd = -1;
> > goto out;
> > }
> > 
> > file = fdt->fd[fd];
> > rcu_assign_pointer(fdt->fd[fd], NULL);
> > __put_unused_fd(files, fd);
> > out:
> > spin_unlock(&files->file_lock);
> > 
> > if (fd == -1u)
> > return fd;
> > 
> > filp_close(file, files);
> > return fd + 1;
> > }
> 
> Thanks, Oleg!
> 
> I kept it dumb and was about to reply that your solution introduces more
> code, when it seemed we wanted to keep this very simple for now.
> But then I saw that find_next_opened_fd() already exists as
> find_next_fd(). So it's actually not bad compared to what I sent in v1.
> So - with some small tweaks (need to test it and all now) - how do we
> feel about this?:

That's obviously not correct atm but I'll send out a tweaked version in
a bit.

Christian


Re: [PATCH v1 1/2] open: add close_range()

2019-05-22 Thread Oleg Nesterov
On 05/22, Christian Brauner wrote:
>
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
>  {
> - struct file *file;
> + struct file *file = NULL;
>   struct fdtable *fdt;
>  
>   spin_lock(&files->file_lock);
> @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
>   goto out_unlock;
>   rcu_assign_pointer(fdt->fd[fd], NULL);
>   __put_unused_fd(files, fd);
> - spin_unlock(&files->file_lock);
> - return filp_close(file, files);
>  
>  out_unlock:
>   spin_unlock(&files->file_lock);
> - return -EBADF;
> + return file;

...

> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + if (max_fd >= cur_max)
> + max_fd = cur_max - 1;
> +
> + while (fd <= max_fd) {
> + struct file *file;
> +
> + file = pick_file(files, fd++);

Well, how about something like

static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
{
        unsigned int maxfd = fdt->max_fds;
        unsigned int maxbit = maxfd / BITS_PER_LONG;
        unsigned int bitbit = start / BITS_PER_LONG;

        bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
        if (bitbit > maxfd)
                return maxfd;
        if (bitbit > start)
                start = bitbit;
        return find_next_bit(fdt->open_fds, maxfd, start);
}

unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
{
        unsigned fd;
        struct file *file;
        struct fdtable *fdt;

        spin_lock(&files->file_lock);
        fdt = files_fdtable(files);
        fd = find_next_opened_fd(fdt, start);
        if (fd >= fdt->max_fds || fd > maxfd) {
                fd = -1;
                goto out;
        }

        file = fdt->fd[fd];
        rcu_assign_pointer(fdt->fd[fd], NULL);
        __put_unused_fd(files, fd);
out:
        spin_unlock(&files->file_lock);

        if (fd == -1u)
                return fd;

        filp_close(file, files);
        return fd + 1;
}

?

Then close_range() can do

        while (fd < max_fd)
                fd = close_next_fd(files, fd, max_fd);

Oleg.
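
An editorial note on why that loop terminates: close_next_fd() returns
fd + 1 after closing a descriptor, and -1u (UINT_MAX, from the unsigned
fd = -1 assignment) once no open descriptor remains in range, so the
unsigned comparison fd < max_fd then fails. For example, with only fds 3
and 50 open and max_fd == 40:

        fd = close_next_fd(files, 0, 40); /* closes fd 3, returns 4 */
        fd = close_next_fd(files, 4, 40); /* next open fd is 50 > 40: returns -1u */
        /* -1u is not < 40, so while (fd < max_fd) exits */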