Re: [1/2,v2] fdmap(2)

2017-10-26 Thread Andy Lutomirski
> On Oct 19, 2017, at 5:34 PM, Alexey Dobriyan  wrote:
>
> On 10/18/17, Andy Lutomirski  wrote:
>>> fdmap() is standalone thing.
>>>
>>> Next step is to design fdinfo(2) (?) which uses descriptors from
>>> fdmap(2). Extending structures is done as usual: but version,
>>> add new fields to the end.
>>
>> I very strongly disagree.  If you really want a new interface for
>> reading out information about other processes, design one that makes
>> sense as a whole.  Don't design it piecemeal. The last thing we need
>> is a crappy /proc alternative that covers a small fraction of use
>> cases.
>
> Oh well.
>
>> And demonstrate that it actually has a material benefit over
>> fixing /proc.
>
>> Meanwhile, why not just fix /proc?
>
> /proc can not be fixed. Let me reiterate the arguments one more time.
>
> Using /proc
> * overallocates memory even if temporarily,

This is fixable.  Maybe hard, but fixable.

> * instantiates dentries and inodes nobody cares about,

Your syscalls won't make a difference here because we can't just
remove /proc.  But IIRC we could give it a backing store like sysfs
has.

> * make expansion path more painful than necessary:
>  1) adding field to the end of the file is the least risky move,
>  2) decimal radix and unfixed positioning are used often

These two are fixable by adding extensible binary files.

>  3) read() doesn't accept any sort of filter for data

Indeed.  Doing this directly in /proc is hard.

>
>  (1)+(3) = those who added their fields later eat the cost of all
>  previous fields even if their field is very simple and easy
>  to calculate.
>
>  (2)+(3) = lseek() can not be used as a substitute.
>
>  4) adding new file is cleaner however every virtual file creates
> kernel objects which aren't interesting themselves.
> This has direct costs (more syscalls and lookups) and
> indirect costs (more garbage in d/icache and more flushing
> when the process dies)
>
>  5) converting to strings eats cycles (and more cycles are consumed
> if userspace decides to convert it back to binary for whatever
> reasons)

As above, fixable with new files.

>
> For those who didn't pay attention, first patch to make integer
> conversion faster made /proc/*/stat (?) 30% faster!
> This is how slow text can be.

Alternatively, that's how much improvement is available without any
ABI change at all.

>
> Sysfs is overall better but solely because it has strict one-value-per-file
> rule, so interfaces are cleaner and there is less room for mistakes.
>
>
> Philosophically, text files create false impression that parsing text
> is easy.
>
> As an example grab a bunch of students and ask them to write a program
> which parses something innocent like /proc/*/stat for process state.
>
> The beartrap is that ->comm is not escaped for Space and ')' so naive
> strchr() from the beginning doesn't work reliably. The only reliable
> way is to count spaces from the end(!) and pray kernel devs do not
> extend the file with another field or, worse, with another textual
> field.
>
> ESCAPE_SPACE doesn't escape Space, funny, isn't it?
>
> /proc/*/status kind of has the same problem however forward strstr()
> works with "\nState:\t" because \n and \t will be escaped. But nobody
> would search for just "State:", especially in scripts, right?
>
> EIATF people often critique binary people for their mistakes
> but themselves make equally stupid ones. "(unreachable)" anyone?.
>
> So the answer it not to fix /proc, the answer it to leave /proc alone.
> The answer is make Unix shell people move their lazy asses and
> implement minimal type system and a way to execute raw system calls
> like all normal programming languages do. They still haven't done it
> after dozens of years and are arrogant enough to say "oh uh we can't
> use cat and pipe to awk"

I think that asking the bash maintainers to implement PowerShell does
not fall into the lazy ass category.


Re: [PATCH 1/2 v2] fdmap(2)

2017-10-25 Thread Pavel Machek
On Wed 2017-10-25 14:45:31, Alexey Dobriyan wrote:
> On 10/23/17, Pavel Machek  wrote:
> 
> > Binary fdmap looks... quite ugly to me. But close_all(from, to)
> > syscall kindof makes sense to me... and is not that ugly.
> >
> > Given that openbsd has something similar... perhaps we can take that?
> 
> closefrom() can be implemented on top of fdmap().
> fdmap as is is useful for other purposes.

fdmap can be implemented on top of open('/dev/kmem').
open('/dev/kmem') as is is useful for other purposes.

Yes, fdmap is more powerful. No, that does not mean it is better idea.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [1/2,v2] fdmap(2)

2017-10-25 Thread Alexey Dobriyan
On 10/20/17, Greg KH  wrote:
> On Thu, Oct 19, 2017 at 05:34:35PM +0200, Alexey Dobriyan wrote:
>> So the answer it not to fix /proc, the answer it to leave /proc alone.
>
> No, because:
>
>> The answer is make Unix shell people move their lazy asses and
>> implement minimal type system and a way to execute raw system calls
>> like all normal programming languages do. They still haven't done it
>> after dozens of years and are arrogant enough to say "oh uh we can't
>> use cat and pipe to awk".
>
> Have you submitted changes and patches to those projects in order to
> implement such a thing?

This is open source equivalent of "sod off".

> If not, just complaining about it isn't going to change anything, you
> know better than that :)

Greg, I'm not "just complaining", fdmap() was implemented and
sent for discussion.

Unix shell people seem to me to be anti-intellectual if after all
these years they didn't realize that their core ideas of "universal"
interface and serialization/deserialization at every opportunity
are wrong and stupid. It is best to avoid such people and not
interact with them. However what they can do is to not become
anchors asking for the world to revolve around them.

I'm not proposing to delete /proc, I'm proposing (kind of)
to leave it as is.

P.S.: even Perl 5 has "syscall" function!


Re: [PATCH 1/2 v2] fdmap(2)

2017-10-25 Thread Alexey Dobriyan
On 10/23/17, Pavel Machek  wrote:

> Binary fdmap looks... quite ugly to me. But close_all(from, to)
> syscall kindof makes sense to me... and is not that ugly.
>
> Given that openbsd has something similar... perhaps we can take that?

closefrom() can be implemented on top of fdmap().
fdmap as is is useful for other purposes.

> Except during startup, application should have good idea what fds it
> has open, so maybe that's enough?


Re: [PATCH 1/2 v2] fdmap(2)

2017-10-23 Thread Pavel Machek
Hi!

> > From: Aliaksandr Patseyenak 
> > 
> > Implement system call for bulk retrieveing of opened descriptors
> > in binary form.
> > 
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything upto some number
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> > 
> > /proc, the only way to learn anything about file descriptors may not be
> > available. There is unavoidable overhead associated with instantiating
> > 3 dentries and 3 inodes and converting integers to strings and back.
> > 
> > Benchmark:
> > 
> > N=1<<22 times
> > 4 opened descriptors (0, 1, 2, 3)
> > opendir+readdir+closedir /proc/self/fd vs fdmap
> > 
> > /proc 8.31 ± 0.37%
> > fdmap 0.32 ± 0.72%
> 
> >From the text above, I'm still trying to understand: whose problem
> does this solve? I mean, we've lived with the daemon-close-all-files
> technique forever (and I'm not sure that performance is really an 
> important issue for the daemon case) . And you say that the effect

Binary fdmap looks... quite ugly to me. But close_all(from, to)
syscall kindof makes sense to me... and is not that ugly.

Given that openbsd has something similar... perhaps we can take that?

Except during startup, application should have good idea what fds it
has open, so maybe that's enough?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [1/2,v2] fdmap(2)

2017-10-20 Thread Greg KH
On Thu, Oct 19, 2017 at 05:34:35PM +0200, Alexey Dobriyan wrote:
> So the answer it not to fix /proc, the answer it to leave /proc alone.

No, because:

> The answer is make Unix shell people move their lazy asses and
> implement minimal type system and a way to execute raw system calls
> like all normal programming languages do. They still haven't done it
> after dozens of years and are arrogant enough to say "oh uh we can't
> use cat and pipe to awk".

Have you submitted changes and patches to those projects in order to
implement such a thing?

If not, just complaining about it isn't going to change anything, you
know better than that :)

sorry,

greg k-h


Re: [1/2,v2] fdmap(2)

2017-10-19 Thread Alexey Dobriyan
On 10/18/17, Andy Lutomirski  wrote:
>> fdmap() is standalone thing.
>>
>> Next step is to design fdinfo(2) (?) which uses descriptors from
>> fdmap(2). Extending structures is done as usual: but version,
>> add new fields to the end.
>
> I very strongly disagree.  If you really want a new interface for
> reading out information about other processes, design one that makes
> sense as a whole.  Don't design it piecemeal. The last thing we need
> is a crappy /proc alternative that covers a small fraction of use
> cases.

Oh well.

> And demonstrate that it actually has a material benefit over
> fixing /proc.

> Meanwhile, why not just fix /proc?

/proc can not be fixed. Let me reiterate the arguments one more time.

Using /proc
* overallocates memory even if temporarily,
* instantiates dentries and inodes nobody cares about,
* make expansion path more painful than necessary:
  1) adding field to the end of the file is the least risky move,
  2) decimal radix and unfixed positioning are used often
  3) read() doesn't accept any sort of filter for data

  (1)+(3) = those who added their fields later eat the cost of all
  previous fields even if their field is very simple and easy
  to calculate.

  (2)+(3) = lseek() can not be used as a substitute.

  4) adding new file is cleaner however every virtual file creates
 kernel objects which aren't interesting themselves.
 This has direct costs (more syscalls and lookups) and
 indirect costs (more garbage in d/icache and more flushing
 when the process dies)

  5) converting to strings eats cycles (and more cycles are consumed
 if userspace decides to convert it back to binary for whatever
 reasons)

 For those who didn't pay attention, first patch to make integer
 conversion faster made /proc/*/stat (?) 30% faster!
 This is how slow text can be.

Sysfs is overall better but solely because it has strict one-value-per-file
rule, so interfaces are cleaner and there is less room for mistakes.


Philosophically, text files create false impression that parsing text
is easy.

As an example grab a bunch of students and ask them to write a program
which parses something innocent like /proc/*/stat for process state.

The beartrap is that ->comm is not escaped for Space and ')' so naive
strchr() from the beginning doesn't work reliably. The only reliable
way is to count spaces from the end(!) and pray kernel devs do not
extend the file with another field or, worse, with another textual
field.

ESCAPE_SPACE doesn't escape Space, funny, isn't it?

/proc/*/status kind of has the same problem however forward strstr()
works with "\nState:\t" because \n and \t will be escaped. But nobody
would search for just "State:", especially in scripts, right?

EIATF people often critique binary people for their mistakes
but themselves make equally stupid ones. "(unreachable)" anyone?.

So the answer it not to fix /proc, the answer it to leave /proc alone.
The answer is make Unix shell people move their lazy asses and
implement minimal type system and a way to execute raw system calls
like all normal programming languages do. They still haven't done it
after dozens of years and are arrogant enough to say "oh uh we can't
use cat and pipe to awk".


Re: [1/2,v2] fdmap(2)

2017-10-18 Thread Andy Lutomirski
> On Oct 18, 2017, at 4:35 AM, Alexey Dobriyan  wrote:
>
>> On 10/12/17, Andrei Vagin  wrote:
>>
>> I'm agree with your points, but I think you choose a wrong set of data
>> to make an example of a new approach.
>>
>> You are talking a lot about statx, but for me it is unclear how fdmap
>> follows the idea of statx. Let's imagine that I want to extend fdmap to
>> return mnt_id for each file descriptor?
>
> fdmap() is standalone thing.
>
> Next step is to design fdinfo(2) (?) which uses descriptors from fdmap(2).
> Extending structures is done as usual: but version, add new fields to the end.
>

I very strongly disagree.  If you really want a new interface for
reading out information about other processes, design one that makes
sense as a whole.  Don't design it piecemeal. The last thing we need
is a crappy /proc alternative that covers a small fraction of use
cases.  And demonstrate that it actually has a material benefit over
fixing /proc.

Meanwhile, why not just fix /proc?


Re: [1/2,v2] fdmap(2)

2017-10-18 Thread Alexey Dobriyan
On 10/12/17, Andrei Vagin  wrote:

> I'm agree with your points, but I think you choose a wrong set of data
> to make an example of a new approach.
>
> You are talking a lot about statx, but for me it is unclear how fdmap
> follows the idea of statx. Let's imagine that I want to extend fdmap to
> return mnt_id for each file descriptor?

fdmap() is standalone thing.

Next step is to design fdinfo(2) (?) which uses descriptors from fdmap(2).
Extending structures is done as usual: but version, add new fields to the end.

> Or it may be more complex case, when we decided to provide all data
> from /proc/pid/fdinfo/X for each descriptor. A set of fields in fdinfo
> depends on a type of a file descriptor, it is different for epoll,
> signalfd, inotify, sockets, etc.
>
> For inotify file descriptors, there are information about all watches,
> so it is not possible to use a fixed size struture to present this data.

Now I didn't look closely at inotify watches but this is done with unions,
zero-length trailing arrays, and other usual stuff. If OpenVZ/Virtuozzo is
doing C/R of inotify then you already have all serializing code.


Re: [1/2,v2] fdmap(2)

2017-10-12 Thread Andrei Vagin
On Wed, Oct 11, 2017 at 09:12:34PM +0300, Alexey Dobriyan wrote:
> On Tue, Oct 10, 2017 at 03:08:06PM -0700, Andrei Vagin wrote:
> > On Sun, Sep 24, 2017 at 11:06:20PM +0300, Alexey Dobriyan wrote:
> > > From: Aliaksandr Patseyenak 
> > > 
> > > Implement system call for bulk retrieveing of opened descriptors
> > > in binary form.
> > > 
> > > Some daemons could use it to reliably close file descriptors
> > > before starting. Currently they close everything upto some number
> > > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > > (although lsof does so much in /proc that the effect is thoroughly 
> > > buried).
> > 
> > Hello Alexey,
> > 
> > I am not sure about the idea to add syscalls for all sort of process
> > attributes. For example, in CRIU we need file descriptors with their
> > properties, which we currently get from /proc/pid/fdinfo/. How can
> > this interface be extended to achieve our goal?
> > 
> > Have you seen the task-diag interface what I sent about a year ago?
> 
> Of course, let's discuss /proc/task_diag.
> 
> Adding it as /proc file is obviously unnecessary: you do it only
> to hook ->read and ->write netlink style
> (and BTW you don't need .THIS_MODULE anymore ;-)
> 
> Transactional netlink send and recv aren't necessary either.
> As I understand it, it comes from old times when netlink was async,
> so 2 syscalls were neccesary. Netlink is not async anymore.
> 
> Basically you want to do sys_task_diag(2) which accepts set of pids
> (maybe) and a mask (see statx()) and returns synchronously result into
> a buffer.

You are not quite right here. We send a request and then we read a
response, which can be bigger than what we can read for one call.

So we need something like a cursor, in your case it is the "start"
argument. But sometimes this cursor contains a kernel internal data
to have a better performance. We need to have a way to address this
cursor from userspace, and it is a reason why we need a file
descriptor in this scheme.

For example, you can look at the proc_maps_private structure.


> 
> > We had a discussion on the previous kernel summit how to rework
> > task-diag, so that it can be merged into the upstream kernel.
> > Unfortunately, I didn't send a summary for this discussion. But it's
> > better now than never. We decided to do something like this:
> > 
> > 1. Add a new syscall readfile(fname, buf, size), which can be
> > used to read small files without opening a file descriptor. It will be
> > useful for proc files, configs, etc.
> 
> If nothing, it should be done because the number of programmers capable
> of writing readfile() in userspace correctly handling all errors and
> short reads is very small indeed. Out of curiosity I once booted a kernel
> which made all reads short by default. It was fascinating I can tell you.
> 
> > 2. bin/text/bin conversion is very slow
> >  - 65.47% proc_pid_status
> >   - 20.81% render_sigset_t
> >- 18.27% seq_printf
> > + 15.77% seq_vprintf
> >   - 10.65% task_mem
> > + 8.78% seq_print
> > + 1.02% hugetlb_rep
> >   + 7.40% seq_printf
> > so a new interface has to use a binary format and the format of netlink
> > messages can be used here. It should be possible to extend a file
> > without breaking backward compatibility.
> 
> Binary -- yes.
> netlink attributes -- maybe.
> 
> There is statx() model which is perfect for this usecase:
> do not want pagecache of all block devices? sure, no problem.
> 
> > 3. There are a lot of objection to use a netlink sockets out of the network
> > subsystem. The idea of using a "transaction" file looks weird for many
> > people, so we decided to add a few files in /proc/pid/. I see
> > minimum two files. One file contains information about a task, it is
> > mostly what we have in /proc/pid/status and /proc/pid/stat. Another file
> > describes a task memory, it is what we have now in /proc/pid/smaps.
> > Here is one more major idea. All attributes in a file has to be equal in
> > term of performance, or by other words there should not be attributes,
> > which significantly affect a generation time of a whole file.
> > 
> > If we look at /proc/pid/smaps, we spend a lot of time to get memory
> > statistics. This file contains a lot of data and if you read it to get
> > VmFlags, the kernel will waste your time by generating a useless data
> > for you.
> 
> There is a unsolvable problem with /proc/*/stat style files. Anyone
> who wants to add new stuff has a desicion to make, whether add new /proc
> file or extend existing /proc file.
> 
> Adding new /proc file means 3 syscalls currently, it surely will become
> better with aforementioned readfileat() but even adding tons of symlinks
> like this:
> 
>   $ readlink /proc/self/affinity
>   0f
> 
> would have been better -- readlink doesn't open files.
> 
> Adding to existing file means _all_ users have to eat the cost as
> read(2) doesn't accept any sort of mask to filter data. Most /proc files
> are seq

Re: [1/2,v2] fdmap(2)

2017-10-11 Thread Alexey Dobriyan
On Tue, Oct 10, 2017 at 03:08:06PM -0700, Andrei Vagin wrote:
> On Sun, Sep 24, 2017 at 11:06:20PM +0300, Alexey Dobriyan wrote:
> > From: Aliaksandr Patseyenak 
> > 
> > Implement system call for bulk retrieveing of opened descriptors
> > in binary form.
> > 
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything upto some number
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> 
> Hello Alexey,
> 
> I am not sure about the idea to add syscalls for all sort of process
> attributes. For example, in CRIU we need file descriptors with their
> properties, which we currently get from /proc/pid/fdinfo/. How can
> this interface be extended to achieve our goal?
> 
> Have you seen the task-diag interface what I sent about a year ago?

Of course, let's discuss /proc/task_diag.

Adding it as /proc file is obviously unnecessary: you do it only
to hook ->read and ->write netlink style
(and BTW you don't need .THIS_MODULE anymore ;-)

Transactional netlink send and recv aren't necessary either.
As I understand it, it comes from old times when netlink was async,
so 2 syscalls were neccesary. Netlink is not async anymore.

Basically you want to do sys_task_diag(2) which accepts set of pids
(maybe) and a mask (see statx()) and returns synchronously result into
a buffer.

> We had a discussion on the previous kernel summit how to rework
> task-diag, so that it can be merged into the upstream kernel.
> Unfortunately, I didn't send a summary for this discussion. But it's
> better now than never. We decided to do something like this:
> 
> 1. Add a new syscall readfile(fname, buf, size), which can be
> used to read small files without opening a file descriptor. It will be
> useful for proc files, configs, etc.

If nothing, it should be done because the number of programmers capable
of writing readfile() in userspace correctly handling all errors and
short reads is very small indeed. Out of curiosity I once booted a kernel
which made all reads short by default. It was fascinating I can tell you.

> 2. bin/text/bin conversion is very slow
>  - 65.47% proc_pid_status
>   - 20.81% render_sigset_t
>- 18.27% seq_printf
> + 15.77% seq_vprintf
>   - 10.65% task_mem
> + 8.78% seq_print
> + 1.02% hugetlb_rep
>   + 7.40% seq_printf
> so a new interface has to use a binary format and the format of netlink
> messages can be used here. It should be possible to extend a file
> without breaking backward compatibility.

Binary -- yes.
netlink attributes -- maybe.

There is statx() model which is perfect for this usecase:
do not want pagecache of all block devices? sure, no problem.

> 3. There are a lot of objection to use a netlink sockets out of the network
> subsystem. The idea of using a "transaction" file looks weird for many
> people, so we decided to add a few files in /proc/pid/. I see
> minimum two files. One file contains information about a task, it is
> mostly what we have in /proc/pid/status and /proc/pid/stat. Another file
> describes a task memory, it is what we have now in /proc/pid/smaps.
> Here is one more major idea. All attributes in a file has to be equal in
> term of performance, or by other words there should not be attributes,
> which significantly affect a generation time of a whole file.
> 
> If we look at /proc/pid/smaps, we spend a lot of time to get memory
> statistics. This file contains a lot of data and if you read it to get
> VmFlags, the kernel will waste your time by generating a useless data
> for you.

There is a unsolvable problem with /proc/*/stat style files. Anyone
who wants to add new stuff has a desicion to make, whether add new /proc
file or extend existing /proc file.

Adding new /proc file means 3 syscalls currently, it surely will become
better with aforementioned readfileat() but even adding tons of symlinks
like this:

$ readlink /proc/self/affinity
0f

would have been better -- readlink doesn't open files.

Adding to existing file means _all_ users have to eat the cost as
read(2) doesn't accept any sort of mask to filter data. Most /proc files
are seqfiles now which most of the time internally generates whole buffer
before shipping data to userspace. cat(1) does 32KB read by default
which is bigger than most of files in /proc and stat'ing /proc files is
useless because they're all 0 length. Reliable rewinding to necessary data
is possible only with memchr() which misses the point.

Basically, those sacred text files the Universe consists of suck.

With statx() model the cost of extending result with new data is very
small -- 1 branch to skip generation of data.

I suggest that anyone who dares to improve the situation with process
statistics and anything /proc related uses it as a model.

Of course, I also suggest to freeze /proc for new stuff to press
the issue but one can only dream.


Re: [PATCH 1/2 v2] fdmap(2)

2017-10-11 Thread Alexey Dobriyan
On Thu, Sep 28, 2017 at 08:02:23AM -0700, Andy Lutomirski wrote:
> On Thu, Sep 28, 2017 at 3:55 AM, Alexey Dobriyan  wrote:
> > On 9/28/17, Michael Kerrisk (man-pages)  wrote:
> >> On 27 September 2017 at 17:03, Andy Lutomirski  wrote:
> >
>  The idea is to start process. In ideal world, only bynary system calls
>  would exist and shells could emulate /proc/* same way bash implement
>  /dev/tcp
> >>>
> >>> Then start the process by doing it for real and making it obviously
> >>> useful.  We should not add a pair of vaguely useful, rather weak
> >>> syscalls just to start a process of modernizing /proc.
> >
> > Before doing it for real it would be nice to have at least a nod
> > from people in charge that syscalls which return binary
> > information are OK. Otherwise some EIATF guy will just say
> > "NAK /proc is fine, it always was fine".
> 
> There's nothing inherently wrong with syscalls that return binary
> information.  There is something wrong with reinventing the world with
> insufficient justification, though.
> 
> /proc may be old, clunky, and kind of slow, but it has a lot of good
> things going for it.  It supports access control (DAC and MAC).  It
> handles namespacing in a way that's awkward but supported by all the
> existing namespace managers.  It may soon support mount options, which
> is rather important.
> 
> I feel like we've been discussing this performance issue for over a
> year, and I distinctly recall discussing it in Santa Fe.  I suggested
> a two-pronged approach:
> 
> 1. Add a new syscall that will, in a single call, open, read, and
> close a proc file and maybe even a regular non-proc file.  Like this:
> 
> long readfileat(int dirfd, const char *path, void *buf, size_t len, int 
> flags);
> 
> 2. Where needed, add new /proc files with lighter-weight
> representations.  I think we discussed something that's like status
> but in nl_attr format.
> 
> This doesn't end up with a bunch of code duplication the way that a
> full-blown syscall-based reimplementation would.  It supports all the
> /proc features rather than just a subset.  It fully respects access
> control, future mount options, and the potential lack of a /proc
> mount.

Aplogies for delayed answer.

The code duplication exists solely because sometime in the beginning
/proc was chosen. There was another route: design system calls for
getting process statistics and add another binary to coreutils
which uses said system calls so that shell scripts could use them.

It is _never_ late to simply stop expanding /proc.

> > Or look from another angle: sched_setaffinity exists but there is
> > no /proc counterpart, shells must use taskset(1) and world didn't end.
> 
> sched_setaffinity() modifies the caller.  /proc wouldn't have made much sense.

I think you're technically wrong: sched_setaffinity(2) doesn't modify
supplied cpumask in userspace. But even if it did it doesn't matter:
imaginary /proc/*/affinity file would simply be re-read to verify that
it was indeed set correctly to mimic umask(2) behaviour.

> >> I concur.
> >>
> >> Alexey, you still have not wxplained who specifically needs this
> >> right now, and how, precisely, they plan to use the new system calls.
> >> It is all very arm-wavey so far.
> >
> > It is not if you read even example program in the original patch.
> > Any program which queries information about file descriptors
> > will benefit both in CPU and memory usage.
> >
> > void closefrom(int start)
> > {
> > int fd[1024];
> > int n;
> >
> > while ((n = fdmap(0, fd, sizeof(fd)/sizeof(fd[0]), start)) > 0) {
> > unsigned int i;
> >
> > for (i = 0; i < n; i++)
> > close(fd[i]);
> >
> > start = fd[n - 1] + 1;
> > }
> > }
> >
> > CRIU naturally to know everything about descriptors of target processes:
> > It does:
> >
> > int predump_task_files(int pid)
> > {
> > struct dirent *de;
> > DIR *fd_dir;
> > int ret = -1;
> >
> > pr_info("Pre-dump fds for %d)\n", pid);
> >
> > fd_dir = opendir_proc(pid, "fd");
> > if (!fd_dir)
> > return -1;
> >
> > while ((de = readdir(fd_dir))) {
> > if (dir_dots(de))
> > continue;
> >
> > if (predump_one_fd(pid, atoi(de->d_name)))
> > goto out;
> > }
> >
> > ret = 0;
> > out:
> > closedir(fd_dir);
> > return ret;
> > }
> >
> > which is again inefficient.
> 
> And /proc/PID/fds would solve this.

Retaining all /proc problems, to reiterate:
* 3 dentries, 3 inodes (allocating and instantiating),
* 1 struct file + descriptor
* 1 page for seqfile buffer,
* converting, copying more than necessary (strings are never better than binary)

readfileat() would help with "struct file" part, but not with anything else.

fdmap(2) simply avoids _all_ overhead except of that truly necessary:
security 

Re: [1/2,v2] fdmap(2)

2017-10-10 Thread Andrei Vagin
On Sun, Sep 24, 2017 at 11:06:20PM +0300, Alexey Dobriyan wrote:
> From: Aliaksandr Patseyenak 
> 
> Implement system call for bulk retrieveing of opened descriptors
> in binary form.
> 
> Some daemons could use it to reliably close file descriptors
> before starting. Currently they close everything upto some number
> which formally is not reliable. Other natural users are lsof(1) and CRIU
> (although lsof does so much in /proc that the effect is thoroughly buried).

Hello Alexey,

I am not sure about the idea to add syscalls for all sort of process
attributes. For example, in CRIU we need file descriptors with their
properties, which we currently get from /proc/pid/fdinfo/. How can
this interface be extended to achieve our goal?

Have you seen the task-diag interface what I sent about a year ago?

We had a discussion on the previous kernel summit how to rework
task-diag, so that it can be merged into the upstream kernel.
Unfortunately, I didn't send a summary for this discussion. But it's
better now than never. We decided to do something like this:

1. Add a new syscall readfile(fname, buf, size), which can be
used to read small files without opening a file descriptor. It will be
useful for proc files, configs, etc.

2. bin/text/bin conversion is very slow
 - 65.47% proc_pid_status
  - 20.81% render_sigset_t
   - 18.27% seq_printf
+ 15.77% seq_vprintf
  - 10.65% task_mem
+ 8.78% seq_print
+ 1.02% hugetlb_rep
  + 7.40% seq_printf
so a new interface has to use a binary format and the format of netlink
messages can be used here. It should be possible to extend a file
without breaking backward compatibility.

3. There are a lot of objection to use a netlink sockets out of the network
subsystem. The idea of using a "transaction" file looks weird for many
people, so we decided to add a few files in /proc/pid/. I see
minimum two files. One file contains information about a task, it is
mostly what we have in /proc/pid/status and /proc/pid/stat. Another file
describes a task memory, it is what we have now in /proc/pid/smaps.
Here is one more major idea. All attributes in a file has to be equal in
term of performance, or by other words there should not be attributes,
which significantly affect a generation time of a whole file.

If we look at /proc/pid/smaps, we spend a lot of time to get memory
statistics. This file contains a lot of data and if you read it to get
VmFlags, the kernel will waste your time by generating a useless data
for you.

Here is my slides for this discussion:
https://www.linuxplumbersconf.org/2016/ocw/system/presentations/4599/original/Netlink-issues.pdf

Could you add me into recipients for this sort of patches in a future?

Thanks,
Andrei

> 
> /proc, the only way to learn anything about file descriptors may not be
> available. There is unavoidable overhead associated with instantiating
> 3 dentries and 3 inodes and converting integers to strings and back.
> 
> Benchmark:
> 
>   N=1<<22 times
>   4 opened descriptors (0, 1, 2, 3)
>   opendir+readdir+closedir /proc/self/fd vs fdmap
> 
>   /proc 8.31 ± 0.37%
>   fdmap 0.32 ± 0.72%
> 
> 
> FDMAP(2)   Linux Programmer's Manual  FDMAP(2)
> 
> NAME
>fdmap - get open file descriptors of the process
> 
> SYNOPSIS
>long fdmap(pid_t pid, int *fd, unsigned int nfd, int start, int flags);
> 
> DESCRIPTION
>fdmap()  writes  open  file  descriptors  of the process into buffer fd
>starting from the start descriptor. At most nfd elements  are  written.
>flags argument is reserved and must be zero.
> 
>If pid is zero, syscall will work with the current process.
> 
> RETURN VALUE
>On success, number of descriptors written is returned.  On error, -1 is
>returned, and errno is set appropriately.
> 
> ERRORS
>ESRCH  No such process.
> 
>EACCES Permission denied.
> 
>EFAULT Invalid fd pointer.
> 
>EINVAL Negative start argument.
> 
> NOTES
>Glibc does not provide a wrapper for these system call; call  it  using
>syscall(2).
> 
> EXAMPLE
>The program below demonstrates fdmap() usage.
> 
>$ ./a.out $$
>0 1 2 255
> 
>$ ./a.out 420 1 2 42
>1023
> 
>Program source
> 
>#include 
>#include 
>#include 
> 
>static inline long fdmap(int pid, int *fd, unsigned int nfd, unsigned 
> int start, int flags)
>{
> register long r10 asm ("r10") = start;
> register long r8 asm ("r8") = flags;
> long rv;
> asm volatile (
>  "syscall"
>  : "=a" (rv)
>  : "0" (333), "D" (pid), "S" (fd), "d" (nfd), "r" (r10), "r" 
> (r8)
>  : "rcx", "r11", "cc", "memory"
> );
> return rv;
>}
> 
>int main(int argc, char *argv[])
>{
> pid_t pid;
>  

Re: [PATCH 1/2 v2] fdmap(2)

2017-09-28 Thread Andy Lutomirski
On Thu, Sep 28, 2017 at 3:55 AM, Alexey Dobriyan  wrote:
> On 9/28/17, Michael Kerrisk (man-pages)  wrote:
>> On 27 September 2017 at 17:03, Andy Lutomirski  wrote:
>
 The idea is to start process. In ideal world, only bynary system calls
 would exist and shells could emulate /proc/* same way bash implement
 /dev/tcp
>>>
>>> Then start the process by doing it for real and making it obviously
>>> useful.  We should not add a pair of vaguely useful, rather weak
>>> syscalls just to start a process of modernizing /proc.
>
> Before doing it for real it would be nice to have at least a nod
> from people in charge that syscalls which return binary
> information are OK. Otherwise some EIATF guy will just say
> "NAK /proc is fine, it always was fine".

There's nothing inherently wrong with syscalls that return binary
information.  There is something wrong with reinventing the world with
insufficient justification, though.

/proc may be old, clunky, and kind of slow, but it has a lot of good
things going for it.  It supports access control (DAC and MAC).  It
handles namespacing in a way that's awkward but supported by all the
existing namespace managers.  It may soon support mount options, which
is rather important.

I feel like we've been discussing this performance issue for over a
year, and I distinctly recall discussing it in Santa Fe.  I suggested
a two-pronged approach:

1. Add a new syscall that will, in a single call, open, read, and
close a proc file and maybe even a regular non-proc file.  Like this:

long readfileat(int dirfd, const char *path, void *buf, size_t len, int flags);

2. Where needed, add new /proc files with lighter-weight
representations.  I think we discussed something that's like status
but in nl_attr format.

This doesn't end up with a bunch of code duplication the way that a
full-blown syscall-based reimplementation would.  It supports all the
/proc features rather than just a subset.  It fully respects access
control, future mount options, and the potential lack of a /proc
mount.

>
> Or look from another angle: sched_setaffinity exists but there is
> no /proc counterpart, shells must use taskset(1) and world didn't end.

sched_setaffinity() modifies the caller.  /proc wouldn't have made much sense.

>
>> I concur.
>>
>> Alexey, you still have not wxplained who specifically needs this
>> right now, and how, precisely, they plan to use the new system calls.
>> It is all very arm-wavey so far.
>
> It is not if you read even example program in the original patch.
> Any program which queries information about file descriptors
> will benefit both in CPU and memory usage.
>
> void closefrom(int start)
> {
> int fd[1024];
> int n;
>
> while ((n = fdmap(0, fd, sizeof(fd)/sizeof(fd[0]), start)) > 0) {
> unsigned int i;
>
> for (i = 0; i < n; i++)
> close(fd[i]);
>
> start = fd[n - 1] + 1;
> }
> }
>
> CRIU naturally to know everything about descriptors of target processes:
> It does:
>
> int predump_task_files(int pid)
> {
> struct dirent *de;
> DIR *fd_dir;
> int ret = -1;
>
> pr_info("Pre-dump fds for %d)\n", pid);
>
> fd_dir = opendir_proc(pid, "fd");
> if (!fd_dir)
> return -1;
>
> while ((de = readdir(fd_dir))) {
> if (dir_dots(de))
> continue;
>
> if (predump_one_fd(pid, atoi(de->d_name)))
> goto out;
> }
>
> ret = 0;
> out:
> closedir(fd_dir);
> return ret;
> }
>
> which is again inefficient.

And /proc/PID/fds would solve this.


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-28 Thread Alexey Dobriyan
On 9/28/17, Michael Kerrisk (man-pages)  wrote:
> On 27 September 2017 at 17:03, Andy Lutomirski  wrote:

>>> The idea is to start process. In ideal world, only bynary system calls
>>> would exist and shells could emulate /proc/* same way bash implement
>>> /dev/tcp
>>
>> Then start the process by doing it for real and making it obviously
>> useful.  We should not add a pair of vaguely useful, rather weak
>> syscalls just to start a process of modernizing /proc.

Before doing it for real it would be nice to have at least a nod
from people in charge that syscalls which return binary
information are OK. Otherwise some EIATF guy will just say
"NAK /proc is fine, it always was fine".

Or look from another angle: sched_setaffinity exists but there is
no /proc counterpart, shells must use taskset(1) and world didn't end.

> I concur.
>
> Alexey, you still have not wxplained who specifically needs this
> right now, and how, precisely, they plan to use the new system calls.
> It is all very arm-wavey so far.

It is not if you read even example program in the original patch.
Any program which queries information about file descriptors
will benefit both in CPU and memory usage.

void closefrom(int start)
{
int fd[1024];
int n;

while ((n = fdmap(0, fd, sizeof(fd)/sizeof(fd[0]), start)) > 0) {
unsigned int i;

for (i = 0; i < n; i++)
close(fd[i]);

start = fd[n - 1] + 1;
}
}

CRIU naturally to know everything about descriptors of target processes:
It does:

int predump_task_files(int pid)
{
struct dirent *de;
DIR *fd_dir;
int ret = -1;

pr_info("Pre-dump fds for %d)\n", pid);

fd_dir = opendir_proc(pid, "fd");
if (!fd_dir)
return -1;

while ((de = readdir(fd_dir))) {
if (dir_dots(de))
continue;

if (predump_one_fd(pid, atoi(de->d_name)))
goto out;
}

ret = 0;
out:
closedir(fd_dir);
return ret;
}

which is again inefficient.


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-28 Thread Alexey Dobriyan
On 9/27/17, Andy Lutomirski  wrote:
> On Tue, Sep 26, 2017 at 12:00 PM, Alexey Dobriyan 
> wrote:
>> On Mon, Sep 25, 2017 at 09:42:58AM +0200, Michael Kerrisk (man-pages)
>> wrote:
>>> [Not sure why original author is not in CC; added]
>>>
>>> Hello Alexey,
>>>
>>> On 09/24/2017 10:06 PM, Alexey Dobriyan wrote:
>>> > From: Aliaksandr Patseyenak 
>>> >
>>> > Implement system call for bulk retrieveing of opened descriptors
>>> > in binary form.
>>> >
>>> > Some daemons could use it to reliably close file descriptors
>>> > before starting. Currently they close everything upto some number
>>> > which formally is not reliable. Other natural users are lsof(1) and
>>> > CRIU
>>> > (although lsof does so much in /proc that the effect is thoroughly
>>> > buried).
>>> >
>>> > /proc, the only way to learn anything about file descriptors may not
>>> > be
>>> > available. There is unavoidable overhead associated with instantiating
>>> > 3 dentries and 3 inodes and converting integers to strings and back.
>>> >
>>> > Benchmark:
>>> >
>>> > N=1<<22 times
>>> > 4 opened descriptors (0, 1, 2, 3)
>>> > opendir+readdir+closedir /proc/self/fd vs fdmap
>>> >
>>> > /proc 8.31 ą 0.37%
>>> > fdmap 0.32 ą 0.72%
>>>
>>> From the text above, I'm still trying to understand: whose problem
>>> does this solve? I mean, we've lived with the daemon-close-all-files
>>> technique forever (and I'm not sure that performance is really an
>>> important issue for the daemon case) .
>>
>>> And you say that the effect for lsof(1) will be buried.
>>
>> If only fdmap(2) is added, then effect will be negligible for lsof
>> because it has to go through /proc anyway.
>>
>> The idea is to start process. In ideal world, only bynary system calls
>> would exist and shells could emulate /proc/* same way bash implement
>> /dev/tcp
>
> Then start the process by doing it for real and making it obviously
> useful.  We should not add a pair of vaguely useful, rather weak
> syscalls just to start a process of modernizing /proc.
>
>>
>>> So, who does this new system call
>>> really help? (Note: I'm not saying don't add the syscall, but from
>>> explanation given here, it's not clear why we should.)
>>
>> For fdmap(2) natural users are lsof(), CRIU.
>
> lsof does:
>
> int
> main(argc, argv)
> int argc;
> char *argv[];
> {
> ...
> if ((MaxFd = (int) GET_MAX_FD()) < 53)
> MaxFd = 53;
> for (i = 3; i < MaxFd; i++)
> (void) close(i);
>
> The solution isn't to wrangle fdmap(2) into this code.  The solution
> is to remove the code entirely.

What do you think about this code from OpenSSH?

/*
 * Discard other fds that are hanging around. These can cause problem
 * with backgrounded ssh processes started by ControlPersist.
 */
closefrom(STDERR_FILENO + 1);


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-28 Thread Michael Kerrisk (man-pages)
On 27 September 2017 at 17:03, Andy Lutomirski  wrote:
> On Tue, Sep 26, 2017 at 12:00 PM, Alexey Dobriyan  wrote:
>> On Mon, Sep 25, 2017 at 09:42:58AM +0200, Michael Kerrisk (man-pages) wrote:
>>> [Not sure why original author is not in CC; added]
>>>
>>> Hello Alexey,
>>>
>>> On 09/24/2017 10:06 PM, Alexey Dobriyan wrote:
>>> > From: Aliaksandr Patseyenak 
>>> >
>>> > Implement system call for bulk retrieveing of opened descriptors
>>> > in binary form.
>>> >
>>> > Some daemons could use it to reliably close file descriptors
>>> > before starting. Currently they close everything upto some number
>>> > which formally is not reliable. Other natural users are lsof(1) and CRIU
>>> > (although lsof does so much in /proc that the effect is thoroughly 
>>> > buried).
>>> >
>>> > /proc, the only way to learn anything about file descriptors may not be
>>> > available. There is unavoidable overhead associated with instantiating
>>> > 3 dentries and 3 inodes and converting integers to strings and back.
>>> >
>>> > Benchmark:
>>> >
>>> > N=1<<22 times
>>> > 4 opened descriptors (0, 1, 2, 3)
>>> > opendir+readdir+closedir /proc/self/fd vs fdmap
>>> >
>>> > /proc 8.31 ą 0.37%
>>> > fdmap 0.32 ą 0.72%
>>>
>>> From the text above, I'm still trying to understand: whose problem
>>> does this solve? I mean, we've lived with the daemon-close-all-files
>>> technique forever (and I'm not sure that performance is really an
>>> important issue for the daemon case) .
>>
>>> And you say that the effect for lsof(1) will be buried.
>>
>> If only fdmap(2) is added, then effect will be negligible for lsof
>> because it has to go through /proc anyway.
>>
>> The idea is to start process. In ideal world, only bynary system calls
>> would exist and shells could emulate /proc/* same way bash implement
>> /dev/tcp
>
> Then start the process by doing it for real and making it obviously
> useful.  We should not add a pair of vaguely useful, rather weak
> syscalls just to start a process of modernizing /proc.

I concur.

Alexey, you still have not wxplained who specifically needs this
right now, and how, precisely, they plan to use the new system calls.
It is all very arm-wavey so far.

Thanks,

Michael


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-27 Thread Andy Lutomirski
On Tue, Sep 26, 2017 at 12:00 PM, Alexey Dobriyan  wrote:
> On Mon, Sep 25, 2017 at 09:42:58AM +0200, Michael Kerrisk (man-pages) wrote:
>> [Not sure why original author is not in CC; added]
>>
>> Hello Alexey,
>>
>> On 09/24/2017 10:06 PM, Alexey Dobriyan wrote:
>> > From: Aliaksandr Patseyenak 
>> >
>> > Implement system call for bulk retrieveing of opened descriptors
>> > in binary form.
>> >
>> > Some daemons could use it to reliably close file descriptors
>> > before starting. Currently they close everything upto some number
>> > which formally is not reliable. Other natural users are lsof(1) and CRIU
>> > (although lsof does so much in /proc that the effect is thoroughly buried).
>> >
>> > /proc, the only way to learn anything about file descriptors may not be
>> > available. There is unavoidable overhead associated with instantiating
>> > 3 dentries and 3 inodes and converting integers to strings and back.
>> >
>> > Benchmark:
>> >
>> > N=1<<22 times
>> > 4 opened descriptors (0, 1, 2, 3)
>> > opendir+readdir+closedir /proc/self/fd vs fdmap
>> >
>> > /proc 8.31 ą 0.37%
>> > fdmap 0.32 ą 0.72%
>>
>> From the text above, I'm still trying to understand: whose problem
>> does this solve? I mean, we've lived with the daemon-close-all-files
>> technique forever (and I'm not sure that performance is really an
>> important issue for the daemon case) .
>
>> And you say that the effect for lsof(1) will be buried.
>
> If only fdmap(2) is added, then effect will be negligible for lsof
> because it has to go through /proc anyway.
>
> The idea is to start process. In ideal world, only bynary system calls
> would exist and shells could emulate /proc/* same way bash implement
> /dev/tcp

Then start the process by doing it for real and making it obviously
useful.  We should not add a pair of vaguely useful, rather weak
syscalls just to start a process of modernizing /proc.

>
>> So, who does this new system call
>> really help? (Note: I'm not saying don't add the syscall, but from
>> explanation given here, it's not clear why we should.)
>
> For fdmap(2) natural users are lsof(), CRIU.

lsof does:

int
main(argc, argv)
int argc;
char *argv[];
{
...
if ((MaxFd = (int) GET_MAX_FD()) < 53)
MaxFd = 53;
for (i = 3; i < MaxFd; i++)
(void) close(i);

The solution isn't to wrangle fdmap(2) into this code.  The solution
is to remove the code entirely.


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-26 Thread Alexey Dobriyan
On Mon, Sep 25, 2017 at 09:42:58AM +0200, Michael Kerrisk (man-pages) wrote:
> [Not sure why original author is not in CC; added]
> 
> Hello Alexey,
> 
> On 09/24/2017 10:06 PM, Alexey Dobriyan wrote:
> > From: Aliaksandr Patseyenak 
> > 
> > Implement system call for bulk retrieveing of opened descriptors
> > in binary form.
> > 
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything upto some number
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> > 
> > /proc, the only way to learn anything about file descriptors may not be
> > available. There is unavoidable overhead associated with instantiating
> > 3 dentries and 3 inodes and converting integers to strings and back.
> > 
> > Benchmark:
> > 
> > N=1<<22 times
> > 4 opened descriptors (0, 1, 2, 3)
> > opendir+readdir+closedir /proc/self/fd vs fdmap
> > 
> > /proc 8.31 ± 0.37%
> > fdmap 0.32 ± 0.72%
> 
> From the text above, I'm still trying to understand: whose problem
> does this solve? I mean, we've lived with the daemon-close-all-files
> technique forever (and I'm not sure that performance is really an 
> important issue for the daemon case) .

> And you say that the effect for lsof(1) will be buried.

If only fdmap(2) is added, then effect will be negligible for lsof
because it has to go through /proc anyway.

The idea is to start process. In ideal world, only bynary system calls
would exist and shells could emulate /proc/* same way bash implement
/dev/tcp

> So, who does this new system call
> really help? (Note: I'm not saying don't add the syscall, but from
> explanation given here, it's not clear why we should.)

For fdmap(2) natural users are lsof(), CRIU.

At some point, checkpointing was moved to userspace forcing them
to run all over /proc extracting information which could be recovered in
couple of locks, bunch of list iterations and dereferences (just read CRIU).
All of this could not be beneficial for performance.

Parsing text files doesn't help either: most of the numbers in
/proc/*/stat et al are unpadded decimals so that user can't rewind to
exact field he wants.


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-26 Thread Alexey Dobriyan
On Sun, Sep 24, 2017 at 02:31:23PM -0700, Andy Lutomirski wrote:
> On Sun, Sep 24, 2017 at 1:06 PM, Alexey Dobriyan  wrote:
> > From: Aliaksandr Patseyenak 
> >
> > Implement system call for bulk retrieveing of opened descriptors
> > in binary form.
> >
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything upto some number
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> >
> > /proc, the only way to learn anything about file descriptors may not be
> > available. There is unavoidable overhead associated with instantiating
> > 3 dentries and 3 inodes and converting integers to strings and back.
> >
> > Benchmark:
> >
> > N=1<<22 times
> > 4 opened descriptors (0, 1, 2, 3)
> > opendir+readdir+closedir /proc/self/fd vs fdmap
> >
> > /proc 8.31 ą 0.37%
> > fdmap 0.32 ą 0.72%
> 
> This doesn't have the semantic problem that pidmap does, but I still
> wonder why this can't be accomplished by adding a new file in /proc.

It can be done in /proc but the point of the exercise is to skip all the
overhead: in this case dcache, 1 descriptor for readdir, conversion
from binary to string.

The problem is much deeper, namely, EIATF people force everyone else
to cater to Unix shells so that they can do read() on them because
Unix shells can't do system calls like real programming languages.
The only way to fix this problem is to ignore Unix shells and start
introducing binary system calls so that normal people aren't forced
to make their programs slower than necessary.

Example: lsof(1) does close() from 3 to 1023 inclusive on startup.
I don't know why but it does it. 1 syscall = 1 us, 1000 syscalls = 1 ms
wasted because all of them will return -EBADF normally. With fdmap(2),
lsof would do 2 fdmap() calls (1 real + 1 to confirm no more descriptors
are available + 0 closes in normal situation). That's 2 syscalls vs 1020.

Obviously, for binary model to work fdmap(2) needs to be complemented
by other system calls all of which will bypass /proc for, say, extracting
/proc/$PID/fd/$i symlink content and fdinfo. Currently, if you use
fdmap(2) you still have to fish in /proc for the rest of the data.


Re: [PATCH 1/2 v2] fdmap(2)

2017-09-25 Thread kbuild test robot
Hi Aliaksandr,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.14-rc2 next-20170922]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Alexey-Dobriyan/fdmap-2/20170926-105150
config: frv-defconfig (attached as .config)
compiler: frv-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=frv 

All warnings (new ones prefixed by >>):

   frv-linux-ld: Warning: size of symbol `sys_setuid' changed from 272 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setregid' changed from 308 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setgid' changed from 212 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setreuid' changed from 456 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setresuid' changed from 424 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_getresuid' changed from 220 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setresgid' changed from 352 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_getresgid' changed from 216 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setfsuid' changed from 248 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_setfsgid' changed from 224 in 
kernel/sys.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_capget' changed from 448 in 
kernel/capability.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_capset' changed from 492 in 
kernel/capability.o to 8 in kernel/sys_ni.o
   frv-linux-ld: Warning: size of symbol `sys_getgroups' changed from 8 in 
kernel/sys_ni.o to 176 in kernel/groups.o
   frv-linux-ld: Warning: size of symbol `sys_setgroups' changed from 8 in 
kernel/sys_ni.o to 392 in kernel/groups.o
   frv-linux-ld: Warning: size of symbol `sched_clock' changed from 56 in 
arch/frv/kernel/time.o to 40 in kernel/sched/clock.o
   frv-linux-ld: Warning: size of symbol `arch_cpu_idle' changed from 64 in 
arch/frv/kernel/process.o to 20 in kernel/sched/idle.o
   frv-linux-ld: Warning: size of symbol `sys_membarrier' changed from 8 in 
kernel/sys_ni.o to 44 in kernel/sched/membarrier.o
   frv-linux-ld: Warning: size of symbol `sys_syslog' changed from 8 in 
kernel/sys_ni.o to 16 in kernel/printk/printk.o
   frv-linux-ld: Warning: size of symbol `early_irq_init' changed from 8 in 
kernel/softirq.o to 240 in kernel/irq/irqdesc.o
   frv-linux-ld: Warning: size of symbol `arch_show_interrupts' changed from 84 
in arch/frv/kernel/irq.o to 8 in kernel/irq/proc.o
   frv-linux-ld: Warning: size of symbol `read_persistent_clock' changed from 
160 in arch/frv/kernel/time.o to 12 in kernel/time/timekeeping.o
   frv-linux-ld: Warning: size of symbol `sys_set_robust_list' changed from 8 
in kernel/sys_ni.o to 48 in kernel/futex.o
   frv-linux-ld: Warning: size of symbol `sys_get_robust_list' changed from 8 
in kernel/sys_ni.o to 276 in kernel/futex.o
   frv-linux-ld: Warning: size of symbol `sys_futex' changed from 8 in 
kernel/sys_ni.o to 400 in kernel/futex.o
   frv-linux-ld: Warning: size of symbol `sys_chown16' changed from 8 in 
kernel/sys_ni.o to 84 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_lchown16' changed from 8 in 
kernel/sys_ni.o to 84 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_fchown16' changed from 8 in 
kernel/sys_ni.o to 84 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setregid16' changed from 8 in 
kernel/sys_ni.o to 84 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setgid16' changed from 8 in 
kernel/sys_ni.o to 48 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setreuid16' changed from 8 in 
kernel/sys_ni.o to 84 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setuid16' changed from 8 in 
kernel/sys_ni.o to 48 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setresuid16' changed from 8 in 
kernel/sys_ni.o to 120 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_getresuid16' changed from 8 in 
kernel/sys_ni.o to 312 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setresgid16' changed from 8 in 
kernel/sys_ni.o to 120 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_getresgid16' changed from 8 in 
kernel/sys_ni.o to 312 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setfsuid16' changed from 8 in 
kernel/sys_ni.o to 48 in kernel/uid16.o
   frv-linux-ld: Warning: size of symbol `sys_setfsgid

Re: [PATCH 1/2 v2] fdmap(2)

2017-09-25 Thread Michael Kerrisk (man-pages)
[Not sure why original author is not in CC; added]

Hello Alexey,

On 09/24/2017 10:06 PM, Alexey Dobriyan wrote:
> From: Aliaksandr Patseyenak 
> 
> Implement system call for bulk retrieveing of opened descriptors
> in binary form.
> 
> Some daemons could use it to reliably close file descriptors
> before starting. Currently they close everything upto some number
> which formally is not reliable. Other natural users are lsof(1) and CRIU
> (although lsof does so much in /proc that the effect is thoroughly buried).
> 
> /proc, the only way to learn anything about file descriptors may not be
> available. There is unavoidable overhead associated with instantiating
> 3 dentries and 3 inodes and converting integers to strings and back.
> 
> Benchmark:
> 
>   N=1<<22 times
>   4 opened descriptors (0, 1, 2, 3)
>   opendir+readdir+closedir /proc/self/fd vs fdmap
> 
>   /proc 8.31 ± 0.37%
>   fdmap 0.32 ± 0.72%

>From the text above, I'm still trying to understand: whose problem
does this solve? I mean, we've lived with the daemon-close-all-files
technique forever (and I'm not sure that performance is really an 
important issue for the daemon case) . And you say that the effect
for lsof(1) will be buried. So, who does this new system call
really help? (Note: I'm not saying don't add the syscall, but from
explanation given here, it's not clear why we should.)

Thanks,

Michael


> FDMAP(2)   Linux Programmer's Manual  FDMAP(2)
> 
> NAME
>fdmap - get open file descriptors of the process
> 
> SYNOPSIS
>long fdmap(pid_t pid, int *fd, unsigned int nfd, int start, int flags);
> 
> DESCRIPTION
>fdmap()  writes  open  file  descriptors  of the process into buffer fd
>starting from the start descriptor. At most nfd elements  are  written.
>flags argument is reserved and must be zero.
> 
>If pid is zero, syscall will work with the current process.
> 
> RETURN VALUE
>On success, number of descriptors written is returned.  On error, -1 is
>returned, and errno is set appropriately.
> 
> ERRORS
>ESRCH  No such process.
> 
>EACCES Permission denied.
> 
>EFAULT Invalid fd pointer.
> 
>EINVAL Negative start argument.
> 
> NOTES
>Glibc does not provide a wrapper for these system call; call  it  using
>syscall(2).
> 
> EXAMPLE
>The program below demonstrates fdmap() usage.
> 
>$ ./a.out $$
>0 1 2 255
> 
>$ ./a.out 420 1 2 42
>1023
> 
>Program source
> 
>#include 
>#include 
>#include 
> 
>static inline long fdmap(int pid, int *fd, unsigned int nfd, unsigned 
> int start, int flags)
>{
> register long r10 asm ("r10") = start;
> register long r8 asm ("r8") = flags;
> long rv;
> asm volatile (
>  "syscall"
>  : "=a" (rv)
>  : "0" (333), "D" (pid), "S" (fd), "d" (nfd), "r" (r10), "r" 
> (r8)
>  : "rcx", "r11", "cc", "memory"
> );
> return rv;
>}
> 
>int main(int argc, char *argv[])
>{
> pid_t pid;
> int fd[4];
> unsigned int start;
> int n;
> 
> pid = 0;
> if (argc > 1)
>  pid = atoi(argv[1]);
> 
> start = 0;
> while ((n = fdmap(pid, fd, sizeof(fd)/sizeof(fd[0]), start, 0)) > 
> 0) {
>  unsigned int i;
> 
>  for (i = 0; i < n; i++)
>   printf("%u ", fd[i]);
>  printf("\n");
> 
>  start = fd[n - 1] + 1;
> }
> return 0;
>}
> 
> Linux 2017-09-21  FDMAP(2)
> 
> Changelog:
> 
>   CONFIG_PIDMAP option
>   manpage
> 
> 
> Signed-off-by: Aliaksandr Patseyenak 
> Signed-off-by: Alexey Dobriyan 
> 
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/Makefile|   2 +
>  fs/fdmap.c | 105 
>  include/linux/syscalls.h   |   2 +
>  init/Kconfig   |   7 ++
>  kernel/sys_ni.c|   2 +
>  tools/testing/selftests/Makefile   |   1 +
>  tools/testing/selftests/fdmap/.gitignore   |   1 +
>  tools/testing/selftests/fdmap/Makefile |   7 ++
>  tools/testing/selftests/fdmap/fdmap.c  | 112 +
>  tools/testing/selftests/fdmap/fdmap.h  |  12 +++
>  tools/testing/selftests/fdmap/fdmap_test.c | 153 
> +
>  12 files changed, 405 insertions(+)
>  create mode 100644 fs/fdmap.c
>  create mode 100644 tools/testing/selftests/fdmap/.gitignore
>  create mode 100644 tools/testing/selftests/fdmap/Makefile
>  create mode 100644 tools/test

Re: [PATCH 1/2 v2] fdmap(2)

2017-09-24 Thread Andy Lutomirski
On Sun, Sep 24, 2017 at 1:06 PM, Alexey Dobriyan  wrote:
> From: Aliaksandr Patseyenak 
>
> Implement system call for bulk retrieveing of opened descriptors
> in binary form.
>
> Some daemons could use it to reliably close file descriptors
> before starting. Currently they close everything upto some number
> which formally is not reliable. Other natural users are lsof(1) and CRIU
> (although lsof does so much in /proc that the effect is thoroughly buried).
>
> /proc, the only way to learn anything about file descriptors may not be
> available. There is unavoidable overhead associated with instantiating
> 3 dentries and 3 inodes and converting integers to strings and back.
>
> Benchmark:
>
> N=1<<22 times
> 4 opened descriptors (0, 1, 2, 3)
> opendir+readdir+closedir /proc/self/fd vs fdmap
>
> /proc 8.31 ą 0.37%
> fdmap 0.32 ą 0.72%

This doesn't have the semantic problem that pidmap does, but I still
wonder why this can't be accomplished by adding a new file in /proc.


[PATCH 1/2 v2] fdmap(2)

2017-09-24 Thread Alexey Dobriyan
From: Aliaksandr Patseyenak 

Implement system call for bulk retrieveing of opened descriptors
in binary form.

Some daemons could use it to reliably close file descriptors
before starting. Currently they close everything upto some number
which formally is not reliable. Other natural users are lsof(1) and CRIU
(although lsof does so much in /proc that the effect is thoroughly buried).

/proc, the only way to learn anything about file descriptors may not be
available. There is unavoidable overhead associated with instantiating
3 dentries and 3 inodes and converting integers to strings and back.

Benchmark:

N=1<<22 times
4 opened descriptors (0, 1, 2, 3)
opendir+readdir+closedir /proc/self/fd vs fdmap

/proc 8.31 ± 0.37%
fdmap 0.32 ± 0.72%


FDMAP(2)   Linux Programmer's Manual  FDMAP(2)

NAME
   fdmap - get open file descriptors of the process

SYNOPSIS
   long fdmap(pid_t pid, int *fd, unsigned int nfd, int start, int flags);

DESCRIPTION
   fdmap()  writes  open  file  descriptors  of the process into buffer fd
   starting from the start descriptor. At most nfd elements  are  written.
   flags argument is reserved and must be zero.

   If pid is zero, syscall will work with the current process.

RETURN VALUE
   On success, number of descriptors written is returned.  On error, -1 is
   returned, and errno is set appropriately.

ERRORS
   ESRCH  No such process.

   EACCES Permission denied.

   EFAULT Invalid fd pointer.

   EINVAL Negative start argument.

NOTES
   Glibc does not provide a wrapper for these system call; call  it  using
   syscall(2).

EXAMPLE
   The program below demonstrates fdmap() usage.

   $ ./a.out $$
   0 1 2 255

   $ ./a.out 42
   #include 
   #include 

   static inline long fdmap(int pid, int *fd, unsigned int nfd, unsigned 
int start, int flags)
   {
register long r10 asm ("r10") = start;
register long r8 asm ("r8") = flags;
long rv;
asm volatile (
 "syscall"
 : "=a" (rv)
 : "0" (333), "D" (pid), "S" (fd), "d" (nfd), "r" (r10), "r" 
(r8)
 : "rcx", "r11", "cc", "memory"
);
return rv;
   }

   int main(int argc, char *argv[])
   {
pid_t pid;
int fd[4];
unsigned int start;
int n;

pid = 0;
if (argc > 1)
 pid = atoi(argv[1]);

start = 0;
while ((n = fdmap(pid, fd, sizeof(fd)/sizeof(fd[0]), start, 0)) > 
0) {
 unsigned int i;

 for (i = 0; i < n; i++)
  printf("%u ", fd[i]);
 printf("\n");

 start = fd[n - 1] + 1;
}
return 0;
   }

Linux 2017-09-21  FDMAP(2)

Changelog:

CONFIG_PIDMAP option
manpage


Signed-off-by: Aliaksandr Patseyenak 
Signed-off-by: Alexey Dobriyan 

---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/Makefile|   2 +
 fs/fdmap.c | 105 
 include/linux/syscalls.h   |   2 +
 init/Kconfig   |   7 ++
 kernel/sys_ni.c|   2 +
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/fdmap/.gitignore   |   1 +
 tools/testing/selftests/fdmap/Makefile |   7 ++
 tools/testing/selftests/fdmap/fdmap.c  | 112 +
 tools/testing/selftests/fdmap/fdmap.h  |  12 +++
 tools/testing/selftests/fdmap/fdmap_test.c | 153 +
 12 files changed, 405 insertions(+)
 create mode 100644 fs/fdmap.c
 create mode 100644 tools/testing/selftests/fdmap/.gitignore
 create mode 100644 tools/testing/selftests/fdmap/Makefile
 create mode 100644 tools/testing/selftests/fdmap/fdmap.c
 create mode 100644 tools/testing/selftests/fdmap/fdmap.h
 create mode 100644 tools/testing/selftests/fdmap/fdmap_test.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..9bfe5f79674f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330common  pkey_alloc  sys_pkey_alloc
 331common  pkey_free   sys_pkey_free
 332common  statx   sys_statx
+333common  fdmap   sys_fdmap
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..27476a66c18e 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,6 +13,8 @@ obj-y :=  open.o read_write.o file_table.o super.o \
pnode.o splice.o sync.o utimes.o \