Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Davidlohr Bueso

On Wed, 07 Nov 2018, Davidlohr Bueso wrote:

I have not looked at how filesystems tune the batch size, but it would 
certainly be worth
looking into methinks.


nm this part, percpu_counter_batch is not tunable. It would
still probably be acceptable (famous last words) to at least
move the bottleneck in question to percpu_counter api.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Davidlohr Bueso

On Thu, 08 Nov 2018, Dave Chinner wrote:


If only we had percpu counters that had a fixed, extremely low read
overhead that doesn't care about the number of CPUs in the
machine

Oh, wait, we do: percpu_counters.[ch].

This all seems like a counter implementation deficiency to me, not
an interface problem...


Yeah fair point, as long as we can sacrifice accuracy by replacing
kernel_stat -- or maybe just replace the hard irq stats, which I
still think only accounts for 1% of all stat users. I have not looked
at how filesystems tune the batch size, but it would certainly be worth
looking into methinks.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Dave Chinner
On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
>  wrote:
> > On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  
> > wrote:
> >
> >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> >> > This patch introduces a new /proc/stat2 file that is identical to the
> >> > regular 'stat' except that it zeroes all hard irq statistics. The new
> >> > file is a drop in replacement to stat for users that need performance.
> >>
> >> For a while now, I've been thinking over ways to improve the
> >> performance of collecting various bits of kernel information. I don't
> >> think that a proliferation of special-purpose named bag-of-fields file
> >> variants is the right answer, because even if you add a few info-file
> >> variants, you're still left with a situation where a given file
> >> provides a particular caller with too little or too much information.
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> 
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

This doesn't solve the problem.

The problem is that this specific implementation of per-cpu
counters need to be summed on every read. Hence when you have a huge
number of CPUs each per-cpu iteration that takes a substantial
amount of time.

If only we had percpu counters that had a fixed, extremely low read
overhead that doesn't care about the number of CPUs in the
machine

Oh, wait, we do: percpu_counters.[ch].

This all seems like a counter implementation deficiency to me, not
an interface problem...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Vito Caputo
On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
>  wrote:
> > On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  
> > wrote:
> >
> >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> >> > This patch introduces a new /proc/stat2 file that is identical to the
> >> > regular 'stat' except that it zeroes all hard irq statistics. The new
> >> > file is a drop in replacement to stat for users that need performance.
> >>
> >> For a while now, I've been thinking over ways to improve the
> >> performance of collecting various bits of kernel information. I don't
> >> think that a proliferation of special-purpose named bag-of-fields file
> >> variants is the right answer, because even if you add a few info-file
> >> variants, you're still left with a situation where a given file
> >> provides a particular caller with too little or too much information.
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> 
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.
> 

I like the sysfs approach to organizing the data, and have wanted
fd-batching IO syscalls in other circumstances anyways, so I think
there's a good possibility of something along those lines getting added
eventually.

At a past employer I had written some backup software which had to
reassemble versioned files from chains of reverse differentials (think
rdiff-backup).  I had all the information needed to quickly construct a
multi-fd iovec to supply to a single batched readv syscall when
servicing versioned reads from a FUSE mount that involved a potentially
long chain of diffs, but no such syscall exists.  The more
differentials, the more fragmented the operation tended to be, requiring
increasing numbers of smaller reads across more files to reconstruct the
buffer.

The same thing would be useful for making reads from large numbers of
sysfs files less costly.  I presume proposing such a generally
applicable VFS API addition would meet less resistance than specialized
proc interfaces, perhaps naively :).

Regards,
Vito Caputo


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Waiman Long
On 11/06/2018 10:32 PM, Davidlohr Bueso wrote:
> On Tue, 06 Nov 2018, Andrew Morton wrote:
>
>> It would be interesting to know precisely which stat fields the
>> database-which-shall-not-be-named is looking for.  Then we could cook
>> up a very whizzy way of getting at the info.
>
> The ctxt field, afaik. In any case they have been able to work around
> the bottleneck. I'm not sure if that is the case for Waiman, however.
>

In my case, the customers just complain about the slowdown in reading
/proc/stat on some platforms vs. the others because some had many more
interrupt lines than the others. They didn't specifically call out what
they were looking at.

>>
>> A downside of the stat2 approach is that applications will need to be
>> rebuilt.  And hopefully when people do this, they'll open
>> "/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
>> so they won't need a rebuild when we add /proc/stat3!
>>
>> A /proc/change-how-stat-works tunable would avoid the need to rebuild
>> applications.  But if a system still has some applications which want
>> the irq info then that doesn't work.
>>
>> It's all very sad, really.
>>
>> btw,
>>
>>> +The stat2 file acts as a performance alternative to /proc/stat for
>>> workloads
>>> +and systems that care and are under heavy irq load. In order to to
>>> be completely
>>> +compatible, /proc/stat and /proc/stat2 are identical with the
>>> exception that the
>>> +later will show 0 for any (hard)irq-related fields. This refers
>>> particularly
>>
>> "latter"
>>
>>> +to the "intr" line and 'irq' column for that aggregate in the cpu
>>> line.
>>
>> btw2, please quantify "poor performance".  That helps us determine how
>> much we care about all of this!
>
> Up to a quarter of a second is what was reported as being spent every
> time
> /proc/stat is used. This is with 1k cores and 4k interrupts.
>
> Thanks,
> Davidlohr

Yes, the time spent will scale more or less linearly with the # of cores
and # of interrupts.

Thanks,
Longman



Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Daniel Colascione
On Wed, Nov 7, 2018 at 3:54 PM, Miklos Szeredi  wrote:
> On Wed, Nov 7, 2018 at 4:42 PM, Daniel Colascione  wrote:
>
>> configuration!" is something I've heard more than once. Who's to say
>> that sysfs is for exposing /proc/pid/stat,
>
> Patch is about /proc/stat not /proc/PID/stat.  Please revise your
> arguments based on that.

My argument stands. The considerations I'm discussing apply to both
process-specific and system-global information files.


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Miklos Szeredi
On Wed, Nov 7, 2018 at 4:42 PM, Daniel Colascione  wrote:

> configuration!" is something I've heard more than once. Who's to say
> that sysfs is for exposing /proc/pid/stat,

Patch is about /proc/stat not /proc/PID/stat.  Please revise your
arguments based on that.

Thanks,
Miklos


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Daniel Colascione
On Wed, Nov 7, 2018 at 10:03 AM, Miklos Szeredi  wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
>  wrote:
>> On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  
>> wrote:
>>
>>> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
>>> > This patch introduces a new /proc/stat2 file that is identical to the
>>> > regular 'stat' except that it zeroes all hard irq statistics. The new
>>> > file is a drop in replacement to stat for users that need performance.
>>>
>>> For a while now, I've been thinking over ways to improve the
>>> performance of collecting various bits of kernel information. I don't
>>> think that a proliferation of special-purpose named bag-of-fields file
>>> variants is the right answer, because even if you add a few info-file
>>> variants, you're still left with a situation where a given file
>>> provides a particular caller with too little or too much information.
>>> I'd much rather move to a model in which userspace *explicitly* tells
>>> the kernel which fields it wants, with the kernel replying with just
>>> those particular fields, maybe in their raw binary representations.
>>> The ASCII-text bag-of-everything files would remain available for
>>> ad-hoc and non-performance critical use, but programs that cared about
>>> performance would have an efficient bypass. One concrete approach is
>>> to let users open up today's proc files and, instead of read(2)ing a
>>> text blob, use an ioctl to retrieve specified and targeted information
>>> of the sort that would normally be encoded in the text blob. Because
>>> callers would open the same file when using either the text or binary
>>> interfaces, little would have to change, and it'd be easy to implement
>>> fallbacks when a particular system doesn't support a particular
>>> fast-path ioctl.
>
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

First of all, is sysfs even right? Some people, for whatever reason,
are extremely particular about the purposes of various virtual
filesystems. "No, sysfs is for exposing kernel objects, not
configuration!" is something I've heard more than once. Who's to say
that sysfs is for exposing /proc/pid/stat, which isn't a "kernel
object" itself? (A process is not its struct task.) More generally,
objections about APIs rooted in arcane kernel-internal considerations
about the purposes of various virtual filesystems --- procfs, sysfs,
debugfs, configfs --- makes the userspace API worse, because it
enshrines implementation details (is this thing a kobject or not?) in
public API. If I had my way, we'd have continued putting *everything*
in procfs and just make procfs the "I want stuff from the kernel" API.
Nobody in userspace cares about these filesystem divisions.

Second, slurping from a sysfs-style setup in which there's one file
per piece of information creates massive overhead, because there's
currently no way to open multiple paths with one system call and no
way to read from multiple FDs with one system call. If you want this
kind of setup to work, you need some kind of batched openat-and-read
system call mechanism. I think a simple "get information from this
procfs FD" system call --- something like statx --- is both cleaner
and more efficient. Plus, without a batch operation, there's no way to
achieve atomicity. It's perfectly reasonable for userspace to request
some bits of information about a process want these bits to be
consistent with each other. Now, such an API would be good to add, but
it's not enough, since a generic batched openat-and-read would still
have to go through VFS, create struct files, (probably) encode to
ASCII, and so on. Why should any system pay to do that much work when
the fields anyone might want could be obtained with a simple
copy_to_user?

Third, and finally, a sysfs-style tree for processes doesn't currently
exist. Would you propose having *two* *different* representations of
the process list as virtual filesystems? That's another pointless
exposure of internal kernel divisions in the user API. We already have
procfs. Let's just make it better.


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Miklos Szeredi
On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
 wrote:
> On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  
> wrote:
>
>> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
>> > This patch introduces a new /proc/stat2 file that is identical to the
>> > regular 'stat' except that it zeroes all hard irq statistics. The new
>> > file is a drop in replacement to stat for users that need performance.
>>
>> For a while now, I've been thinking over ways to improve the
>> performance of collecting various bits of kernel information. I don't
>> think that a proliferation of special-purpose named bag-of-fields file
>> variants is the right answer, because even if you add a few info-file
>> variants, you're still left with a situation where a given file
>> provides a particular caller with too little or too much information.
>> I'd much rather move to a model in which userspace *explicitly* tells
>> the kernel which fields it wants, with the kernel replying with just
>> those particular fields, maybe in their raw binary representations.
>> The ASCII-text bag-of-everything files would remain available for
>> ad-hoc and non-performance critical use, but programs that cared about
>> performance would have an efficient bypass. One concrete approach is
>> to let users open up today's proc files and, instead of read(2)ing a
>> text blob, use an ioctl to retrieve specified and targeted information
>> of the sort that would normally be encoded in the text blob. Because
>> callers would open the same file when using either the text or binary
>> interfaces, little would have to change, and it'd be easy to implement
>> fallbacks when a particular system doesn't support a particular
>> fast-path ioctl.

Please.   Sysfs, with the one value per file rule, was created exactly
for the purpose of eliminating these sort of problems with procfs.  So
instead of inventing special purpose interfaces for proc, just make
the info available in sysfs, if not already available.

Thanks,
Miklos


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-06 Thread Davidlohr Bueso

On Tue, 06 Nov 2018, Andrew Morton wrote:


It would be interesting to know precisely which stat fields the
database-which-shall-not-be-named is looking for.  Then we could cook
up a very whizzy way of getting at the info.


The ctxt field, afaik. In any case they have been able to work around
the bottleneck. I'm not sure if that is the case for Waiman, however.



A downside of the stat2 approach is that applications will need to be
rebuilt.  And hopefully when people do this, they'll open
"/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
so they won't need a rebuild when we add /proc/stat3!

A /proc/change-how-stat-works tunable would avoid the need to rebuild
applications.  But if a system still has some applications which want
the irq info then that doesn't work.

It's all very sad, really.

btw,


+The stat2 file acts as a performance alternative to /proc/stat for workloads
+and systems that care and are under heavy irq load. In order to to be 
completely
+compatible, /proc/stat and /proc/stat2 are identical with the exception that 
the
+later will show 0 for any (hard)irq-related fields. This refers particularly


"latter"


+to the "intr" line and 'irq' column for that aggregate in the cpu line.


btw2, please quantify "poor performance".  That helps us determine how
much we care about all of this!


Up to a quarter of a second is what was reported as being spent every time
/proc/stat is used. This is with 1k cores and 4k interrupts.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-06 Thread Andrew Morton
On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  wrote:

> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> > This patch introduces a new /proc/stat2 file that is identical to the
> > regular 'stat' except that it zeroes all hard irq statistics. The new
> > file is a drop in replacement to stat for users that need performance.
> 
> For a while now, I've been thinking over ways to improve the
> performance of collecting various bits of kernel information. I don't
> think that a proliferation of special-purpose named bag-of-fields file
> variants is the right answer, because even if you add a few info-file
> variants, you're still left with a situation where a given file
> provides a particular caller with too little or too much information.
> I'd much rather move to a model in which userspace *explicitly* tells
> the kernel which fields it wants, with the kernel replying with just
> those particular fields, maybe in their raw binary representations.
> The ASCII-text bag-of-everything files would remain available for
> ad-hoc and non-performance critical use, but programs that cared about
> performance would have an efficient bypass. One concrete approach is
> to let users open up today's proc files and, instead of read(2)ing a
> text blob, use an ioctl to retrieve specified and targeted information
> of the sort that would normally be encoded in the text blob. Because
> callers would open the same file when using either the text or binary
> interfaces, little would have to change, and it'd be easy to implement
> fallbacks when a particular system doesn't support a particular
> fast-path ioctl.

Yup.  There are better ways of getting information out of the kernel,
to say the least.

It would be interesting to know precisely which stat fields the
database-which-shall-not-be-named is looking for.  Then we could cook
up a very whizzy way of getting at the info.

A downside of the stat2 approach is that applications will need to be
rebuilt.  And hopefully when people do this, they'll open
"/etc/my-app-name/symlink-to-proc-stat" (or use per-application config)
so they won't need a rebuild when we add /proc/stat3!

A /proc/change-how-stat-works tunable would avoid the need to rebuild
applications.  But if a system still has some applications which want
the irq info then that doesn't work.

It's all very sad, really.

btw,

> +The stat2 file acts as a performance alternative to /proc/stat for workloads
> +and systems that care and are under heavy irq load. In order to to be 
> completely
> +compatible, /proc/stat and /proc/stat2 are identical with the exception that 
> the
> +later will show 0 for any (hard)irq-related fields. This refers particularly

"latter"

> +to the "intr" line and 'irq' column for that aggregate in the cpu line.

btw2, please quantify "poor performance".  That helps us determine how
much we care about all of this!


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-30 Thread Davidlohr Bueso

On Tue, 30 Oct 2018, Vito Caputo wrote:


If you create /proc/stat2 to omit interrupts, do we then create
/proc/stat3 to omit CPUs when just interrupts are of interest to the
application running on a 256-cpu machine?


Be real, this is a bogus argument. As mentioned, stat2 is named as such
because it's a replacement for a better common case, not some random increment.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-30 Thread Vito Caputo
On Tue, Oct 30, 2018 at 11:57:56AM -0700, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Vito Caputo wrote:
> 
> > I'm definitely not in favor of just adding another stat file that is the
> > same format as the existing one with the intrs zeroed out.  It's a dirty
> > hack; fine for your local needs but too gross for upstream IMHO.
> 
> I suspect very few users of /proc/stat actually use the irq fields in the
> first place. So the common case ends up doing unnecessary operations. The
> stat2 approach is not perfect, but I think it's the best approach so far.
> This sort of renaming is not uncommon when we cannot break userspace, and
> its not like procfs is not already far contaminated already.
> 
> There are not enough users that care about this stuff, afaik. What you
> suggest sounds like a lot of over-engineering.
> 

What you suggest sounds like a kludge with zero engineering at all.

My suggestion might be stupid, insofar as the same thing can be achieved
using ioctls without assigning surprising semantics to an existing vfs
api.  But I think the spirit of the suggestion is a reasonable
compromise, if /proc/stat is not a deprecated interface and it's
too difficult to make everything it collects sufficiently efficient on
any size machine.

If you create /proc/stat2 to omit interrupts, do we then create
/proc/stat3 to omit CPUs when just interrupts are of interest to the
application running on a 256-cpu machine?

Furthermore, your rationale of procfs already being contaminated is an
active embrace of the broken windows effect.  If you recognize something
is flawed, it's cause to work harder to improve the situation when
working in the flawed area, not worsen it.

Regards,
Vito Caputo


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-30 Thread Davidlohr Bueso

On Mon, 29 Oct 2018, Vito Caputo wrote:


I'm definitely not in favor of just adding another stat file that is the
same format as the existing one with the intrs zeroed out.  It's a dirty
hack; fine for your local needs but too gross for upstream IMHO.


I suspect very few users of /proc/stat actually use the irq fields in the
first place. So the common case ends up doing unnecessary operations. The
stat2 approach is not perfect, but I think it's the best approach so far.
This sort of renaming is not uncommon when we cannot break userspace, and
its not like procfs is not already far contaminated already.

There are not enough users that care about this stuff, afaik. What you
suggest sounds like a lot of over-engineering.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Vito Caputo
On Mon, Oct 29, 2018 at 11:04:45PM +, Daniel Colascione wrote:
> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> > This patch introduces a new /proc/stat2 file that is identical to the
> > regular 'stat' except that it zeroes all hard irq statistics. The new
> > file is a drop in replacement to stat for users that need performance.
> 
> For a while now, I've been thinking over ways to improve the
> performance of collecting various bits of kernel information. I don't
> think that a proliferation of special-purpose named bag-of-fields file
> variants is the right answer, because even if you add a few info-file
> variants, you're still left with a situation where a given file
> provides a particular caller with too little or too much information.
> I'd much rather move to a model in which userspace *explicitly* tells
> the kernel which fields it wants, with the kernel replying with just
> those particular fields, maybe in their raw binary representations.
> The ASCII-text bag-of-everything files would remain available for
> ad-hoc and non-performance critical use, but programs that cared about
> performance would have an efficient bypass. One concrete approach is
> to let users open up today's proc files and, instead of read(2)ing a
> text blob, use an ioctl to retrieve specified and targeted information
> of the sort that would normally be encoded in the text blob. Because
> callers would open the same file when using either the text or binary
> interfaces, little would have to change, and it'd be easy to implement
> fallbacks when a particular system doesn't support a particular
> fast-path ioctl.


We have two extremes of granularity in the /proc and /sys virtual
filesystems today:

On procfs there's these legacy files which aggregate loosely-related
system information, and in cases where you actually want most of what's
provided, it's a nice optimization because you can sample it all in a
single pread() call.

On sysfs the granularity is much finer with it being fairly common to
find a file-per-datum.  This has other advantages, like not needing to
parse snowflake formats which sometimes varied across kernel versions
like in procfs, or needing to burden the kernel to produce more
information than necessary.

But anyone who has written tools trying to sample large subsets of the
granular information in sysfs at a high rate will know how quickly it
becomes rather costly in terms of system calls.

The last time I went down this path, I wished there were a system call
like readv() which accepted a vector a new iovec type specifying an fd.

Then the sysfs model could be made a more efficient by coalescing all
the required read syscalls into a single megaread bundling all the
relevant fds that are simply kept open and reused.

If we had such a readv() variant, the sysfs granular model could be used
to granularly expose all the information we currently expose in /proc,
while still being relatively efficient in terms of system calls per
sample.  Sure you still have to lookup and open all the files of
interest, but that only needs to occur once at initialization.

Regards,
Vito Caputo


Re: Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Alexey Dobriyan
On Mon, Oct 29, 2018 at 11:40:47PM +, Daniel Colascione wrote:
> On Mon, Oct 29, 2018 at 11:34 PM, Alexey Dobriyan  wrote:
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> >
> > You've just reinvented systems calls.
> 
> I don't know why you say so. There are important benefits that come
> from using an ioctl on a proc file FD instead of a plain system call.
> Procfs files have file permissions,auditing, SCM_RIGHTS-ability, PID
> race immunity, and other things that you wouldn't get from a plain
> "get this information about this PID" system call.

This whole thread started because /proc/stat is slow and every number in
/proc/stat is system global.

If you continue adding stuff to /proc, one day someone will notice that
core VFS adds considerable overhead, at this point there is nothing
anyone could do.

I'd strongly advise to look at what this DB actually needs and deliver
just that.

Very little of other things apply to /proc/stat:
* system call auditing exists,
* /proc/stat is world readable and continues to be so,
* thus passing descriptor around is pretty useless,
* $PID race doesn't apply.

Additionally passing descriptors feels like party trick.
I suspect that's not how people use statistics in /proc: they run
processes and one priviledged enough monitoring daemon collects data,
otherwise userspace needs to cooperate with monitoring userspace
which of course doesn't happen.

PID race is solved by giving out descriptors which pin "struct pid".
Which is how the race is solved currently: dentry pins inode, inode
pins "struct pid".


Re: Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Daniel Colascione
On Mon, Oct 29, 2018 at 11:34 PM, Alexey Dobriyan  wrote:
>> I'd much rather move to a model in which userspace *explicitly* tells
>> the kernel which fields it wants, with the kernel replying with just
>> those particular fields, maybe in their raw binary representations.
>> The ASCII-text bag-of-everything files would remain available for
>> ad-hoc and non-performance critical use, but programs that cared about
>> performance would have an efficient bypass. One concrete approach is
>> to let users open up today's proc files and, instead of read(2)ing a
>> text blob, use an ioctl to retrieve specified and targeted information
>> of the sort that would normally be encoded in the text blob. Because
>> callers would open the same file when using either the text or binary
>> interfaces, little would have to change, and it'd be easy to implement
>> fallbacks when a particular system doesn't support a particular
>> fast-path ioctl.
>
> You've just reinvented systems calls.

I don't know why you say so. There are important benefits that come
from using an ioctl on a proc file FD instead of a plain system call.
Procfs files have file permissions, auditing, SCM_RIGHTS-ability, PID
race immunity, and other things that you wouldn't get from a plain
"get this information about this PID" system call.


Re: Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Alexey Dobriyan
> I'd much rather move to a model in which userspace *explicitly* tells
> the kernel which fields it wants, with the kernel replying with just
> those particular fields, maybe in their raw binary representations.
> The ASCII-text bag-of-everything files would remain available for
> ad-hoc and non-performance critical use, but programs that cared about
> performance would have an efficient bypass. One concrete approach is
> to let users open up today's proc files and, instead of read(2)ing a
> text blob, use an ioctl to retrieve specified and targeted information
> of the sort that would normally be encoded in the text blob. Because
> callers would open the same file when using either the text or binary
> interfaces, little would have to change, and it'd be easy to implement
> fallbacks when a particular system doesn't support a particular
> fast-path ioctl.

You've just reinvented systems calls.

I suspect the DB in question cares about CPU related numbers and nothing
else which can be nicely split from the rest of /proc/stat.


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Daniel Colascione
On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.

For a while now, I've been thinking over ways to improve the
performance of collecting various bits of kernel information. I don't
think that a proliferation of special-purpose named bag-of-fields file
variants is the right answer, because even if you add a few info-file
variants, you're still left with a situation where a given file
provides a particular caller with too little or too much information.
I'd much rather move to a model in which userspace *explicitly* tells
the kernel which fields it wants, with the kernel replying with just
those particular fields, maybe in their raw binary representations.
The ASCII-text bag-of-everything files would remain available for
ad-hoc and non-performance critical use, but programs that cared about
performance would have an efficient bypass. One concrete approach is
to let users open up today's proc files and, instead of read(2)ing a
text blob, use an ioctl to retrieve specified and targeted information
of the sort that would normally be encoded in the text blob. Because
callers would open the same file when using either the text or binary
interfaces, little would have to change, and it'd be easy to implement
fallbacks when a particular system doesn't support a particular
fast-path ioctl.


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Vito Caputo
On Mon, Oct 29, 2018 at 05:35:15PM -0400, Waiman Long wrote:
> On 10/29/2018 05:23 PM, Vito Caputo wrote:
> > On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
> >> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> >>> On Mon, 29 Oct 2018, Waiman Long wrote:
> >>>
>  BTW, since you are making stat2 compatible with stat, will that be
>  easier from the user API perspective if we use a sysctl parameter to
>  turn on and off IRQs reporting for /proc/stat, for example?
> >>> For one /proc/stat is also common for debugging envs (ie: performance)
> >>> and I fear that if a tunnable modifies the behavior of the output, we
> >>> it might never be usable again (at least not without having users also
> >>> now consider the systctl parameter). Making it dynamic I think is not
> >>> worth it.
> >>>
> >>> Thanks,
> >>> Davidlohr
> >> This is just a matter if it is easier for users to modify their code to
> >> use /proc/stat2 or turning on a sysctl parameter. Again, this will
> >> certainly depend on the circumstances.
> >>
> > I wonder if it makes sense to introduce a more general mechanism for
> > toggling subfields in proc files.  Extended attributes could probably be
> > abused to key the subfields, write a 1 or 0 to well-known names for
> > toggling them on a per-fd basis via fsetxattr.
> >
> > For this particular case the program would just have to add code like:
> >
> > int zero = 0;
> > fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);
> >
> > Just putting it out there.  I've certainly wanted an ability to noop
> > fields before where I was polling proc frequently and skipping the bulk
> > of what was there but syscpu was still rather high.
> >
> > I'm definitely not in favor of just adding another stat file that is the
> > same format as the existing one with the intrs zeroed out.  It's a dirty
> > hack; fine for your local needs but too gross for upstream IMHO.
> >
> > Regards,
> > Vito Caputo
> 
> Does procfs allow extended attributes? I am not sure if using extended
> attributes is a usual practice for doing this kind of control on a
> procfs file.
> 

I'm not aware of any such usage, but I didn't dig into the code to see if
there were already conflicting xattr users.

If this did turn out to be a reasonable approach, it would probably be
wise to prefix the key strings with a namespace in case future uses come
up.  Like "filter:intr" or something along those lines.

WRT procfs support for xattr, if it's not already implemented it's not
difficult to add.  The important factor is this utilizes an existing vfs
api, there's nothing about procfs prohibiting xattr handling.

Regards,
Vito Caputo


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Waiman Long
On 10/29/2018 05:23 PM, Vito Caputo wrote:
> On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
>> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
>>> On Mon, 29 Oct 2018, Waiman Long wrote:
>>>
 BTW, since you are making stat2 compatible with stat, will that be
 easier from the user API perspective if we use a sysctl parameter to
 turn on and off IRQs reporting for /proc/stat, for example?
>>> For one /proc/stat is also common for debugging envs (ie: performance)
>>> and I fear that if a tunnable modifies the behavior of the output, we
>>> it might never be usable again (at least not without having users also
>>> now consider the systctl parameter). Making it dynamic I think is not
>>> worth it.
>>>
>>> Thanks,
>>> Davidlohr
>> This is just a matter if it is easier for users to modify their code to
>> use /proc/stat2 or turning on a sysctl parameter. Again, this will
>> certainly depend on the circumstances.
>>
> I wonder if it makes sense to introduce a more general mechanism for
> toggling subfields in proc files.  Extended attributes could probably be
> abused to key the subfields, write a 1 or 0 to well-known names for
> toggling them on a per-fd basis via fsetxattr.
>
> For this particular case the program would just have to add code like:
>
> int   zero = 0;
> fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);
>
> Just putting it out there.  I've certainly wanted an ability to noop
> fields before where I was polling proc frequently and skipping the bulk
> of what was there but syscpu was still rather high.
>
> I'm definitely not in favor of just adding another stat file that is the
> same format as the existing one with the intrs zeroed out.  It's a dirty
> hack; fine for your local needs but too gross for upstream IMHO.
>
> Regards,
> Vito Caputo

Does procfs allow extended attributes? I am not sure if using extended
attributes is a usual practice for doing this kind of control on a
procfs file.

Cheers,
Longman



Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Vito Caputo
On Mon, Oct 29, 2018 at 04:59:03PM -0400, Waiman Long wrote:
> On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> > On Mon, 29 Oct 2018, Waiman Long wrote:
> >
> >> BTW, since you are making stat2 compatible with stat, will that be
> >> easier from the user API perspective if we use a sysctl parameter to
> >> turn on and off IRQs reporting for /proc/stat, for example?
> >
> > For one /proc/stat is also common for debugging envs (ie: performance)
> > and I fear that if a tunnable modifies the behavior of the output, we
> > it might never be usable again (at least not without having users also
> > now consider the systctl parameter). Making it dynamic I think is not
> > worth it.
> >
> > Thanks,
> > Davidlohr
> 
> This is just a matter if it is easier for users to modify their code to
> use /proc/stat2 or turning on a sysctl parameter. Again, this will
> certainly depend on the circumstances.
> 

I wonder if it makes sense to introduce a more general mechanism for
toggling subfields in proc files.  Extended attributes could probably be
abused to key the subfields, write a 1 or 0 to well-known names for
toggling them on a per-fd basis via fsetxattr.

For this particular case the program would just have to add code like:

int zero = 0;
fsetxattr(proc_stat_fd, "intr", &zero, sizeof(zero), XATTR_REPLACE);

Just putting it out there.  I've certainly wanted an ability to noop
fields before where I was polling proc frequently and skipping the bulk
of what was there but syscpu was still rather high.

I'm definitely not in favor of just adding another stat file that is the
same format as the existing one with the intrs zeroed out.  It's a dirty
hack; fine for your local needs but too gross for upstream IMHO.

Regards,
Vito Caputo


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Waiman Long
On 10/29/2018 03:25 PM, Davidlohr Bueso wrote:
> A recent report from a large database vendor which I shall not name
> shows concerns about poor performance when consuming /proc/stat info.
> Particularly  kstat_irq() pops up in the profiles and most time is
> being spent there. The overall system is under a lot of irqs and
> almost 1k cores, thus this comes to little surprise.
>
> Granted that procfs in general is not known for its performance,
> nor designed for it, for that matter. Some users, however may be able
> to overcome this performance limitation, some not. Therefore it isn't
> bad having a kernel option for users that don't want any hard irq info
> -- and care enough about this.
>
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.
>
> The stat file is not touched, of course -- this was also previously
> suggested by Waiman:
> https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-long...@redhat.com/
>
> Signed-off-by: Davidlohr Bueso 
> ---
>  Documentation/filesystems/proc.txt | 12 +++---
>  fs/proc/stat.c | 45 
> --
>  2 files changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..563b01decb1e 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -27,7 +27,7 @@ Table of Contents
>1.5SCSI info
>1.6Parallel port info in /proc/parport
>1.7TTY info in /proc/tty
> -  1.8Miscellaneous kernel statistics in /proc/stat
> +  1.8Miscellaneous kernel statistics in /proc/stat and /proc/stat2
>1.9Ext4 file system parameters
>  
>2  Modifying System Parameters
> @@ -140,6 +140,7 @@ Table 1-1: Process specific entries in /proc
>   mem Memory held by this process
>   rootLink to the root directory of this process
>   statProcess status
> + stat2   Process status without irq information
>   statm   Process memory status information
>   status  Process status in human readable form
>   wchan   Present with CONFIG_KALLSYMS=y: it shows the kernel 
> function
> @@ -1301,8 +1302,8 @@ To see  which  tty's  are  currently in use, you can 
> simply look into the file
>unknown  /dev/tty41-63 console 
>  
>  
> -1.8 Miscellaneous kernel statistics in /proc/stat
> --
> +1.8 Miscellaneous kernel statistics in /proc/stat and /proc/stat2
> +-
>  
>  Various pieces   of  information about  kernel activity  are  available in 
> the
>  /proc/stat file.  All  of  the numbers reported  in  this file are  
> aggregates
> @@ -1371,6 +1372,11 @@ of the possible system softirqs. The first column is 
> the total of all
>  softirqs serviced; each subsequent column is the total for that particular
>  softirq.
>  
> +The stat2 file acts as a performance alternative to /proc/stat for workloads

A "performant alternative", right?

-Longman



Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Waiman Long
On 10/29/2018 04:38 PM, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Waiman Long wrote:
>
>> BTW, since you are making stat2 compatible with stat, will that be
>> easier from the user API perspective if we use a sysctl parameter to
>> turn on and off IRQs reporting for /proc/stat, for example?
>
> For one /proc/stat is also common for debugging envs (ie: performance)
> and I fear that if a tunnable modifies the behavior of the output, we
> it might never be usable again (at least not without having users also
> now consider the systctl parameter). Making it dynamic I think is not
> worth it.
>
> Thanks,
> Davidlohr

This is just a matter if it is easier for users to modify their code to
use /proc/stat2 or turning on a sysctl parameter. Again, this will
certainly depend on the circumstances.

Cheers,
Longman




Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Davidlohr Bueso

On Mon, 29 Oct 2018, Waiman Long wrote:


BTW, since you are making stat2 compatible with stat, will that be
easier from the user API perspective if we use a sysctl parameter to
turn on and off IRQs reporting for /proc/stat, for example?


For one /proc/stat is also common for debugging envs (ie: performance)
and I fear that if a tunnable modifies the behavior of the output, we
it might never be usable again (at least not without having users also
now consider the systctl parameter). Making it dynamic I think is not
worth it.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Waiman Long
On 10/29/2018 04:00 PM, Davidlohr Bueso wrote:
> On Mon, 29 Oct 2018, Waiman Long wrote:
>
>> I am wondering if /proc/stat_noirqs will be a more descriptive name of
>> the intent of this new procfs file or we should just go with the more
>> generic stat2 name.
>
> The reason why I went with '2' instead of a more rescriptive name
> was that I think of the call as a drop-in replacement/extention to
> stat. Therefore the same fields are maintained, otherwise with
> stat_noirqs
> I feel like instead of zeroing out, they should just be removed.
>
> But otoh, I have no strong objection in renaming either.
>
> Thanks,
> Davidlohr

I am just questioning the rationale for the stat2 name. I am not
advocating to use stat_noirqs neither.

BTW, since you are making stat2 compatible with stat, will that be
easier from the user API perspective if we use a sysctl parameter to
turn on and off IRQs reporting for /proc/stat, for example?

I know that there are pros and cons for each approach, I just want to
consider all the available options and choose the best one.

Cheers,
Longman





Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Alexey Dobriyan
> I am wondering if /proc/stat_noirqs will be a more descriptive name of
> the intent of this new procfs file or we should just go with the more
> generic stat2 name.

What would you do if someone asks for /proc/stat without CPU numbers?


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Davidlohr Bueso

On Mon, 29 Oct 2018, Waiman Long wrote:


I am wondering if /proc/stat_noirqs will be a more descriptive name of
the intent of this new procfs file or we should just go with the more
generic stat2 name.


The reason why I went with '2' instead of a more rescriptive name
was that I think of the call as a drop-in replacement/extention to
stat. Therefore the same fields are maintained, otherwise with stat_noirqs
I feel like instead of zeroing out, they should just be removed.

But otoh, I have no strong objection in renaming either.

Thanks,
Davidlohr


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Waiman Long
On 10/29/2018 03:25 PM, Davidlohr Bueso wrote:
> A recent report from a large database vendor which I shall not name
> shows concerns about poor performance when consuming /proc/stat info.
> Particularly  kstat_irq() pops up in the profiles and most time is
> being spent there. The overall system is under a lot of irqs and
> almost 1k cores, thus this comes to little surprise.
>
> Granted that procfs in general is not known for its performance,
> nor designed for it, for that matter. Some users, however may be able
> to overcome this performance limitation, some not. Therefore it isn't
> bad having a kernel option for users that don't want any hard irq info
> -- and care enough about this.
>
> This patch introduces a new /proc/stat2 file that is identical to the
> regular 'stat' except that it zeroes all hard irq statistics. The new
> file is a drop in replacement to stat for users that need performance.
>
> The stat file is not touched, of course -- this was also previously
> suggested by Waiman:
> https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-long...@redhat.com/
>
> Signed-off-by: Davidlohr Bueso 

I am wondering if /proc/stat_noirqs will be a more descriptive name of
the intent of this new procfs file or we should just go with the more
generic stat2 name.

Cheers,
Longman

> ---
>  Documentation/filesystems/proc.txt | 12 +++---
>  fs/proc/stat.c | 45 
> --
>  2 files changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..563b01decb1e 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -27,7 +27,7 @@ Table of Contents
>1.5SCSI info
>1.6Parallel port info in /proc/parport
>1.7TTY info in /proc/tty
> -  1.8Miscellaneous kernel statistics in /proc/stat
> +  1.8Miscellaneous kernel statistics in /proc/stat and /proc/stat2
>1.9Ext4 file system parameters
>  
>2  Modifying System Parameters
> @@ -140,6 +140,7 @@ Table 1-1: Process specific entries in /proc
>   mem Memory held by this process
>   rootLink to the root directory of this process
>   statProcess status
> + stat2   Process status without irq information
>   statm   Process memory status information
>   status  Process status in human readable form
>   wchan   Present with CONFIG_KALLSYMS=y: it shows the kernel 
> function
> @@ -1301,8 +1302,8 @@ To see  which  tty's  are  currently in use, you can 
> simply look into the file
>unknown  /dev/tty41-63 console 
>  
>  
> -1.8 Miscellaneous kernel statistics in /proc/stat
> --
> +1.8 Miscellaneous kernel statistics in /proc/stat and /proc/stat2
> +-
>  
>  Various pieces   of  information about  kernel activity  are  available in 
> the
>  /proc/stat file.  All  of  the numbers reported  in  this file are  
> aggregates
> @@ -1371,6 +1372,11 @@ of the possible system softirqs. The first column is 
> the total of all
>  softirqs serviced; each subsequent column is the total for that particular
>  softirq.
>  
> +The stat2 file acts as a performance alternative to /proc/stat for workloads
> +and systems that care and are under heavy irq load. In order to to be 
> completely
> +compatible, /proc/stat and /proc/stat2 are identical with the exception that 
> the
> +later will show 0 for any (hard)irq-related fields. This refers particularly
> +to the "intr" line and 'irq' column for that aggregate in the cpu line.
>  
>  1.9 Ext4 file system parameters
>  ---
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 535eda7857cf..349040270003 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -79,7 +79,7 @@ static u64 get_iowait_time(int cpu)
>  
>  #endif
>  
> -static int show_stat(struct seq_file *p, void *v)
> +static int __show_stat(struct seq_file *p, void *v, bool irq_stats)
>  {
>   int i, j;
>   u64 user, nice, system, idle, iowait, irq, softirq, steal;
> @@ -100,13 +100,17 @@ static int show_stat(struct seq_file *p, void *v)
>   system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>   idle += get_idle_time(i);
>   iowait += get_iowait_time(i);
> - irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
>   softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
>   steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
>   guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
>   guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
> - sum += kstat_cpu_irqs_sum(i);
> - sum += arch_irq_stat_cpu(i);
> +
> + if (irq_stats) {
> +   

[PATCH] fs/proc: introduce /proc/stat2 file

2018-10-29 Thread Davidlohr Bueso
A recent report from a large database vendor which I shall not name
shows concerns about poor performance when consuming /proc/stat info.
Particularly  kstat_irq() pops up in the profiles and most time is
being spent there. The overall system is under a lot of irqs and
almost 1k cores, thus this comes to little surprise.

Granted that procfs in general is not known for its performance,
nor designed for it, for that matter. Some users, however may be able
to overcome this performance limitation, some not. Therefore it isn't
bad having a kernel option for users that don't want any hard irq info
-- and care enough about this.

This patch introduces a new /proc/stat2 file that is identical to the
regular 'stat' except that it zeroes all hard irq statistics. The new
file is a drop in replacement to stat for users that need performance.

The stat file is not touched, of course -- this was also previously
suggested by Waiman:
https://lore.kernel.org/lkml/1524166562-5644-1-git-send-email-long...@redhat.com/

Signed-off-by: Davidlohr Bueso 
---
 Documentation/filesystems/proc.txt | 12 +++---
 fs/proc/stat.c | 45 --
 2 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 12a5e6e693b6..563b01decb1e 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -27,7 +27,7 @@ Table of Contents
   1.5  SCSI info
   1.6  Parallel port info in /proc/parport
   1.7  TTY info in /proc/tty
-  1.8  Miscellaneous kernel statistics in /proc/stat
+  1.8  Miscellaneous kernel statistics in /proc/stat and /proc/stat2
   1.9  Ext4 file system parameters
 
   2Modifying System Parameters
@@ -140,6 +140,7 @@ Table 1-1: Process specific entries in /proc
  mem   Memory held by this process
  root  Link to the root directory of this process
  stat  Process status
+ stat2 Process status without irq information
  statm Process memory status information
  statusProcess status in human readable form
  wchan Present with CONFIG_KALLSYMS=y: it shows the kernel function
@@ -1301,8 +1302,8 @@ To see  which  tty's  are  currently in use, you can 
simply look into the file
   unknown  /dev/tty41-63 console 
 
 
-1.8 Miscellaneous kernel statistics in /proc/stat
--
+1.8 Miscellaneous kernel statistics in /proc/stat and /proc/stat2
+-
 
 Various pieces   of  information about  kernel activity  are  available in the
 /proc/stat file.  All  of  the numbers reported  in  this file are  aggregates
@@ -1371,6 +1372,11 @@ of the possible system softirqs. The first column is the 
total of all
 softirqs serviced; each subsequent column is the total for that particular
 softirq.
 
+The stat2 file acts as a performance alternative to /proc/stat for workloads
+and systems that care and are under heavy irq load. In order to to be 
completely
+compatible, /proc/stat and /proc/stat2 are identical with the exception that 
the
+later will show 0 for any (hard)irq-related fields. This refers particularly
+to the "intr" line and 'irq' column for that aggregate in the cpu line.
 
 1.9 Ext4 file system parameters
 ---
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 535eda7857cf..349040270003 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -79,7 +79,7 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
-static int show_stat(struct seq_file *p, void *v)
+static int __show_stat(struct seq_file *p, void *v, bool irq_stats)
 {
int i, j;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
@@ -100,13 +100,17 @@ static int show_stat(struct seq_file *p, void *v)
system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
idle += get_idle_time(i);
iowait += get_iowait_time(i);
-   irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-   sum += kstat_cpu_irqs_sum(i);
-   sum += arch_irq_stat_cpu(i);
+
+   if (irq_stats) {
+   irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
+
+   sum += kstat_cpu_irqs_sum(i);
+   sum += arch_irq_stat_cpu(i);
+   }
 
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -115,7 +119,9 @@ static int show_stat(struct seq_file *p, void *v)
sum_softirq += softirq_stat;
}
}
-   s