Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-06-04 Thread Amit Kucheria
On Tue, May 5, 2020 at 3:07 AM David Rientjes wrote:
>
> On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote:
>
> > There is currently no common way, shared throughout the Linux kernel,
> > for subsystems to expose statistics to userspace; each subsystem has to
> > take care of gathering and displaying statistics by itself, for example
> > in the form of files in debugfs. KVM, for instance, has its own code
> > section that takes care of this in virt/kvm/kvm_main.c, where it sets up
> > debugfs handlers for displaying values and aggregating them from various
> > subfolders to obtain information about the system state (e.g. displaying
> > the total number of exits, calculated by summing the exits of all vcpus
> > of all running virtual machines).
> >
> > Letting each part of the kernel do so has two disadvantages. First, it
> > introduces redundant code. Second, debugfs is in any case not the right
> > place for statistics (for example, it is affected by lockdown).
> >
> > In this patch series I introduce statsfs, a synthetic ram-based virtual
> > filesystem that takes care of gathering and displaying statistics for the
> > Linux kernel subsystems.
> >
>
> This is exciting, we have been looking in the same area recently.  Adding
> Jonathan Adams.
>
> In your diffstat, one thing I notice is omitted: an update to
> Documentation/* :)  Any chance of getting some proposed Documentation/
> updates with structure of the fs, the per subsystem breakdown, and best
> practices for managing the stats from the kernel level?
>
> > The file system is mounted on /sys/kernel/stats and would already be
> > used by KVM. Statsfs was initially proposed by Paolo Bonzini [1].
> >
> > Statsfs offers a generic and stable API, allowing any kind of
> > directory/file organization and supporting multiple kinds of aggregation
> > (not only sum, but also average, max, min and count_zero) and data types
> > (all unsigned and signed integer types plus boolean). The implementation,
> > which is a generalization of KVM’s debugfs statistics code, takes care of
> > gathering and displaying information at run time; users only need to
> > specify the values to be included in each source.
> >
> > Statsfs also uses a different mountpoint from debugfs, and therefore
> > does not suffer from the limited access imposed by the security lockdown
> > patches. Its main function is to display each statistic as a file in the
> > desired folder hierarchy defined through the API. Statsfs files can be
> > read, and possibly cleared if their file mode allows it.
> >
> > Statsfs has two main components: the public API defined by
> > include/linux/statsfs.h, and the virtual file system which should end up
> > in /sys/kernel/stats.
> >
> > The API has two main elements, values and sources. Kernel subsystems
> > like KVM can use the API to create a source, add child
> > sources/values/aggregates and register it as a root source (which on the
> > virtual fs appears under /sys/kernel/stats).
> >
> > Sources are created via statsfs_source_create(), and each source becomes a
> > directory in the file system. Sources form a parent-child relationship;
> > root sources are added to the file system via statsfs_source_register().
> > Every other source is added to or removed from a parent through the
> > statsfs_source_add_subordinate and statsfs_source_remove_subordinate APIs.
> > Once a source is created and added to the tree (via add_subordinate), it
> > will be used to compute aggregate values in the parent source.
> >
> > Values represent quantities that are gathered by the statsfs user. Examples
> > of values include the number of vm exits of a given kind, the amount of
> > memory used by some data structure, the length of the longest hash table
> > chain, or anything like that. Values are defined with the
> > statsfs_source_add_values function. Each value is defined by a struct
> > statsfs_value; the same statsfs_value can be added to many different
> > sources. A value can be considered "simple" if it fetches data from a
> > user-provided location, or "aggregate" if it groups all values in the
> > subordinate sources that include the same statsfs_value.
> >
>
> This seems like it could have a lot of overhead if we wanted to
> periodically track the totality of subsystem stats as a form of telemetry
> gathering from userspace.  To collect telemetry for 1,000 different stats,
> do we need to issue lseek()+read() syscalls for each of them individually
> (or, worse, open()+read()+close())?
>
> Any thoughts on how that can be optimized?  A couple of ideas:
>
>  - an interface that allows gathering of all stats for a particular
>    interface through a single file that would likely be encoded in binary
>    and the responsibility of userspace to disseminate, or
>
>  - an interface that extends beyond this proposal and allows the reader to
>    specify which stats they are interested in collecting and then the
>    kernel will only provide these stats in a well formed structure and
>    also be binary encoded.
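For scale, here is roughly what such a collection pass costs from userspace
with the one-file-per-stat layout (a sketch; the paths array and the helper
are hypothetical):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical collector: three syscalls per statistic, on every pass. */
static void collect_stats(const char **paths, long *out, int n)
{
	char buf[32];
	int i;

	for (i = 0; i < n; i++) {
		int fd = open(paths[i], O_RDONLY);	/* syscall 1 */
		ssize_t len;

		if (fd < 0)
			continue;
		len = read(fd, buf, sizeof(buf) - 1);	/* syscall 2 */
		if (len > 0) {
			buf[len] = '\0';
			out[i] = atol(buf);		/* parse ASCII value */
		}
		close(fd);				/* syscall 3 */
	}
}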

Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-14 Thread Paolo Bonzini
On 14/05/20 19:35, Jonathan Adams wrote:
>> In general for statsfs we took a more explicit approach where each
>> addend in a sum is a separate stats_fs_source.  In this version of the
>> patches it's also a directory, but we'll take your feedback and add both
>> the ability to hide directories (first) and to list values (second).
>> 
>> So, in the cases of interfaces and KVM objects I would prefer to keep
>> each addend separate.
>
> This feels like a lot of churn just to add a statistic or object;
> in your model, every time a KVM or VCPU is created, you create the N
> statistics, leading to N*M total objects.

While it's N*M files, only O(M) statsfs API calls are needed to create
them.  Whether you have O(N*M) total kmalloc-ed objects or O(M) is an
implementation detail.

Having O(N*M) API calls would be a non-starter, I agree - especially
once you start thinking of more efficient publishing mechanisms that,
unlike files, are also O(M).
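Concretely, the intended shape is something like this (a sketch only: the
.name field, the kvm_vcpu_stat layout and the exact signatures are
illustrative rather than the literal v2 API):

/* One source per VCPU, one call to publish a whole array of values:
 * O(1) API calls per object even though it produces N files. */
static struct stats_fs_value vcpu_values[] = {
	{ .name = "halt_exits",
	  .arg = (void *)offsetof(struct kvm_vcpu_stat, halt_exits) },
	{ .name = "io_exits",
	  .arg = (void *)offsetof(struct kvm_vcpu_stat, io_exits) },
	{ }	/* sentinel */
};

vcpu_src = stats_fs_source_create("vcpu-%d", vcpu->vcpu_id);
stats_fs_source_add_values(vcpu_src, vcpu_values, &vcpu->stat);
stats_fs_source_add_subordinate(vm_src, vcpu_src);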

>> For CPUs that however would be pretty bad.  Many subsystems might
>> accumulate stats percpu for performance reasons, which would then be
>> exposed as the sum (usually).  So yeah, native handling of percpu values
>> makes sense.  I think it should fit naturally into the same custom
>> aggregation framework as hash table keys, we'll see if there's any devil
>> in the details.
>>
>> Core kernel stats such as /proc/interrupts or /proc/stat are the
>> exception here, since individual per-CPU values can be vital for
>> debugging.  For those, creating a source per stat, possibly on-the-fly
>> at hotplug/hot-unplug time because NR_CPUS can be huge, would still be
>> my preferred way to do it.
> 
> Our metricfs has basically two modes: report all per-CPU values (for
> the IPI counts etc; you pass a callback which takes a 'int cpu'
> argument) or a callback that sums over CPUs and reports the full
> value.  It also seems burdensome for any subsystem with a per-CPU stat
> to have to install a hotplug callback to add/remove statistics.

Yes, this is also why I think percpu values should have some kind of
native handling.  Reporting per-CPU values individually is the exception.

> In my model, a "CPU" parameter enum which is automatically kept
> up-to-date is probably sufficient for the "report all per-CPU values".

Yes (or a separate CPU source in my model).

Paolo

> Does this make sense to you?  I realize that this is a significant
> change to the model y'all are starting with; I'm willing to do the
> work to flesh it out.


> Thanks for your time,
> - Jonathan
> 
> P.S.  Here's a summary of the types of statistics we use in metricfs
> in google, to give a little context:
> 
> - integer values (single value per stat, source also a single value);
> a couple of these are boolean values exported as '0' or '1'.
> - per-CPU integer values, reported as a (cpu, value) table
> - per-CPU integer values, summed and reported as an aggregate
> - single-value stats, with keys related to objects:
> - many per-device (disk, network, etc) integer stats
> - some per-device string data (version strings, UUIDs, and
> occasional statuses).
> - a few histograms (usually counts by duration ranges)
> - the "function name" to count for the WARN statistic I mentioned.
> - A single statistic with two keys (for livepatch statistics; the
> value is the livepatch status as a string)
> 
> Most of the stats with keys are "complete" (every key has a value),
> but there are several examples of statistics where only some of the
> possible keys have values, or (e.g. for networking statistics) only
> the keys visible to the reading process (e.g. in its namespaces) are
> included.
> 



Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-11 Thread Paolo Bonzini
Hi Jonathan, I think the remaining sticky point is this one:

On 11/05/20 19:02, Jonathan Adams wrote:
> I think I'd characterize this slightly differently; we have a set of
> statistics which are essentially "in parallel":
> 
>   - a variety of statistics, N CPUs they're available for, or
>   - a variety of statistics, N interfaces they're available for.
>   - a variety of statistics, N kvm objects they're available for.
> 
> Recreating a parallel hierarchy of statistics any time we add/subtract
> a CPU or interface seems like a lot of overhead.  Perhaps a better 
> model would be some sort of "parameter enum" (naming is hard;
> parameter set?), so when a CPU/network interface/etc is added you'd
> add its ID to the "CPUs" we know about, and at removal time you'd
> take it out; it would have an associated cbarg for the value-getting
> callback.
> 
>> Yep, the above "not create a dentry" flag would handle the case where
>> you sum things up in the kernel because the more fine-grained counters
>> would be overwhelming.
>
> nodnod; or the callback could handle the sum itself.

In general for statsfs we took a more explicit approach where each
addend in a sum is a separate stats_fs_source.  In this version of the
patches it's also a directory, but we'll take your feedback and add both
the ability to hide directories (first) and to list values (second).

So, in the cases of interfaces and KVM objects I would prefer to keep
each addend separate.

For CPUs that however would be pretty bad.  Many subsystems might
accumulate stats percpu for performance reasons, which would then be
exposed as the sum (usually).  So yeah, native handling of percpu values
makes sense.  I think it should fit naturally into the same custom
aggregation framework as hash table keys, we'll see if there's any devil
in the details.

Core kernel stats such as /proc/interrupts or /proc/stat are the
exception here, since individual per-CPU values can be vital for
debugging.  For those, creating a source per stat, possibly on-the-fly
at hotplug/hot-unplug time because NR_CPUS can be huge, would still be
my preferred way to do it.
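For example, percpu summing could be a single custom get callback along
these lines (illustrative only; it reuses the callback shape sketched in
the May 8 message below and assumes val->arg locates the percpu pointer):

static uint64_t stats_fs_get_percpu_sum(struct stats_fs_value *val,
					void *base)
{
	/* Sum the per-CPU counter at read time; only the total is shown. */
	u64 __percpu *ctr = *(u64 __percpu **)(base + (uintptr_t)val->arg);
	u64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(ctr, cpu);

	return sum;
}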

Thanks,

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-11 Thread Emanuele Giuseppe Esposito


On 5/8/20 11:44 AM, Paolo Bonzini wrote:
> So in general I'd say the sources/values model holds up.  We certainly
> want to:
> 
> - switch immediately to callbacks instead of the type constants (so that
> core statsfs code only does signed/unsigned)
> 
> - add a field to distinguish cumulative and floating properties (and use
> it to determine the default file mode)
> 
> - add a new argument to statsfs_create_source and statsfs_create_values
> that makes it not create directories and files respectively
> 
> - add a new API to look for a statsfs_value recursively in all the
> subordinate sources, and pass the source/value pair to a callback
> function; and reimplement recursive aggregation and clear in terms of
> this function.

Ok, I will apply this, thank you for all the suggestions.
I will post the v3 patchset in the next few weeks.

In the meantime, I wrote the documentation you asked for (even though it's
going to change in v3); you can find it here:

https://github.com/esposem/linux/commit/dfa92f270f1aed73d5f3b7f12640b2a1635c711f

Thank you,
Emanuele



Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-08 Thread Paolo Bonzini
[Answering for Emanuele because he's not available until Monday]

On 07/05/20 19:45, Jonathan Adams wrote:
> This is good work.  As David Rientjes mentioned, I'm currently
> investigating a similar project, based on a google-internal
> debugfs-based FS we call "metricfs".  It's designed in a slightly
> different fashion than statsfs is, and the statistics exported are
> mostly fed into our OpenTelemetry-like system.  We're motivated by
> wanting an upstreamed solution, so that we can upstream the metrics we
> create that are of general interest, and lower the overall rebasing
> burden for our tree.

Cool.  We included a public reading API exactly so that there could be
other "frontends".  I was mostly thinking of BPF as an in-tree user, but
your metricfs could definitely use the reading API.

>  - the 8/16/32/64 signed/unsigned integers seem like a wart, and the
> built-in support to grab any offset from a structure doesn't seem like
> much of an advantage.  A simpler interface would be to just support an
> "integer" (possibly signed/unsigned) type, which is always 64-bit, and
> allow the caller to provide a function pointer to retrieve the value,
> with one or two void * cbargs.  Then the framework could provide an
> offset-based callback (or callbacks) similar to the existing
> functionality, and a similar one for per-CPU based statistics.  A
> second "clear" callback could be optionally provided to allow for
> statistics to be cleared, as in your current proposal.

Ok, so basically splitting get_simple_value into many separate
callbacks.  The callbacks would be in a struct like

struct stats_fs_type {
	uint64_t (*get)(struct stats_fs_value *, void *);
	void (*clear)(struct stats_fs_value *, void *);
	bool is_signed;	/* "signed" is a C keyword, so spell it out */
};

static uint64_t stats_fs_get_u8(struct stats_fs_value *val, void *base)
{
	return *((uint8_t *)(base + (uintptr_t)val->arg));
}

static void stats_fs_clear_u8(struct stats_fs_value *val, void *base)
{
	*((uint8_t *)(base + (uintptr_t)val->arg)) = 0;
}

struct stats_fs_type stats_fs_type_u8 = {
	stats_fs_get_u8,
	stats_fs_clear_u8,
	false,
};

and custom types can be defined using "&(struct stats_fs_type) {...}".
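For instance, an atomic64_t counter would just be another custom type
(purely illustrative, not part of the series):

static uint64_t stats_fs_get_atomic64(struct stats_fs_value *val, void *base)
{
	/* val->arg holds the offset of the atomic64_t past the base pointer. */
	return atomic64_read((atomic64_t *)(base + (uintptr_t)val->arg));
}

static void stats_fs_clear_atomic64(struct stats_fs_value *val, void *base)
{
	atomic64_set((atomic64_t *)(base + (uintptr_t)val->arg), 0);
}

#define STATS_FS_TYPE_ATOMIC64 \
	(&(struct stats_fs_type) { stats_fs_get_atomic64, \
				   stats_fs_clear_atomic64, false })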

>  - Beyond the statistic's type, one *very* useful piece of metadata
> for telemetry tools is knowing whether a given statistic is
> "cumulative" (an unsigned counter which is only ever increased), as
> opposed to a floating value (like "amount of memory used").

Good idea.  Also, clearing does not make sense for a floating value, so
we can use cumulative/floating to get a default for the mode: KVM
statistics for example are mostly cumulative and mode 644, except a few
that are floating and those are all mode 444.  Therefore it makes sense
to add cumulative/floating even before outputting it as metadata.
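In code, the default could be as simple as this (the flags field and
STATS_FS_FLOATING are made-up names, just to show the idea):

static umode_t stats_fs_default_mode(struct stats_fs_value *val)
{
	/* Clearing only makes sense for cumulative counters, so floating
	 * values default to read-only. */
	return (val->flags & STATS_FS_FLOATING) ? 0444 : 0644;
}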

> I'm more
> concerned with getting the statistics model and capabilities right
> from the beginning, because those are harder to adjust later.

Agreed.

> 1. Each metricfs metric can have one or two string or integer "keys".
> If these exist, they expand the metric from a single value into a
> multi-dimensional table. For example, we use this to report a hash
> table we keep of functions calling "WARN()", in a 'warnings'
> statistic:
> 
> % cat .../warnings/values
> x86_pmu_stop 1
> %
>
> Indicates that the x86_pmu_stop() function has had a WARN() fire once
> since the system was booted.  If multiple functions have fired
> WARN()s, they are listed in this table with their own counts. [1]  We
> also use these to report per-CPU counters on a CPU-by-CPU basis:
> 
> % cat .../irq_x86/NMI/values
> 0 42
> 1 18
> ... one line per cpu
> % cat .../rx_bytes/values
> lo 501360681
> eth0 1457631256

These seem like two different things.

The percpu and per-interface values are best represented as subordinate
sources, one per CPU and one per interface.  For interfaces I would just
use a separate directory, but it doesn't really make sense for CPUs.  So
if we can cater for it in the model, it's better.  For example:

- add a new argument to statsfs_create_source and statsfs_create_values
that makes it not create directories and files respectively.

- add a new "aggregate function" STATS_FS_LIST that directs the parent
to build a table of all the simple values below it

We can also add a helper statsfs_add_values_percpu that creates a new
source for each CPU, I think.

The warnings one instead is a real hash table.  It should be possible to
implement it as some kind of customized aggregation, that is implemented
in the client instead of coming from subordinate sources.  The
presentation can then just use STATS_FS_LIST.  I don't see anything in
the design that is a blocker.

> 2.  We also export some metadata about each statistic.  For example,
> the metadata for the NMI counter above looks like:
> 
> % cat .../NMI/annotations
> DESCRIPTION Non-maskable\ interrupts
> CUMULATIVE
> % cat .../NMI/fields
> cpu value
> int int
> %

Good 

Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-05 Thread Christian Borntraeger
Adding Stefan Raspl, who has done a lot of kvm_stat work in the past.

On 05.05.20 19:21, Paolo Bonzini wrote:
> On 05/05/20 19:07, David Rientjes wrote:
>>> I am totally in favor of having a binary format, but it should be
>>> introduced as a separate series on top of this one---and preferably by
>>> someone who has already put some thought into the problem (which
>>> Emanuele and I have not, beyond ensuring that the statsfs concept and
>>> API is flexible enough).
>>>
>> The concern is that once this series is merged then /sys/kernel/stats 
>> could be considered an ABI and there would be a reasonable expectation 
>> that it will remain stable, in so far as the stats that userspace is 
>> interested in are stable and not obsoleted.
>>
>> So is this a suggestion that the binary format becomes complementary to 
>> statsfs and provide a means for getting all stats from a single subsystem, 
>> or that this series gets converted to such a format before it is merged?
> 
> The binary format should be complementary.  The ASCII format should
> indeed be considered stable even though individual statistics would come
> and go.  It may make sense to allow disabling ASCII files via mount
> and/or Kconfig options; but either way, the binary format can and should
> be added on top.
> 
> I have not put any thought into what the binary format would look like
> and what its features would be.  For example these are but the first
> questions that come to mind:
> 
> * would it be possible to read/clear an arbitrary statistic with
> pread/pwrite, or do you have to read all of them?
> 
> * if userspace wants to read the schema just once and then read the
> statistics many times, how is it informed of schema changes?
> 
> * and of course the details of how the schema (names of stat and
> subsources) is encoded and what details it should include about the
> values (e.g. type or just signedness).
> 
> Another possibility is to query stats via BPF.  This could be a third
> way to access the stats, or it could be an alternative to a binary format.
> 
> Paolo
> 


Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-05 Thread Paolo Bonzini
On 05/05/20 19:07, David Rientjes wrote:
>> I am totally in favor of having a binary format, but it should be
>> introduced as a separate series on top of this one---and preferably by
>> someone who has already put some thought into the problem (which
>> Emanuele and I have not, beyond ensuring that the statsfs concept and
>> API is flexible enough).
>>
> The concern is that once this series is merged then /sys/kernel/stats 
> could be considered an ABI and there would be a reasonable expectation 
> that it will remain stable, in so far as the stats that userspace is 
> interested in are stable and not obsoleted.
> 
> So is this a suggestion that the binary format becomes complementary to 
> statsfs and provide a means for getting all stats from a single subsystem, 
> or that this series gets converted to such a format before it is merged?

The binary format should be complementary.  The ASCII format should
indeed be considered stable even though individual statistics would come
and go.  It may make sense to allow disabling ASCII files via mount
and/or Kconfig options; but either way, the binary format can and should
be added on top.

I have not put any thought into what the binary format would look like
and what its features would be.  For example these are but the first
questions that come to mind:

* would it be possible to read/clear an arbitrary statistic with
pread/pwrite, or do you have to read all of them?

* if userspace wants to read the schema just once and then read the
statistics many times, how is it informed of schema changes?

* and of course the details of how the schema (names of stat and
subsources) is encoded and what details it should include about the
values (e.g. type or just signedness).
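Purely to make the first two questions concrete, one hypothetical encoding
(nothing like this exists yet) could pair a schema file with a fixed-stride
data file:

/* Hypothetical schema record, one per statistic.  Userspace reads the
 * schema once, then pread()s the data file at 'offset' to fetch an
 * individual value, or pwrite()s zero there to clear it. */
struct stats_fs_schema_entry {
	char     name[64];	/* e.g. "kvm/vm-42/exits" */
	uint64_t offset;	/* byte offset of the value in the data file */
	uint8_t  is_signed;	/* signedness, the only type detail kept */
	uint8_t  cumulative;	/* counter vs. floating value */
	uint8_t  pad[6];	/* fixed 80-byte records */
};

A generation number at the head of both files could then let a reader
detect schema changes between scans.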

Another possibility is to query stats via BPF.  This could be a third
way to access the stats, or it could be an alternative to a binary format.

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-05 Thread Paolo Bonzini
On 05/05/20 18:53, Jim Mattson wrote:
>>> Since this is becoming a generic API (good!!), maybe we can discuss
>>> possible ways to optimize gathering of stats en masse?
>> Sure, the idea of a binary format was considered from the beginning in
>> [1], and it can be done either together with the current filesystem, or
>> as a replacement via different mount options.
> 
> ASCII stats are not scalable. A binary format is definitely the way to go.

I am totally in favor of having a binary format, but it should be
introduced as a separate series on top of this one---and preferably by
someone who has already put some thought into the problem (which
Emanuele and I have not, beyond ensuring that the statsfs concept and
API is flexible enough).

ASCII stats are necessary for quick userspace consumption and for
backwards compatibility with KVM debugfs (which is not an ABI, but it's
damn useful and should not be dropped without providing something as
handy), so this is what this series starts from.

Paolo



Re: [PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-05 Thread Emanuele Giuseppe Esposito




On 5/4/20 11:37 PM, David Rientjes wrote:

> On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote:
>
> > In this patch series I introduce statsfs, a synthetic ram-based virtual
> > filesystem that takes care of gathering and displaying statistics for
> > the Linux kernel subsystems.
>

> This is exciting, we have been looking in the same area recently.  Adding
> Jonathan Adams.
>
> In your diffstat, one thing I notice is omitted: an update to
> Documentation/* :)  Any chance of getting some proposed Documentation/
> updates with structure of the fs, the per subsystem breakdown, and best
> practices for managing the stats from the kernel level?


Yes, I will write some documentation. Thank you for the suggestion.



> > Values represent quantities that are gathered by the statsfs user.
> > Examples of values include the number of vm exits of a given kind, the
> > amount of memory used by some data structure, the length of the longest
> > hash table chain, or anything like that. Values are defined with the
> > statsfs_source_add_values function. Each value is defined by a struct
> > statsfs_value; the same statsfs_value can be added to many different
> > sources. A value can be considered "simple" if it fetches data from a
> > user-provided location, or "aggregate" if it groups all values in the
> > subordinate sources that include the same statsfs_value.



> This seems like it could have a lot of overhead if we wanted to
> periodically track the totality of subsystem stats as a form of telemetry
> gathering from userspace.  To collect telemetry for 1,000 different stats,
> do we need to issue lseek()+read() syscalls for each of them individually
> (or, worse, open()+read()+close())?
>
> Any thoughts on how that can be optimized?  A couple of ideas:
>
>  - an interface that allows gathering of all stats for a particular
>    interface through a single file that would likely be encoded in binary
>    and the responsibility of userspace to disseminate, or
>
>  - an interface that extends beyond this proposal and allows the reader to
>    specify which stats they are interested in collecting and then the
>    kernel will only provide these stats in a well formed structure and
>    also be binary encoded.


Are you thinking of another file, containing all the stats for the 
directory in binary format?



> We've found that the one-file-per-stat method is pretty much a
> show-stopper from a performance point of view: we always must execute at
> least two syscalls to obtain a single stat.
>
> Since this is becoming a generic API (good!!), maybe we can discuss
> possible ways to optimize gathering of stats en masse?


Sure, the idea of a binary format was considered from the beginning in 
[1], and it can be done either together with the current filesystem, or 
as a replacement via different mount options.


Thank you,
Emanuele


[1] 
https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa3...@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M





> > Signed-off-by: Emanuele Giuseppe Esposito
> >
> > v1->v2: remove unnecessary list_foreach_safe loops, fix wrong indentation,
> > rename statsfs to stats_fs
> >
> > Emanuele Giuseppe Esposito (5):
> >   refcount, kref: add dec-and-test wrappers for rw_semaphores
> >   stats_fs API: create, add and remove stats_fs sources and values
> >   kunit: tests for stats_fs API
> >   stats_fs fs: virtual fs to show stats to the end-user
> >   kvm_main: replace debugfs with stats_fs
> >
> >  MAINTAINERS                     |    7 +
> >  arch/arm64/kvm/Kconfig          |    1 +
> >  arch/arm64/kvm/guest.c          |    2 +-
> >  arch/mips/kvm/Kconfig           |    1 +
> >  arch/mips/kvm/mips.c            |    2 +-
> >  arch/powerpc/kvm/Kconfig        |    1 +
> >  arch/powerpc/kvm/book3s.c       |    6 +-
> >  arch/powerpc/kvm/booke.c        |    8 +-
> >  arch/s390/kvm/Kconfig           |    1 +
> >  arch/s390/kvm/kvm-s390.c        |   16 +-
> >  arch/x86/include/asm/kvm_host.h |    2 +-
> >  arch/x86/kvm/Kconfig            |    1 +
> >  arch/x86/kvm/Makefile           |    2 +-
> >  arch/x86/kvm/debugfs.c          |   64 --
> >  arch/x86/kvm/stats_fs.c         |   56 ++
> >  arch/x86/kvm/x86.c              |    6 +-
> >  fs/Kconfig                      |   12 +
> >  fs/Makefile                     |    1 +
> >  fs/stats_fs/Makefile            |    6 +
> >  fs/stats_fs/inode.c             |  337 ++
> >  fs/stats_fs/internal.h          |   35 +
> >  fs/stats_fs/stats_fs-tests.c    | 1088 +++
> >  fs/stats_fs/stats_fs.c          |  773 ++
> >  include/linux/kref.h            |   11 +
> >  include/linux/kvm_host.h        |   39 +-
> >  include/linux/refcount.h        |    2 +
> >  include/linux/stats_fs.h        |  304 +
> >  include/uapi/linux/magic.h      |    1 +
> >  lib/refcount.c                  |   32 +
> >  tools/lib/api/fs/fs.c           |   21 +
> >  virt/kvm/arm/arm.c              |    2 +-
> >  virt/kvm/kvm_main.c             |  314 ++---
> >  32 files changed, 2772 insertions(+), 382 deletions(-)
> >  delete mode 100644 arch/x86/kvm/debugfs.c
> >  create mode 100644 arch/x86/kvm/stats_fs.c

[PATCH v2 0/5] Statsfs: a new ram-based file system for Linux kernel statistics

2020-05-04 Thread Emanuele Giuseppe Esposito
There is currently no common way, shared throughout the Linux kernel, for
subsystems to expose statistics to userspace; each subsystem has to take
care of gathering and displaying statistics by itself, for example in the
form of files in debugfs. KVM, for instance, has its own code section that
takes care of this in virt/kvm/kvm_main.c, where it sets up debugfs
handlers for displaying values and aggregating them from various
subfolders to obtain information about the system state (e.g. displaying
the total number of exits, calculated by summing the exits of all vcpus of
all running virtual machines).

Letting each part of the kernel do so has two disadvantages. First, it
introduces redundant code. Second, debugfs is in any case not the right
place for statistics (for example, it is affected by lockdown).

In this patch series I introduce statsfs, a synthetic ram-based virtual
filesystem that takes care of gathering and displaying statistics for the
Linux kernel subsystems.

The file system is mounted on /sys/kernel/stats and would already be used
by KVM. Statsfs was initially proposed by Paolo Bonzini [1].

Statsfs offers a generic and stable API, allowing any kind of
directory/file organization and supporting multiple kinds of aggregation
(not only sum, but also average, max, min and count_zero) and data types
(all unsigned and signed integer types plus boolean). The implementation,
which is a generalization of KVM’s debugfs statistics code, takes care of
gathering and displaying information at run time; users only need to
specify the values to be included in each source.

Statsfs also uses a different mountpoint from debugfs, and therefore does
not suffer from the limited access imposed by the security lockdown
patches. Its main function is to display each statistic as a file in the
desired folder hierarchy defined through the API. Statsfs files can be
read, and possibly cleared if their file mode allows it.

Statsfs has two main components: the public API defined by
include/linux/statsfs.h, and the virtual file system which should end up
in /sys/kernel/stats.

The API has two main elements, values and sources. Kernel subsystems like
KVM can use the API to create a source, add child sources/values/aggregates
and register it as a root source (which on the virtual fs appears under
/sys/kernel/stats).

Sources are created via statsfs_source_create(), and each source becomes a
directory in the file system. Sources form a parent-child relationship;
root sources are added to the file system via statsfs_source_register().
Every other source is added to or removed from a parent through the
statsfs_source_add_subordinate and statsfs_source_remove_subordinate APIs.
Once a source is created and added to the tree (via add_subordinate), it
will be used to compute aggregate values in the parent source.
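For example, a typical setup would look like this (simplified; see the
kerneldoc in patch 2 for the exact signatures):

	struct statsfs_source *kvm_src, *vm_src;

	kvm_src = statsfs_source_create("kvm");		/* /sys/kernel/stats/kvm */
	statsfs_source_register(kvm_src);		/* attach as a root source */

	vm_src = statsfs_source_create("vm-%d", pid);	/* one directory per VM */
	statsfs_source_add_subordinate(kvm_src, vm_src);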

Values represent quantities that are gathered by the statsfs user. Examples
of values include the number of vm exits of a given kind, the amount of
memory used by some data structure, the length of the longest hash table
chain, or anything like that. Values are defined with the
statsfs_source_add_values function. Each value is defined by a struct
statsfs_value; the same statsfs_value can be added to many different
sources. A value can be considered "simple" if it fetches data from a
user-provided location, or "aggregate" if it groups all values in the
subordinate sources that include the same statsfs_value.
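Continuing the sketch above, the same value can appear both as a simple
value in each VM source and as an aggregate in the parent (the aggregate
call shape shown here is hypothetical):

	static struct statsfs_value vm_values[] = {
		{ .name = "exits" /* offset/type details omitted */ },
		{ }	/* sentinel */
	};

	/* Simple value, fetched from each VM's own stats struct: */
	statsfs_source_add_values(vm_src, vm_values, &kvm->stat);
	/* Sum of "exits" across all subordinate VM sources: */
	statsfs_source_add_aggregate(kvm_src, vm_values, STATSFS_SUM);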

For more information, please consult the kerneldoc documentation in patch
2 and the sample uses in the kunit tests and in KVM.

This series of patches is based on my previous series "libfs: group and
simplify linux fs code" and the single patch sent to kvm "kvm_host: unify
VM_STAT and VCPU_STAT definitions in a single place". The former
simplifies code duplicated between debugfs and tracefs (on which statsfs
is based); the latter groups all macro definitions for statistics in KVM
into a single common file shared by all architectures.

Patch 1 adds new refcount and kref destructor wrappers that take a
semaphore, as those are used later by statsfs. Patch 2 introduces the
statsfs API, patch 3 provides extensive tests that can also be used as
examples of how to use the API, and patch 4 adds the file system support.
Finally, patch 5 provides a real-life example of statsfs usage in KVM.

[1] 
https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa3...@redhat.com/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M

Signed-off-by: Emanuele Giuseppe Esposito 

v1->v2: remove unnecessary list_foreach_safe loops, fix wrong indentation,
rename statsfs to stats_fs

Emanuele Giuseppe Esposito (5):
  refcount, kref: add dec-and-test wrappers for rw_semaphores
  stats_fs API: create, add and remove stats_fs sources and values
  kunit: tests for stats_fs API
  stats_fs fs: virtual fs to show stats to the end-user
  kvm_main: replace debugfs with stats_fs

 MAINTAINERS                     |    7 +