Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-24 Thread Tvrtko Ursulin



On 20/05/2021 09:35, Tvrtko Ursulin wrote:

On 19/05/2021 19:23, Daniel Vetter wrote:

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:



On 18/05/2021 10:40, Tvrtko Ursulin wrote:


On 18/05/2021 10:16, Daniel Stone wrote:

Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:

I was just wondering if stat(2) and a chrdev major check would be a
solid criterion to more efficiently (compared to parsing the text
content) detect DRM files while walking procfs.


Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?


Per pid and per each open fd.

As said in the other thread, what bothers me a bit in this scheme is that
the cost of obtaining GPU usage scales based on non-GPU criteria.

For the use case of a top-like tool which shows all processes this is a
smaller additional cost, but then for a gpu-top like tool it is somewhat
higher.


To further expand, not only would the cost scale as pids multiplied by open
fds, but to detect which of the fds are DRM I see these three options:

1) Open and parse fdinfo.
2) Name based matching ie /dev/dri/.. something.
3) Stat the symlink target and check for DRM major.


stat with symlink following should be plenty fast.


Maybe. I don't think my point about keeping the dentry cache needlessly 
hot is getting through at all. On my lightly loaded desktop:


  $ sudo lsof | wc -l
  599551

  $ sudo lsof | grep "/dev/dri/" | wc -l
  1965

It's going to look up ~600k pointless dentries in every iteration. Just 
to find a handful of DRM ones. Hard to say if that is better or worse 
than just parsing fdinfo text for all files. Will see.


CPU usage looks passable under a production (non-debug) kernel. A
once-per-second refresh period, on a not really that loaded system (115
running processes, 3096 open file descriptors as reported by lsof, none of
which are DRM), results in a system-call-heavy load:


real	0m55.348s
user	0m0.100s
sys	0m0.319s

The once-per-second loop is essentially along the lines of:

  for each pid in /proc:
    for each fd in /proc/<pid>/fdinfo:
      if fstatat(fd) is drm major:
        read fdinfo text in one sweep and parse it
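
For reference, a minimal C sketch of the per-fd check above, assuming the
canonical DRM character-device major of 226; the helper name is illustrative
and not taken from the actual patch:

#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

#define DRM_MAJOR 226	/* canonical DRM char device major */

/*
 * dirfd is an open handle on /proc/<pid>/fd; name is the fd number as a
 * string. fstatat() follows the symlink, so st describes the file the fd
 * actually refers to.
 */
static bool fd_is_drm(int dirfd, const char *name)
{
	struct stat st;

	if (fstatat(dirfd, name, &st, 0) < 0)
		return false;

	return S_ISCHR(st.st_mode) && major(st.st_rdev) == DRM_MAJOR;
}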

I'll post the quick intel_gpu_top patch for reference, but string parsing
in C leaves a few things to be desired there.
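
To illustrate the "one sweep" read, a minimal sketch; the
"drm-engine-render:" key here is a placeholder for whatever key names the
fdinfo format ends up exporting, not a settled ABI:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Read a /proc/<pid>/fdinfo/<fd> file with a single read(2) and scan it
 * for a hypothetical "drm-engine-render:" nanosecond counter.
 */
static long long read_engine_busy_ns(const char *fdinfo_path)
{
	char buf[4096];
	long long ns;
	ssize_t len;
	char *line;
	int fd;

	fd = open(fdinfo_path, O_RDONLY);
	if (fd < 0)
		return -1;
	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return -1;
	buf[len] = '\0';

	for (line = strtok(buf, "\n"); line; line = strtok(NULL, "\n"))
		if (sscanf(line, "drm-engine-render: %lld ns", &ns) == 1)
			return ns;

	return -1;
}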


Regards,

Tvrtko


Re: [Nouveau] [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-21 Thread arabek
> Well, if it becomes a problem, fixing the debugfs "clients" file and
> making it sysfs shouldn't be much of a problem later on.

Why not try using something along the lines of perf / opensnoop or BPF
to do the work? It should be efficient enough.

e.g.:
http://www.brendangregg.com/blog/2014-07-25/opensnoop-for-linux.html
https://man7.org/linux/man-pages/man2/bpf.2.html


Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-20 Thread Christian König




On 20.05.21 at 16:11, Daniel Vetter wrote:

On Wed, May 19, 2021 at 11:17:24PM +, Nieto, David M wrote:

[AMD Official Use Only]

Parsing over 550 processes for fdinfo takes between 40-100ms
single-threaded on a 2GHz Skylake (IBRS) within a VM, using simple
string comparisons and dirent parsing. And that is pretty much the
worst case scenario; more optimized implementations should do better.

I think this is plenty ok, and if it's not you could probably make this
massively faster with io_uring for all the fs operations and whack a
parser-generator on top for real parsing speed.


Well, if it becomes a problem, fixing the debugfs "clients" file and
making it sysfs shouldn't be much of a problem later on.


Christian.



So imo we shouldn't worry about algorithmic inefficiency of the fdinfo
approach at all, and focus more on trying to reasonably (but not too
much, this is still drm render stuff after all) standardize how it works
and how we'll extend it all. I think there are tons of good suggestions in
this thread on this topic already.

/me out
-Daniel


David

From: Daniel Vetter 
Sent: Wednesday, May 19, 2021 11:23 AM
To: Tvrtko Ursulin 
Cc: Daniel Stone ; jhubb...@nvidia.com ; nouv...@lists.freedesktop.org 
; Intel Graphics Development ; Mailing list - DRI 
developers ; Simon Ser ; Koenig, Christian 
; arit...@nvidia.com ; Nieto, David M 
Subject: Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:


On 18/05/2021 10:40, Tvrtko Ursulin wrote:

On 18/05/2021 10:16, Daniel Stone wrote:

Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:

I was just wondering if stat(2) and a chrdev major check would be a
solid criterion to more efficiently (compared to parsing the text
content) detect DRM files while walking procfs.

Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?

Per pid and per each open fd.

As said in the other thread, what bothers me a bit in this scheme is that
the cost of obtaining GPU usage scales based on non-GPU criteria.

For the use case of a top-like tool which shows all processes this is a
smaller additional cost, but then for a gpu-top like tool it is somewhat
higher.

To further expand, not only would the cost scale as pids multiplied by open
fds, but to detect which of the fds are DRM I see these three options:

1) Open and parse fdinfo.
2) Name based matching ie /dev/dri/.. something.
3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.


All sound quite sub-optimal to me.

Name based matching is probably the least evil on system resource usage
(Keeping the dentry cache too hot? Too many syscalls?), even though
fundamentally I don't think it is the right approach.

What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).


Does anyone have any feedback on the /proc/<pid>/gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch




Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-20 Thread Daniel Vetter
On Wed, May 19, 2021 at 11:17:24PM +, Nieto, David M wrote:
> [AMD Official Use Only]
> 
> Parsing over 550 processes for fdinfo takes between 40-100ms
> single-threaded on a 2GHz Skylake (IBRS) within a VM, using simple
> string comparisons and dirent parsing. And that is pretty much the
> worst case scenario; more optimized implementations should do better.

I think this is plenty ok, and if it's not you could probably make this
massively faster with io_uring for all the fs operations and whack a
parser-generator on top for real parsing speed.

So imo we shouldn't worry about algorithmic inefficiency of the fdinfo
approach at all, and focus more on trying to reasonably (but not too
much, this is still drm render stuff after all) standardize how it works
and how we'll extend it all. I think there are tons of good suggestions in
this thread on this topic already.

/me out
-Daniel

> 
> David
> 
> From: Daniel Vetter 
> Sent: Wednesday, May 19, 2021 11:23 AM
> To: Tvrtko Ursulin 
> Cc: Daniel Stone ; jhubb...@nvidia.com 
> ; nouv...@lists.freedesktop.org 
> ; Intel Graphics Development 
> ; Maling list - DRI developers 
> ; Simon Ser ; Koenig, 
> Christian ; arit...@nvidia.com 
> ; Nieto, David M 
> Subject: Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness
> 
> On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:
> >
> >
> > On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> > >
> > > On 18/05/2021 10:16, Daniel Stone wrote:
> > >> Hi,
> > >>
> > >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
> > >>> I was just wondering if stat(2) and a chrdev major check would be a
> > >>> solid criterion to more efficiently (compared to parsing the text
> > >>> content) detect DRM files while walking procfs.
> > >>
> > >> Maybe I'm missing something, but is the per-PID walk actually a
> > >> measurable performance issue rather than just a bit unpleasant?
> > >
> > > Per pid and per each open fd.
> > >
> > > As said in the other thread, what bothers me a bit in this scheme is that
> > > the cost of obtaining GPU usage scales based on non-GPU criteria.
> > >
> > > For the use case of a top-like tool which shows all processes this is a
> > > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > > higher.
> >
> > To further expand, not only would the cost scale as pids multiplied by
> > open fds, but to detect which of the fds are DRM I see these three options:
> >
> > 1) Open and parse fdinfo.
> > 2) Name based matching ie /dev/dri/.. something.
> > 3) Stat the symlink target and check for DRM major.
> 
> stat with symlink following should be plenty fast.
> 
> > All sound quite sub-optimal to me.
> >
> > Name based matching is probably the least evil on system resource usage
> > (Keeping the dentry cache too hot? Too many syscalls?), even though
> > fundamentally I don't think it is the right approach.
> >
> > What happens with dup(2) is another question.
> 
> We need benchmark numbers showing that on anything remotely realistic
> it's an actual problem. Until we've demonstrated it's a real problem
> we don't need to solve it.
> 
> E.g. top with any sorting enabled also parses way more than it
> displays on every update. It seems to be doing Just Fine (tm).
> 
> > Does anyone have any feedback on the /proc/<pid>/gpu idea at all?
> 
> When we know we have a problem to solve we can take a look at solutions.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-20 Thread Tvrtko Ursulin



On 19/05/2021 19:23, Daniel Vetter wrote:

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:



On 18/05/2021 10:40, Tvrtko Ursulin wrote:


On 18/05/2021 10:16, Daniel Stone wrote:

Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:

I was just wondering if stat(2) and a chrdev major check would be a
solid criterion to more efficiently (compared to parsing the text
content) detect DRM files while walking procfs.


Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?


Per pid and per each open fd.

As said in the other thread, what bothers me a bit in this scheme is that
the cost of obtaining GPU usage scales based on non-GPU criteria.

For the use case of a top-like tool which shows all processes this is a
smaller additional cost, but then for a gpu-top like tool it is somewhat
higher.


To further expand, not only would the cost scale as pids multiplied by open
fds, but to detect which of the fds are DRM I see these three options:

1) Open and parse fdinfo.
2) Name based matching ie /dev/dri/.. something.
3) Stat the symlink target and check for DRM major.


stat with symlink following should be plenty fast.


Maybe. I don't think my point about keeping the dentry cache needlessly 
hot is getting through at all. On my lightly loaded desktop:


 $ sudo lsof | wc -l
 599551

 $ sudo lsof | grep "/dev/dri/" | wc -l
 1965

It's going to look up ~600k pointless dentries in every iteration. Just 
to find a handful of DRM ones. Hard to say if that is better or worse 
than just parsing fdinfo text for all files. Will see.



All sound quite sub-optimal to me.

Name based matching is probably the least evil on system resource usage
(Keeping the dentry cache too hot? Too many syscalls?), even though
fundamentally I don't think it is the right approach.

What happens with dup(2) is another question.


We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.


The point about dup(2) is whether it is possible to distinguish the 
duplicated fds in fdinfo. If a DRM client dupes, and we find two 
fdinfos each saying the client is using 20% GPU, we don't want to add 
that up to 40%.
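
One conceivable userspace answer, sketched here under the assumption that
kcmp(2) is available (CONFIG_CHECKPOINT_RESTORE) and the observer has
ptrace-level access to the target: compare fds pairwise and count fds
sharing an open file description only once.

#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/kcmp.h>

/*
 * Returns true if fd1 and fd2 of process pid share one open file
 * description, e.g. because one was dup(2)ed from the other.
 * kcmp() returns 0 when both fds point at the same struct file.
 */
static bool fds_share_file(pid_t pid, int fd1, int fd2)
{
	return syscall(SYS_kcmp, pid, pid, KCMP_FILE,
		       (unsigned long)fd1, (unsigned long)fd2) == 0;
}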



E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).


Ha, perceptions differ. I see it using 4-5% while building the kernel on 
a Xeon server, which I find quite a lot. :)



Does anyone have any feedback on the /proc/<pid>/gpu idea at all?


When we know we have a problem to solve we can take a look at solutions.


Yes, I don't think it would be a problem to add a better solution later, 
so I'm happy to try the fdinfo route first. I am simply pointing out a 
fundamental design inefficiency. Even if machines are getting faster and 
faster, I don't think that should be an excuse to waste more and more 
under the hood, when a more efficient solution can be designed from the 
start.


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-19 Thread Nieto, David M
[AMD Official Use Only]

Parsing over 550 processes for fdinfo takes between 40-100ms
single-threaded on a 2GHz Skylake (IBRS) within a VM, using simple
string comparisons and dirent parsing. And that is pretty much the
worst case scenario; more optimized implementations should do better.

David

From: Daniel Vetter 
Sent: Wednesday, May 19, 2021 11:23 AM
To: Tvrtko Ursulin 
Cc: Daniel Stone ; jhubb...@nvidia.com 
; nouv...@lists.freedesktop.org 
; Intel Graphics Development 
; Maling list - DRI developers 
; Simon Ser ; Koenig, 
Christian ; arit...@nvidia.com ; 
Nieto, David M 
Subject: Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:
>
>
> On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> >
> > On 18/05/2021 10:16, Daniel Stone wrote:
> >> Hi,
> >>
> >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
> >>> I was just wondering if stat(2) and a chrdev major check would be a
> >>> solid criterion to more efficiently (compared to parsing the text
> >>> content) detect DRM files while walking procfs.
> >>
> >> Maybe I'm missing something, but is the per-PID walk actually a
> >> measurable performance issue rather than just a bit unpleasant?
> >
> > Per pid and per each open fd.
> >
> > As said in the other thread, what bothers me a bit in this scheme is that
> > the cost of obtaining GPU usage scales based on non-GPU criteria.
> >
> > For the use case of a top-like tool which shows all processes this is a
> > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > higher.
>
> To further expand, not only would the cost scale as pids multiplied by open
> fds, but to detect which of the fds are DRM I see these three options:
>
> 1) Open and parse fdinfo.
> 2) Name based matching ie /dev/dri/.. something.
> 3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.

> All sound quite sub-optimal to me.
>
> Name based matching is probably the least evil on system resource usage
> (Keeping the dentry cache too hot? Too many syscalls?), even though
> fundamentally I don't think it is the right approach.
>
> What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).

> Does anyone have any feedback on the /proc/<pid>/gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-19 Thread Daniel Vetter
On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin wrote:
>
>
> On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> >
> > On 18/05/2021 10:16, Daniel Stone wrote:
> >> Hi,
> >>
> >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
> >>> I was just wondering if stat(2) and a chrdev major check would be a
> >>> solid criterion to more efficiently (compared to parsing the text
> >>> content) detect DRM files while walking procfs.
> >>
> >> Maybe I'm missing something, but is the per-PID walk actually a
> >> measurable performance issue rather than just a bit unpleasant?
> >
> > Per pid and per each open fd.
> >
> > As said in the other thread, what bothers me a bit in this scheme is that
> > the cost of obtaining GPU usage scales based on non-GPU criteria.
> >
> > For the use case of a top-like tool which shows all processes this is a
> > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > higher.
>
> To further expand, not only would the cost scale as pids multiplied by open
> fds, but to detect which of the fds are DRM I see these three options:
>
> 1) Open and parse fdinfo.
> 2) Name based matching ie /dev/dri/.. something.
> 3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.

> All sound quite sub-optimal to me.
>
> Name based matching is probably the least evil on system resource usage
> (Keeping the dentry cache too hot? Too many syscalls?), even though
> fundamentally I don't think it is the right approach.
>
> What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).

> Does anyone have any feedback on the /proc/<pid>/gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch