Re: [Intel-gfx] [PATCH v3 0/4] Dynamic EU configuration of Slice/Subslice/EU.

2018-12-14 Thread Navik, Ankit P
Hi Joonas, 

On Fri, Dec 14, 2018 at 3:57 PM Joonas Lahtinen wrote:
> 
> Quoting Ankit Navik (2018-12-11 12:14:17)
> > drm/i915: Context aware user agnostic EU/Slice/Sub-slice control
> > within kernel
> >
> > Current GPU configuration code for i915 does not allow us to change the
> > EU/Slice/Sub-slice configuration dynamically. It's done only once, when
> > the context is created.
> >
> > While a particular graphics application is running, if we examine the
> > command requests from user space, we observe that the command density is
> > not consistent. This means there is scope to change the graphics
> > configuration dynamically even while a context is actively running. This
> > patch series proposes a solution: find the pending load for all active
> > contexts at a given time and, based on that, dynamically perform graphics
> > configuration for each context.
> >
> > We use a high-resolution timer (hrtimer) in the i915 kernel driver to
> > get a callback every few milliseconds (this timer value can be
> > configured through debugfs; the default is '0', indicating the timer is
> > disabled, i.e. the original system without any intervention). In the
> > timer callback, we examine the pending commands for a context in the
> > queue; essentially, we intercept them before they are executed by the
> > GPU and update the context with the required number of EUs.
> >
> > Two questions: how did we arrive at the right timer value, and what's
> > the right number of EUs? For the former, empirical data to achieve the
> > best performance at the least power was considered. For the latter, we
> > roughly categorized the number of EUs logically based on the platform.
> > We compare the number of pending commands with a particular threshold
> > and then set the number of EUs accordingly, updating the context. That
> > threshold is also based on experiments and findings. If the GPU is able
> > to catch up with the CPU, there are typically no pending commands and
> > the EU config remains unchanged. If there are more pending commands, we
> > reprogram the context with a higher number of EUs. Please note, we are
> > changing EUs even while the context is running, by examining pending
> > commands every 'x' milliseconds.
> 
> On the overall strategy. This will be unsuitable to be merged as a
> debugfs interface. So is the idea to evolve into a sysfs interface? As
> this seems to require tuning for each specific workload, I don't think
> that would scale too well if you consider a desktop distro?

We started initially with a debugfs interface. I have added a comment to
move the functionality to a sysfs interface. Yes, I will consider the
desktop distro case and share detailed results.
> 
> Also, there's the patch series to enable/disable subslices with VME
> hardware (the other dynamic slice shutdown/SSEU series) depending on the
> type of load being run. Certain workloads would hang the system if
> they're executed with the full subslice configuration. In that light, it
> would make more sense if the applications would be the ones reporting
> their optimal running configuration.

I think that series exposes RPCS for Gen 11 only, for the VME use case.
I have tested this patch on KBL (Gen 9). I will consider other Gen 9
platforms as well.

> 
> >
> > With this solution in place, on KBL-GT3 + Android we saw the following
> > PnP benefits; the power numbers mentioned here are system power.
> >
> > App / KPI            | % Power Benefit |
> > ---------------------|-----------------|
> > 3D Mark (Ice storm)  | 2.30%           |
> > TRex On screen       | 2.49%           |
> > TRex Off screen      | 1.32%           |
> > Manhattan On screen  | 3.11%           |
> > Manhattan Off screen | 0.89%           |
> > AnTuTu 6.1.4         | 3.42%           |
> > SynMark2             | 1.70%           |
> 
> Just to verify, these numbers are true while there's no negative effect on the
> benchmark scores?

Yes, there is no impact on the benchmark scores.
Thank you Joonas for your valuable feedback.

Regards, Ankit

> 
> Regards, Joonas
> 
> > Note - For KBL (GEN9) we cannot control at the sub-slice level; it was
> > always a constraint. We always controlled the number of EUs rather than
> > sub-slices/slices. We have also observed that GPU core residency
> > improves by 1.03%.
> >
> > Praveen Diwakar (4):
> >   drm/i915: Get active pending request for given context
> >   drm/i915: Update render power clock state configuration for given
> > context
> >   drm/i915: set optimum eu/slice/sub-slice configuration based on load
> > type
> >   drm/i915: Predictive governor to control eu/slice/subslice
> >
> >  drivers/gpu/drm/i915/i915_debugfs.c  | 90
> +++-
> >  drivers/gpu/drm/i915/i915_drv.c  |  4 ++
> >  drivers/gpu/drm/i915/i915_drv.h  |  9 
> >  drivers/gpu/drm/i915/i915_gem_context.c  | 23 
> > drivers/gpu/drm/i915/i915_gem_context.h  | 39 ++
> >  drivers/gpu/drm/i915/i915_request.c  |  2 +
> >  

Re: [Intel-gfx] [PATCH v3 0/4] Dynamic EU configuration of Slice/Subslice/EU.

2018-12-14 Thread Joonas Lahtinen
Quoting Ankit Navik (2018-12-11 12:14:17)
> drm/i915: Context aware user agnostic EU/Slice/Sub-slice control within kernel
> 
> Current GPU configuration code for i915 does not allow us to change the
> EU/Slice/Sub-slice configuration dynamically. It's done only once, when
> the context is created.
> 
> While a particular graphics application is running, if we examine the
> command requests from user space, we observe that the command density is
> not consistent. This means there is scope to change the graphics
> configuration dynamically even while a context is actively running. This
> patch series proposes a solution: find the pending load for all active
> contexts at a given time and, based on that, dynamically perform graphics
> configuration for each context.
> 
> We use a high-resolution timer (hrtimer) in the i915 kernel driver to
> get a callback every few milliseconds (this timer value can be
> configured through debugfs; the default is '0', indicating the timer is
> disabled, i.e. the original system without any intervention). In the
> timer callback, we examine the pending commands for a context in the
> queue; essentially, we intercept them before they are executed by the
> GPU and update the context with the required number of EUs.
> 
> Two questions: how did we arrive at the right timer value, and what's
> the right number of EUs? For the former, empirical data to achieve the
> best performance at the least power was considered. For the latter, we
> roughly categorized the number of EUs logically based on the platform.
> We compare the number of pending commands with a particular threshold
> and then set the number of EUs accordingly, updating the context. That
> threshold is also based on experiments and findings. If the GPU is able
> to catch up with the CPU, there are typically no pending commands and
> the EU config remains unchanged. If there are more pending commands, we
> reprogram the context with a higher number of EUs. Please note, we are
> changing EUs even while the context is running, by examining pending
> commands every 'x' milliseconds.

On the overall strategy. This will be unsuitable to be merged as a
debugfs interface. So is the idea to evolve into a sysfs interface? As
this seems to require tuning for each specific workload, I don't think
that would scale too well if you consider a desktop distro?

Also, there's the patch series to enable/disable subslices with VME
hardware (the other dynamic slice shutdown/SSEU series) depending on the
type of load being run. Certain workloads would hang the system if
they're executed with the full subslice configuration. In that light, it
would make more sense if the applications would be the ones reporting
their optimal running configuration.

> 
> With this solution in place, on KBL-GT3 + Android we saw the following
> PnP benefits; the power numbers mentioned here are system power.
> 
> App / KPI            | % Power Benefit |
> ---------------------|-----------------|
> 3D Mark (Ice storm)  | 2.30%           |
> TRex On screen       | 2.49%           |
> TRex Off screen      | 1.32%           |
> Manhattan On screen  | 3.11%           |
> Manhattan Off screen | 0.89%           |
> AnTuTu 6.1.4         | 3.42%           |
> SynMark2             | 1.70%           |

Just to verify, these numbers are true while there's no negative effect
on the benchmark scores?

Regards, Joonas

> Note - For KBL (GEN9) we cannot control at the sub-slice level; it was
> always a constraint. We always controlled the number of EUs rather than
> sub-slices/slices. We have also observed that GPU core residency
> improves by 1.03%.
> 
> Praveen Diwakar (4):
>   drm/i915: Get active pending request for given context
>   drm/i915: Update render power clock state configuration for given
> context
>   drm/i915: set optimum eu/slice/sub-slice configuration based on load
> type
>   drm/i915: Predictive governor to control eu/slice/subslice
> 
>  drivers/gpu/drm/i915/i915_debugfs.c  | 90 
> +++-
>  drivers/gpu/drm/i915/i915_drv.c  |  4 ++
>  drivers/gpu/drm/i915/i915_drv.h  |  9 
>  drivers/gpu/drm/i915/i915_gem_context.c  | 23 
>  drivers/gpu/drm/i915/i915_gem_context.h  | 39 ++
>  drivers/gpu/drm/i915/i915_request.c  |  2 +
>  drivers/gpu/drm/i915/intel_device_info.c | 47 -
>  drivers/gpu/drm/i915/intel_lrc.c | 16 +-
>  8 files changed, 226 insertions(+), 4 deletions(-)
> 
> -- 
> 2.7.4
> 
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v3 0/4] Dynamic EU configuration of Slice/Subslice/EU.

2018-12-11 Thread Navik, Ankit P
Hi Tvrtko, 

On Tue, Dec 11, 2018 at 5:18 PM Tvrtko Ursulin wrote:
> 
> 
> On 11/12/2018 10:14, Ankit Navik wrote:
> > drm/i915: Context aware user agnostic EU/Slice/Sub-slice control
> > within kernel
> >
> > Current GPU configuration code for i915 does not allow us to change the
> > EU/Slice/Sub-slice configuration dynamically. It's done only once, when
> > the context is created.
> >
> > While a particular graphics application is running, if we examine the
> > command requests from user space, we observe that the command density is
> > not consistent. This means there is scope to change the graphics
> > configuration dynamically even while a context is actively running. This
> > patch series proposes a solution: find the pending load for all active
> > contexts at a given time and, based on that, dynamically perform graphics
> > configuration for each context.
> >
> > We use a high-resolution timer (hrtimer) in the i915 kernel driver to
> > get a callback every few milliseconds (this timer value can be
> > configured through debugfs; the default is '0', indicating the timer is
> > disabled, i.e. the original system without any intervention). In the
> > timer callback, we examine the pending commands for a context in the
> > queue; essentially, we intercept them before they are executed by the
> > GPU and update the context with the required number of EUs.
> >
> > Two questions: how did we arrive at the right timer value, and what's
> > the right number of EUs? For the former, empirical data to achieve the
> > best performance at the least power was considered. For the latter, we
> > roughly categorized the number of EUs logically based on the platform.
> > We compare the number of pending commands with a particular threshold
> > and then set the number of EUs accordingly, updating the context. That
> > threshold is also based on experiments and findings. If the GPU is able
> > to catch up with the CPU, there are typically no pending commands and
> > the EU config remains unchanged. If there are more pending commands, we
> > reprogram the context with a higher number of EUs. Please note, we are
> > changing EUs even while the context is running, by examining pending
> > commands every 'x' milliseconds.
> >
> > With this solution in place, on KBL-GT3 + Android we saw the following
> > PnP benefits; the power numbers mentioned here are system power.
> >
> > App / KPI            | % Power Benefit |
> > ---------------------|-----------------|
> > 3D Mark (Ice storm)  | 2.30%           |
> > TRex On screen       | 2.49%           |
> > TRex Off screen      | 1.32%           |
> > Manhattan On screen  | 3.11%           |
> > Manhattan Off screen | 0.89%           |
> > AnTuTu 6.1.4         | 3.42%           |
> > SynMark2             | 1.70%           |
> 
> Is this the aggregated SynMark2 result, like all sub-tests averaged or 
> something?

Yes, it is the averaged result covering all the test cases.
> 
> I suggest you do want to list much more detail here: all individual
> sub-tests, different platforms, etc. The change you are proposing is
> quite big, and the amount of research that you must demonstrate for
> people to take this seriously has to be equally exhaustive.

I will verify and add more details covering various platforms and sub-tests.

Regards, Ankit 
> 
> Regards,
> 
> Tvrtko
> 
> >
> > Note - For KBL (GEN9) we cannot control at the sub-slice level; it was
> > always a constraint. We always controlled the number of EUs rather than
> > sub-slices/slices. We have also observed that GPU core residency
> > improves by 1.03%.
> >
> > Praveen Diwakar (4):
> >drm/i915: Get active pending request for given context
> >drm/i915: Update render power clock state configuration for given
> >  context
> >drm/i915: set optimum eu/slice/sub-slice configuration based on load
> >  type
> >drm/i915: Predictive governor to control eu/slice/subslice
> >
> >   drivers/gpu/drm/i915/i915_debugfs.c  | 90
> +++-
> >   drivers/gpu/drm/i915/i915_drv.c  |  4 ++
> >   drivers/gpu/drm/i915/i915_drv.h  |  9 
> >   drivers/gpu/drm/i915/i915_gem_context.c  | 23 
> >   drivers/gpu/drm/i915/i915_gem_context.h  | 39 ++
> >   drivers/gpu/drm/i915/i915_request.c  |  2 +
> >   drivers/gpu/drm/i915/intel_device_info.c | 47 -
> >   drivers/gpu/drm/i915/intel_lrc.c | 16 +-
> >   8 files changed, 226 insertions(+), 4 deletions(-)
> >


Re: [Intel-gfx] [PATCH v3 0/4] Dynamic EU configuration of Slice/Subslice/EU.

2018-12-11 Thread Tvrtko Ursulin


On 11/12/2018 10:14, Ankit Navik wrote:

drm/i915: Context aware user agnostic EU/Slice/Sub-slice control within kernel

Current GPU configuration code for i915 does not allow us to change the
EU/Slice/Sub-slice configuration dynamically. It's done only once, when
the context is created.

While a particular graphics application is running, if we examine the
command requests from user space, we observe that the command density is
not consistent. This means there is scope to change the graphics
configuration dynamically even while a context is actively running. This
patch series proposes a solution: find the pending load for all active
contexts at a given time and, based on that, dynamically perform graphics
configuration for each context.

We use a high-resolution timer (hrtimer) in the i915 kernel driver to get
a callback every few milliseconds (this timer value can be configured
through debugfs; the default is '0', indicating the timer is disabled,
i.e. the original system without any intervention). In the timer callback,
we examine the pending commands for a context in the queue; essentially,
we intercept them before they are executed by the GPU and update the
context with the required number of EUs.
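The decision made in that timer callback can be sketched in plain
user-space C. This is a simplified model, not the actual i915 code: the
`classify_pending_load` helper, its load buckets, and the threshold value
are all hypothetical, standing in for the empirically tuned per-platform
values the series uses.

```c
#include <assert.h>

/* Hypothetical load buckets; the real series defines its own categories. */
enum load_type { LOAD_TYPE_LOW, LOAD_TYPE_MEDIUM, LOAD_TYPE_HIGH };

/*
 * Model of the check done on each timer tick: look at how many requests
 * a context has pending and classify its load. Zero pending means the
 * GPU is keeping up with the CPU, so the EU config is left unchanged;
 * a deep queue means the context should be reprogrammed with more EUs.
 * The threshold (4 here) is illustrative only.
 */
static enum load_type classify_pending_load(unsigned int pending_requests)
{
	if (pending_requests == 0)
		return LOAD_TYPE_LOW;	/* GPU caught up: keep current config */
	if (pending_requests < 4)
		return LOAD_TYPE_MEDIUM;
	return LOAD_TYPE_HIGH;		/* deep queue: request more EUs */
}
```

In the driver this classification would run once per tick for every
active context, with the result fed into the context's power clock state
update.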

Two questions: how did we arrive at the right timer value, and what's the
right number of EUs? For the former, empirical data to achieve the best
performance at the least power was considered. For the latter, we roughly
categorized the number of EUs logically based on the platform. We compare
the number of pending commands with a particular threshold and then set
the number of EUs accordingly, updating the context. That threshold is
also based on experiments and findings. If the GPU is able to catch up
with the CPU, there are typically no pending commands and the EU config
remains unchanged. If there are more pending commands, we reprogram the
context with a higher number of EUs. Please note, we are changing EUs
even while the context is running, by examining pending commands every
'x' milliseconds.
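The "right number of EUs" step amounts to a small per-platform lookup.
The sketch below is illustrative only: the bucket values are invented
(KBL-GT3 exposes up to 48 EUs, but the series derives its actual buckets
from empirical power/performance data), and `eus_for_load` is a
hypothetical helper, not a function from the patches.

```c
#include <assert.h>

enum load_type { LOAD_TYPE_LOW, LOAD_TYPE_MEDIUM, LOAD_TYPE_HIGH };

/*
 * Illustrative mapping from load bucket to EU count for a KBL-GT3-like
 * part (48 EUs max). A light load runs with a reduced EU complement to
 * save power; a heavy load gets the full complement so the GPU can
 * catch up with the CPU.
 */
static unsigned int eus_for_load(enum load_type type)
{
	switch (type) {
	case LOAD_TYPE_LOW:
		return 16;	/* minimum bucket: save power */
	case LOAD_TYPE_MEDIUM:
		return 32;
	case LOAD_TYPE_HIGH:
	default:
		return 48;	/* full complement: maximize throughput */
	}
}
```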

With this solution in place, on KBL-GT3 + Android we saw the following
PnP benefits; the power numbers mentioned here are system power.

App / KPI            | % Power Benefit |
---------------------|-----------------|
3D Mark (Ice storm)  | 2.30%           |
TRex On screen       | 2.49%           |
TRex Off screen      | 1.32%           |
Manhattan On screen  | 3.11%           |
Manhattan Off screen | 0.89%           |
AnTuTu 6.1.4         | 3.42%           |
SynMark2             | 1.70%           |


Is this the aggregated SynMark2 result, like all sub-tests averaged or 
something?


I suggest you do want to list much more detail here: all individual
sub-tests, different platforms, etc. The change you are proposing is
quite big, and the amount of research that you must demonstrate for
people to take this seriously has to be equally exhaustive.


Regards,

Tvrtko



Note - For KBL (GEN9) we cannot control at the sub-slice level; it was
always a constraint. We always controlled the number of EUs rather than
sub-slices/slices. We have also observed that GPU core residency improves
by 1.03%.

Praveen Diwakar (4):
   drm/i915: Get active pending request for given context
   drm/i915: Update render power clock state configuration for given
 context
   drm/i915: set optimum eu/slice/sub-slice configuration based on load
 type
   drm/i915: Predictive governor to control eu/slice/subslice

  drivers/gpu/drm/i915/i915_debugfs.c  | 90 +++-
  drivers/gpu/drm/i915/i915_drv.c  |  4 ++
  drivers/gpu/drm/i915/i915_drv.h  |  9 
  drivers/gpu/drm/i915/i915_gem_context.c  | 23 
  drivers/gpu/drm/i915/i915_gem_context.h  | 39 ++
  drivers/gpu/drm/i915/i915_request.c  |  2 +
  drivers/gpu/drm/i915/intel_device_info.c | 47 -
  drivers/gpu/drm/i915/intel_lrc.c | 16 +-
  8 files changed, 226 insertions(+), 4 deletions(-)

