Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Tvrtko Ursulin



On 09/09/16 14:45, Chris Wilson wrote:

On Fri, Sep 09, 2016 at 09:32:50AM +0100, Tvrtko Ursulin wrote:


On 08/09/16 17:40, Chris Wilson wrote:

On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.


I have the same patch! But I called it out, for historical reasons.


Yes I know we talked about it in the past but I did not think you
will find time to actually write it amongst all the other things.


Oh, except mine uses out[0]...out[N] because gcc prefers that over
*out++ = ...


It copes just fine with the latter here, for example:

*rbuf++ = cmd;
*rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
*rbuf++ = 0; /* upper addr */
*rbuf++ = 0; /* value */

Is:

  3e9:   89 10                   mov    %edx,(%rax)
  3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
  3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
  3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)


Last time Dave suggested using something like

i915_gem_request_emit(req, (struct cmd_packet){ dw0, dw1, dw2 });

I tried mocking something up, but just found gcc was constructing the
struct on the stack and then copying across, and generating far more
code than the sequence above. Worth seeing if that is better (or if my
mockup was just bad).


Not sure that I like that. It would be a bit ugly in cases where batches 
are built dynamically, no? Perhaps I am misunderstanding the idea?
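
To make the two shapes under discussion concrete, here is a minimal userspace sketch. The layout of struct cmd_packet (three dwords) and both helper names are assumptions for illustration only; the thread does not show Dave's or Chris's actual mock-up. The first helper copies a fixed-size packet passed by value, the second shows the kind of dynamically sized sequence that does not fit a fixed struct.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size packet; the real proposal's layout is not shown in the thread. */
struct cmd_packet {
	uint32_t dw[3];
};

/* Fixed-size emit: the packet is passed by value and copied into the buffer. */
static inline uint32_t *emit_packet(uint32_t *out, struct cmd_packet pkt)
{
	memcpy(out, pkt.dw, sizeof(pkt.dw));
	return out + 3;
}

/* Dynamically sized emit: the length depends on a runtime count, so a
 * fixed struct does not map onto it and a running pointer is simpler. */
static inline uint32_t *emit_pairs(uint32_t *out, const uint32_t *regs,
				   uint32_t value, unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++) {
		*out++ = regs[i];
		*out++ = value;
	}
	return out;
}
```

A call site for the fixed form would look like emit_packet(out, (struct cmd_packet){ { dw0, dw1, dw2 } }). Whether the compiler builds the temporary on the stack first, as reported above, depends on the compiler and flags and is not something this sketch settles.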


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Tvrtko Ursulin


On 09/09/16 14:20, Dave Gordon wrote:

On 09/09/16 09:32, Tvrtko Ursulin wrote:


On 08/09/16 17:40, Chris Wilson wrote:

On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.


I have the same patch! But I called it out, for historical reasons.


Yes I know we talked about it in the past but I did not think you will
find time to actually write it amongst all the other things.


Oh, except mine uses out[0]...out[N] because gcc prefers that over
*out++ = ...


It copes just fine with the latter here, for example:

*rbuf++ = cmd;
*rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
*rbuf++ = 0; /* upper addr */
*rbuf++ = 0; /* value */

Is:

 3e9:   89 10                   mov    %edx,(%rax)
 3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
 3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
 3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)

And for the record, before this patch, with intel_ring_emit:

 53a:   8b 53 3c                mov    0x3c(%rbx),%edx
 53d:   48 8b 4b 08             mov    0x8(%rbx),%rcx
 541:   89 04 11                mov    %eax,(%rcx,%rdx,1)

 544:   8b 43 3c                mov    0x3c(%rbx),%eax
 547:   48 8b 53 08             mov    0x8(%rbx),%rdx
 54b:   83 c0 04                add    $0x4,%eax
 54e:   89 43 3c                mov    %eax,0x3c(%rbx)
 551:   c7 04 02 04 01 00 00    movl   $0x104,(%rdx,%rax,1)

 558:   8b 43 3c                mov    0x3c(%rbx),%eax
 55b:   48 8b 53 08             mov    0x8(%rbx),%rdx
 55f:   83 c0 04                add    $0x4,%eax
 562:   89 43 3c                mov    %eax,0x3c(%rbx)
 565:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

 56c:   8b 43 3c                mov    0x3c(%rbx),%eax
 56f:   48 8b 53 08             mov    0x8(%rbx),%rdx
 573:   83 c0 04                add    $0x4,%eax
 576:   89 43 3c                mov    %eax,0x3c(%rbx)
 579:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

Yuck :) At least they are not function calls to iowrite any more. :)


Curious that the inlining wasn't doing a better job, though. For
example, it's not preserving %eax as a local cache of 0x3c(%rbx).


Yeah I don't know. Even by employing the restrict keyword in various 
ways I couldn't make it do a better job.



intel_ring_emit was preventing the compiler from optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.

It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.


Or you could have intel_ring_advance() take the updated local and use
that to update the ring->tail. It could then check that you hadn't
exceeded your allocation, OR that you had used exactly as much as you'd
allocated. I'm sure I had a version that did that, long ago.


Sounds good to me.


Useless instruction removal amounts to approximately
2384 bytes of saved text on my build.

Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.


It does show up in perf.


Cool.


Patch is not fully polished, but it compiles and runs
on Gen9 at least.

Signed-off-by: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/i915_gem_context.c|  62 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  57 ++--
  drivers/gpu/drm/i915/intel_display.c   | 113 ---
  drivers/gpu/drm/i915/intel_lrc.c   | 223 +++---
  drivers/gpu/drm/i915/intel_mocs.c  |  43 +--
  drivers/gpu/drm/i915/intel_overlay.c   |  69 ++---
  drivers/gpu/drm/i915/intel_ringbuffer.c| 480 +++--
  drivers/gpu/drm/i915/intel_ringbuffer.h|  19 +-
  9 files changed, 555 insertions(+), 538 deletions(-)


Hmm, mine is bigger.

  drivers/gpu/drm/i915/i915_gem_context.c|  85 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  62 +--
  drivers/gpu/drm/i915/i915_gem_request.c| 135 -
  drivers/gpu/drm/i915/i915_gem_request.h|   2 +
  drivers/gpu/drm/i915/intel_display.c   | 133 +++--
  drivers/gpu/drm/i915/intel_lrc.c   | 188 ---
  drivers/gpu/drm/i915/intel_lrc.h   |   2 -
  drivers/gpu/drm/i915/intel_mocs.c  |  50 +-
  drivers/gpu/drm/i915/intel_overlay.c   |  77 ++-
  drivers/gpu/drm/i915/intel_ringbuffer.c| 762 -
  

Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Chris Wilson
On Fri, Sep 09, 2016 at 09:32:50AM +0100, Tvrtko Ursulin wrote:
> 
> On 08/09/16 17:40, Chris Wilson wrote:
> >On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin 
> >>
> >>This removes the usage of intel_ring_emit in favour of
> >>directly writing to the ring buffer.
> >
> >I have the same patch! But I called it out, for historical reasons.
> 
> Yes I know we talked about it in the past but I did not think you
> will find time to actually write it amongst all the other things.
> 
> >Oh, except mine uses out[0]...out[N] because gcc prefers that over
> >*out++ = ...
> 
> It copes just fine with the latter here, for example:
> 
>   *rbuf++ = cmd;
>   *rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
>   *rbuf++ = 0; /* upper addr */
>   *rbuf++ = 0; /* value */
> 
> Is:
> 
>  3e9:   89 10                   mov    %edx,(%rax)
>  3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
>  3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
>  3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)

Last time Dave suggested using something like

i915_gem_request_emit(req, (struct cmd_packet){ dw0, dw1, dw2 });

I tried mocking something up, but just found gcc was constructing the
struct on the stack and then copying across, and generating far more
code than the sequence above. Worth seeing if that is better (or if my
mockup was just bad).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Chris Wilson
On Fri, Sep 09, 2016 at 09:32:50AM +0100, Tvrtko Ursulin wrote:
> 
> On 08/09/16 17:40, Chris Wilson wrote:
> >On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin 
> >>
> >>This removes the usage of intel_ring_emit in favour of
> >>directly writing to the ring buffer.
> >
> >I have the same patch! But I called it out, for historical reasons.
> 
> Yes I know we talked about it in the past but I did not think you
> will find time to actually write it amongst all the other things.
> 
> >Oh, except mine uses out[0]...out[N] because gcc prefers that over
> >*out++ = ...
> 
> It copes just fine with the latter here, for example:
> 
>   *rbuf++ = cmd;
>   *rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
>   *rbuf++ = 0; /* upper addr */
>   *rbuf++ = 0; /* value */
> 
> Is:
> 
>  3e9:   89 10                   mov    %edx,(%rax)
>  3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
>  3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
>  3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)

Great. Last time we had a conversation about this, and when we looked at
constructing batchbuffers in userspace, gcc was still generating two
instructions (*ptr followed by ptr++) rather than emitting the mov to a
fixed offset for that sequence.
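
For reference, the two spellings being compared look like this in a minimal sketch. The function names are made up for illustration, and 0x104 is simply the immediate visible in the disassembly quoted above. Both write the same four dwords; any difference in generated code would come from the compiler version and flags, which this sketch does not demonstrate.

```c
#include <stdint.h>

/* Post-increment form, as in the patch under review. */
static inline uint32_t *emit_postinc(uint32_t *out, uint32_t cmd)
{
	*out++ = cmd;
	*out++ = 0x104; /* address | flags, value taken from the quoted disassembly */
	*out++ = 0;     /* upper addr */
	*out++ = 0;     /* value */
	return out;
}

/* Indexed form: fixed offsets from one base pointer, advanced once at the end. */
static inline uint32_t *emit_indexed(uint32_t *out, uint32_t cmd)
{
	out[0] = cmd;
	out[1] = 0x104; /* address | flags, value taken from the quoted disassembly */
	out[2] = 0;     /* upper addr */
	out[3] = 0;     /* value */
	return out + 4;
}
```

Compiling both with the kernel's usual optimisation level and comparing the objdump output would show whether a given gcc still treats them differently.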

> >plus an earlier
> >
> >  drivers/gpu/drm/i915/i915_gem_request.c |  26 ++---
> >  drivers/gpu/drm/i915/intel_lrc.c| 121 ---
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 168 +++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  10 +-
> >  4 files changed, 112 insertions(+), 213 deletions(-)
> >
> >since I wanted parts of it for emitting timelines.
> 
> Ok what do you want to do?

I have plans to use that particular patch soon, but updating
intel_ring_begin() itself is a long way down my list. Given that you have
a patch ready, let's keep going. I'm just curious as to what I did
differently to trim off the extra lines (probably intel_ring_advance()). 
The other thing I did was to relax the restriction to only emit in qword 
aligned packets (by fixing up the tail for qword alignment on sealing the
request). Also, I would rather the function be expressed as operating on
the request, i915_gem_request_emit() was my choice.
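
A minimal sketch of what "fixing up the tail on sealing" could look like, assuming the tail is counted in dwords and that padding with a single no-op dword is acceptable. Both are assumptions: the thread does not show the actual implementation, and NOOP_DWORD and seal_qword_aligned are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical no-op command value used as padding in this sketch. */
#define NOOP_DWORD 0u

/*
 * Callers may emit any number of dwords. When the request is sealed,
 * pad with one no-op if the tail (in dwords) is odd, so the final tail
 * is qword (two dword) aligned. Returns the aligned tail.
 */
static inline uint32_t seal_qword_aligned(uint32_t *ring, uint32_t tail)
{
	if (tail & 1)
		ring[tail++] = NOOP_DWORD;
	return tail;
}
```

With this shape the alignment rule lives in one place instead of every emitter having to round its packet up to an even number of dwords.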
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Dave Gordon

On 09/09/16 09:32, Tvrtko Ursulin wrote:


On 08/09/16 17:40, Chris Wilson wrote:

On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.


I have the same patch! But I called it out, for historical reasons.


Yes I know we talked about it in the past but I did not think you will
find time to actually write it amongst all the other things.


Oh, except mine uses out[0]...out[N] because gcc prefers that over
*out++ = ...


It copes just fine with the latter here, for example:

*rbuf++ = cmd;
*rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
*rbuf++ = 0; /* upper addr */
*rbuf++ = 0; /* value */

Is:

 3e9:   89 10                   mov    %edx,(%rax)
 3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
 3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
 3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)

And for the record, before this patch, with intel_ring_emit:

 53a:   8b 53 3c                mov    0x3c(%rbx),%edx
 53d:   48 8b 4b 08             mov    0x8(%rbx),%rcx
 541:   89 04 11                mov    %eax,(%rcx,%rdx,1)

 544:   8b 43 3c                mov    0x3c(%rbx),%eax
 547:   48 8b 53 08             mov    0x8(%rbx),%rdx
 54b:   83 c0 04                add    $0x4,%eax
 54e:   89 43 3c                mov    %eax,0x3c(%rbx)
 551:   c7 04 02 04 01 00 00    movl   $0x104,(%rdx,%rax,1)

 558:   8b 43 3c                mov    0x3c(%rbx),%eax
 55b:   48 8b 53 08             mov    0x8(%rbx),%rdx
 55f:   83 c0 04                add    $0x4,%eax
 562:   89 43 3c                mov    %eax,0x3c(%rbx)
 565:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

 56c:   8b 43 3c                mov    0x3c(%rbx),%eax
 56f:   48 8b 53 08             mov    0x8(%rbx),%rdx
 573:   83 c0 04                add    $0x4,%eax
 576:   89 43 3c                mov    %eax,0x3c(%rbx)
 579:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

Yuck :) At least they are not function calls to iowrite any more. :)


Curious that the inlining wasn't doing a better job, though. For 
example, it's not preserving %eax as a local cache of 0x3c(%rbx).



intel_ring_emit was preventing the compiler from optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.

It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.


Or you could have intel_ring_advance() take the updated local and use 
that to update the ring->tail. It could then check that you hadn't 
exceeded your allocation, OR that you had used exactly as much as you'd 
allocated. I'm sure I had a version that did that, long ago.
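
A rough userspace sketch of that idea, assuming a simplified ring that counts in dwords and ignores wrap-around. The struct and function names here are hypothetical and this is not the older version mentioned above. ring_begin() records where the reservation ends, and ring_advance() derives the new tail from the caller's pointer and rejects any mismatch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified ring; all positions are in dwords. */
struct ring {
	uint32_t *vaddr;        /* start of the ring buffer */
	uint32_t tail;          /* committed write position */
	uint32_t size;          /* total ring size */
	uint32_t reserved_end;  /* end of the current reservation */
};

/* Reserve n dwords and return where to write them, or NULL if they do not fit. */
static inline uint32_t *ring_begin(struct ring *ring, uint32_t n)
{
	if (ring->size - ring->tail < n)
		return NULL; /* no wrap handling in this sketch */

	ring->reserved_end = ring->tail + n;
	return ring->vaddr + ring->tail;
}

/*
 * Commit the writes. 'out' is the caller's pointer after its last write.
 * Returns 0 if exactly the reserved amount was used, -1 otherwise, in
 * which case the tail is left untouched.
 */
static inline int ring_advance(struct ring *ring, const uint32_t *out)
{
	uint32_t new_tail = (uint32_t)(out - ring->vaddr);

	if (new_tail != ring->reserved_end)
		return -1;

	ring->tail = new_tail;
	return 0;
}
```

The looser variant, only checking that the allocation was not exceeded, would compare with new_tail > ring->reserved_end instead of the exact match used here.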



Useless instruction removal amounts to approximately
2384 bytes of saved text on my build.

Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.


It does show up in perf.


Cool.


Patch is not fully polished, but it compiles and runs
on Gen9 at least.

Signed-off-by: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/i915_gem_context.c|  62 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  57 ++--
  drivers/gpu/drm/i915/intel_display.c   | 113 ---
  drivers/gpu/drm/i915/intel_lrc.c   | 223 +++---
  drivers/gpu/drm/i915/intel_mocs.c  |  43 +--
  drivers/gpu/drm/i915/intel_overlay.c   |  69 ++---
  drivers/gpu/drm/i915/intel_ringbuffer.c| 480 +++--
  drivers/gpu/drm/i915/intel_ringbuffer.h|  19 +-
  9 files changed, 555 insertions(+), 538 deletions(-)


Hmm, mine is bigger.

  drivers/gpu/drm/i915/i915_gem_context.c|  85 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  62 +--
  drivers/gpu/drm/i915/i915_gem_request.c| 135 -
  drivers/gpu/drm/i915/i915_gem_request.h|   2 +
  drivers/gpu/drm/i915/intel_display.c   | 133 +++--
  drivers/gpu/drm/i915/intel_lrc.c   | 188 ---
  drivers/gpu/drm/i915/intel_lrc.h   |   2 -
  drivers/gpu/drm/i915/intel_mocs.c  |  50 +-
  drivers/gpu/drm/i915/intel_overlay.c   |  77 ++-
  drivers/gpu/drm/i915/intel_ringbuffer.c| 762 -
  drivers/gpu/drm/i915/intel_ringbuffer.h|  36 +-
  12 files changed, 721 insertions(+), 848 deletions(-)

(this includes moving the intel_ring_begin to i915_gem_request)

plus 

Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-09 Thread Tvrtko Ursulin


On 08/09/16 17:40, Chris Wilson wrote:

On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.


I have the same patch! But I called it out, for historical reasons.


Yes I know we talked about it in the past but I did not think you will 
find time to actually write it amongst all the other things.



Oh, except mine uses out[0]...out[N] because gcc prefers that over
*out++ = ...


It copes just fine with the latter here, for example:

*rbuf++ = cmd;
*rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
*rbuf++ = 0; /* upper addr */
*rbuf++ = 0; /* value */

Is:

 3e9:   89 10                   mov    %edx,(%rax)
 3eb:   c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
 3f2:   c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
 3f9:   c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)

And for the record, before this patch, with intel_ring_emit:

 53a:   8b 53 3c                mov    0x3c(%rbx),%edx
 53d:   48 8b 4b 08             mov    0x8(%rbx),%rcx
 541:   89 04 11                mov    %eax,(%rcx,%rdx,1)
 544:   8b 43 3c                mov    0x3c(%rbx),%eax
 547:   48 8b 53 08             mov    0x8(%rbx),%rdx
 54b:   83 c0 04                add    $0x4,%eax
 54e:   89 43 3c                mov    %eax,0x3c(%rbx)
 551:   c7 04 02 04 01 00 00    movl   $0x104,(%rdx,%rax,1)
 558:   8b 43 3c                mov    0x3c(%rbx),%eax
 55b:   48 8b 53 08             mov    0x8(%rbx),%rdx
 55f:   83 c0 04                add    $0x4,%eax
 562:   89 43 3c                mov    %eax,0x3c(%rbx)
 565:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)
 56c:   8b 43 3c                mov    0x3c(%rbx),%eax
 56f:   48 8b 53 08             mov    0x8(%rbx),%rdx
 573:   83 c0 04                add    $0x4,%eax
 576:   89 43 3c                mov    %eax,0x3c(%rbx)
 579:   c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

Yuck :) At least they are not function calls to iowrite any more. :)


intel_ring_emit was preventing the compiler from optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.

It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.

Useless instruction removal amounts to approximately
2384 bytes of saved text on my build.

Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.


It does show up in perf.


Cool.


Patch is not fully polished, but it compiles and runs
on Gen9 at least.

Signed-off-by: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/i915_gem_context.c|  62 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  57 ++--
  drivers/gpu/drm/i915/intel_display.c   | 113 ---
  drivers/gpu/drm/i915/intel_lrc.c   | 223 +++---
  drivers/gpu/drm/i915/intel_mocs.c  |  43 +--
  drivers/gpu/drm/i915/intel_overlay.c   |  69 ++---
  drivers/gpu/drm/i915/intel_ringbuffer.c| 480 +++--
  drivers/gpu/drm/i915/intel_ringbuffer.h|  19 +-
  9 files changed, 555 insertions(+), 538 deletions(-)


Hmm, mine is bigger.

  drivers/gpu/drm/i915/i915_gem_context.c|  85 ++--
  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
  drivers/gpu/drm/i915/i915_gem_gtt.c|  62 +--
  drivers/gpu/drm/i915/i915_gem_request.c| 135 -
  drivers/gpu/drm/i915/i915_gem_request.h|   2 +
  drivers/gpu/drm/i915/intel_display.c   | 133 +++--
  drivers/gpu/drm/i915/intel_lrc.c   | 188 ---
  drivers/gpu/drm/i915/intel_lrc.h   |   2 -
  drivers/gpu/drm/i915/intel_mocs.c  |  50 +-
  drivers/gpu/drm/i915/intel_overlay.c   |  77 ++-
  drivers/gpu/drm/i915/intel_ringbuffer.c| 762 -
  drivers/gpu/drm/i915/intel_ringbuffer.h|  36 +-
  12 files changed, 721 insertions(+), 848 deletions(-)

(this includes moving the intel_ring_begin to i915_gem_request)

plus an earlier

  drivers/gpu/drm/i915/i915_gem_request.c |  26 ++---
  drivers/gpu/drm/i915/intel_lrc.c| 121 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c | 168 +++-
  drivers/gpu/drm/i915/intel_ringbuffer.h |  10 +-
  4 files changed, 112 insertions(+), 213 deletions(-)

since I wanted parts of it for emitting timelines.


Ok what do you want to do?

Regards,

Tvrtko



Re: [Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-08 Thread Chris Wilson
On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> This removes the usage of intel_ring_emit in favour of
> directly writing to the ring buffer.

I have the same patch! But I called it out, for historical reasons.

Oh, except mine uses out[0]...out[N] because gcc prefers that over
*out++ = ...

> intel_ring_emit was preventing the compiler from optimising
> fetch and increment of the current ring buffer pointer and
> therefore generating very verbose code for every write.
> 
> It had no useful purpose since all ringbuffer operations
> are started and ended with intel_ring_begin and
> intel_ring_advance respectively, with no bail out in the
> middle possible, so it is fine to increment the tail in
> intel_ring_begin and let the code manage the pointer
> itself.
> 
> Useless instruction removal amounts to approximately
> 2384 bytes of saved text on my build.
> 
> Not sure if this has any measurable performance
> implications but executing a ton of useless instructions
> on fast paths cannot be good.

It does show up in perf.
 
> Patch is not fully polished, but it compiles and runs
> on Gen9 at least.
> 
> Signed-off-by: Tvrtko Ursulin 
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c|  62 ++--
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
>  drivers/gpu/drm/i915/i915_gem_gtt.c|  57 ++--
>  drivers/gpu/drm/i915/intel_display.c   | 113 ---
>  drivers/gpu/drm/i915/intel_lrc.c   | 223 +++---
>  drivers/gpu/drm/i915/intel_mocs.c  |  43 +--
>  drivers/gpu/drm/i915/intel_overlay.c   |  69 ++---
>  drivers/gpu/drm/i915/intel_ringbuffer.c| 480 +++--
>  drivers/gpu/drm/i915/intel_ringbuffer.h|  19 +-
>  9 files changed, 555 insertions(+), 538 deletions(-)

Hmm, mine is bigger.

 drivers/gpu/drm/i915/i915_gem_context.c|  85 ++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c|  62 +--
 drivers/gpu/drm/i915/i915_gem_request.c| 135 -
 drivers/gpu/drm/i915/i915_gem_request.h|   2 +
 drivers/gpu/drm/i915/intel_display.c   | 133 +++--
 drivers/gpu/drm/i915/intel_lrc.c   | 188 ---
 drivers/gpu/drm/i915/intel_lrc.h   |   2 -
 drivers/gpu/drm/i915/intel_mocs.c  |  50 +-
 drivers/gpu/drm/i915/intel_overlay.c   |  77 ++-
 drivers/gpu/drm/i915/intel_ringbuffer.c| 762 -
 drivers/gpu/drm/i915/intel_ringbuffer.h|  36 +-
 12 files changed, 721 insertions(+), 848 deletions(-)

(this includes moving the intel_ring_begin to i915_gem_request)

plus an earlier

 drivers/gpu/drm/i915/i915_gem_request.c |  26 ++---
 drivers/gpu/drm/i915/intel_lrc.c| 121 ---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 168 +++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  10 +-
 4 files changed, 112 insertions(+), 213 deletions(-)

since I wanted parts of it for emitting timelines.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


[Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

2016-09-08 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

This removes the usage of intel_ring_emit in favour of
directly writing to the ring buffer.

intel_ring_emit was preventing the compiler from optimising
fetch and increment of the current ring buffer pointer and
therefore generating very verbose code for every write.

It had no useful purpose since all ringbuffer operations
are started and ended with intel_ring_begin and
intel_ring_advance respectively, with no bail out in the
middle possible, so it is fine to increment the tail in
intel_ring_begin and let the code manage the pointer
itself.
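
As a standalone illustration of the change described above, here is a minimal userspace sketch of the two styles. struct ring, ring_emit(), ring_begin() and FLUSH_ADDR_FLAGS are simplified, hypothetical stand-ins and not the driver's real definitions; the sketch also ignores ring wrap-around. The old style updates the tail inside the structure on every dword, the new style moves the tail once and writes through a local pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's ring structure. */
struct ring {
	uint32_t *vaddr; /* CPU mapping of the ring buffer */
	uint32_t tail;   /* next write position, in bytes */
	uint32_t size;   /* ring size, in bytes */
};

/* Placeholder for the address | flags dword; 0x104 is the immediate
 * that shows up in disassembly of this sequence. */
#define FLUSH_ADDR_FLAGS 0x104u

/* Old style: every call reads and writes ring->tail through the pointer. */
static inline void ring_emit(struct ring *ring, uint32_t data)
{
	ring->vaddr[ring->tail / 4] = data;
	ring->tail += 4;
}

/* New style: reserve once, move the tail up front, hand back a pointer. */
static inline uint32_t *ring_begin(struct ring *ring, uint32_t num_dwords)
{
	uint32_t *out;

	if ((ring->size - ring->tail) / 4 < num_dwords)
		return NULL; /* no wrap handling in this sketch */

	out = ring->vaddr + ring->tail / 4;
	ring->tail += num_dwords * 4;
	return out;
}

static inline void emit_flush_old(struct ring *ring, uint32_t cmd)
{
	ring_emit(ring, cmd);
	ring_emit(ring, FLUSH_ADDR_FLAGS);
	ring_emit(ring, 0); /* upper addr */
	ring_emit(ring, 0); /* value */
}

static inline int emit_flush_new(struct ring *ring, uint32_t cmd)
{
	uint32_t *rbuf = ring_begin(ring, 4);

	if (!rbuf)
		return -1;

	*rbuf++ = cmd;
	*rbuf++ = FLUSH_ADDR_FLAGS;
	*rbuf++ = 0; /* upper addr */
	*rbuf++ = 0; /* value */
	return 0;
}
```

Both helpers leave the same four dwords and the same tail behind. One plausible reason the old form compiles poorly (an assumption, not something measured here) is that the compiler cannot prove the stores into the buffer leave the ring structure untouched, so it reloads the tail and base for every dword.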

Useless instruction removal amounts to approximately
2384 bytes of saved text on my build.

Not sure if this has any measurable performance
implications but executing a ton of useless instructions
on fast paths cannot be good.

Patch is not fully polished, but it compiles and runs
on Gen9 at least.

Signed-off-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/i915_gem_context.c|  62 ++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c|  57 ++--
 drivers/gpu/drm/i915/intel_display.c   | 113 ---
 drivers/gpu/drm/i915/intel_lrc.c   | 223 +++---
 drivers/gpu/drm/i915/intel_mocs.c  |  43 +--
 drivers/gpu/drm/i915/intel_overlay.c   |  69 ++---
 drivers/gpu/drm/i915/intel_ringbuffer.c| 480 +++--
 drivers/gpu/drm/i915/intel_ringbuffer.h|  19 +-
 9 files changed, 555 insertions(+), 538 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 35950ee46a1d..c9b61953f23b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -577,7 +577,6 @@ static inline int
 mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
 {
struct drm_i915_private *dev_priv = req->i915;
-   struct intel_ring *ring = req->ring;
struct intel_engine_cs *engine = req->engine;
u32 flags = hw_flags | MI_MM_SPACE_GTT;
const int num_rings =
@@ -585,6 +584,7 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
i915.semaphores ?
INTEL_INFO(dev_priv)->num_rings - 1 :
0;
+   u32 *rbuf;
int len, ret;
 
/* w/a: If Flush TLB Invalidation Mode is enabled, driver must do a TLB
@@ -609,70 +609,61 @@ mi_set_context(struct drm_i915_gem_request *req, u32 hw_flags)
if (INTEL_GEN(dev_priv) >= 7)
len += 2 + (num_rings ? 4*num_rings + 6 : 0);
 
-   ret = intel_ring_begin(req, len);
+   ret = intel_ring_begin(req, len, &rbuf);
if (ret)
return ret;
 
/* WaProgramMiArbOnOffAroundMiSetContext:ivb,vlv,hsw,bdw,chv */
if (INTEL_GEN(dev_priv) >= 7) {
-   intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_DISABLE);
+   *rbuf++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
if (num_rings) {
struct intel_engine_cs *signaller;
 
-   intel_ring_emit(ring,
-   MI_LOAD_REGISTER_IMM(num_rings));
+   *rbuf++ = MI_LOAD_REGISTER_IMM(num_rings);
for_each_engine(signaller, dev_priv) {
if (signaller == engine)
continue;
 
-   intel_ring_emit_reg(ring,
-   RING_PSMI_CTL(signaller->mmio_base));
-   intel_ring_emit(ring,
-   _MASKED_BIT_ENABLE(GEN6_PSMI_SLEEP_MSG_DISABLE));
+   *rbuf++ = RING_PSMI_CTL(signaller->mmio_base).reg;
+   *rbuf++ = _MASKED_BIT_ENABLE(GEN6_PSMI_SLEEP_MSG_DISABLE);
}
}
}
 
-   intel_ring_emit(ring, MI_NOOP);
-   intel_ring_emit(ring, MI_SET_CONTEXT);
-   intel_ring_emit(ring,
-   i915_ggtt_offset(req->ctx->engine[RCS].state) | flags);
+   *rbuf++ = MI_NOOP;
+   *rbuf++ = MI_SET_CONTEXT;
+   *rbuf++ = i915_ggtt_offset(req->ctx->engine[RCS].state) | flags;
/*
 * w/a: MI_SET_CONTEXT must always be followed by MI_NOOP
 * WaMiSetContext_Hang:snb,ivb,vlv
 */
-   intel_ring_emit(ring, MI_NOOP);
+   *rbuf++ = MI_NOOP;
 
if (INTEL_GEN(dev_priv) >= 7) {
if (num_rings) {
struct intel_engine_cs *signaller;
i915_reg_t last_reg = {}; /* keep gcc quiet */
 
-   intel_ring_emit(ring,
-   MI_LOAD_REGISTER_IMM(num_rings));
+   *rbuf++ = MI_LOAD_REGISTER_IMM(num_rings);
for_each_engine(signaller, dev_priv) {
if