Re: [Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

2020-01-13 Thread Luke Kenneth Casson Leighton
On Tuesday, January 14, 2020, Jacob Lifshay 
wrote:

> On Mon, Jan 13, 2020 at 9:39 AM Jason Ekstrand 
> wrote:
> >
> > On Mon, Jan 13, 2020 at 11:27 AM Luke Kenneth Casson Leighton <
> l...@lkcl.net> wrote:
> >> jason i'd be interested to hear your thoughts on what jacob wrote, does
> it alleviate your concerns, (we're not designing hardware specifically
> around vec2/3/4, it simply has that capability).
> >
> >
> > Not at all.  If you just want a SW renderer that runs on RISC-V, feel
> free to write one.


as we know, it would be embarrassingly low performance, not commercially
viable, therefore, logically, we can rule that out as an option to pursue :)

i don't know if you're aware of Jeff Bush's work on Nyuzi? he set out to
duplicate the work of the Intel Larrabee team (a software-only GPU
experiment) in an academic way (i.e. publishing everything, no matter how
"bad").

Jeff sought an answer to the question of why the Larrabee team were, ahem,
not "permitted" to publish GPU benchmarks for their work, despite it having
high-end supercomputer-grade Vector Processing capability.

i spent several months in discussion with him, i really enjoyed the
conversations.  we established that if you were to deploy a *standard*
Vector Processor General Purpose ISA and engine (Nyuzi, Cray, MMX/SSE/AVX,
RISCV RVV), with *zero* special custom hardware for 3D (so, no custom
texturisation, no custom z buffers, no special tiled memory or associated
pixel opcodes) the performance/watt that you would get would be a QUARTER
of current commercial GPUs.

in other words you need either four times the silicon (four times the power
consumption) just to be on par with current commercial GPUs, or you have to
sell (only if completely delusional) something that's 25% the performance.

therefore, we have learned from that lesson, and will not be following that
exact route either :)

> If you want to vectorize in hardware and actually get serious performance
> out of it, I highly doubt his plan will work.  That said, I wasn't planning
> to work on it so none of this is my problem so you're welcome to take or
> leave anything I say. :-)


:)


> So, since it may not have been clearly explained before, the GPU we're
> building has masked vectorization like most other GPUs; it's just that
> it additionally supports the masked vectors' elements being 1 to 4
> element subvectors.


further: this is based on RVV (RISCV Vectors) which in turn is based on the
Cray Vector system.

the plan is to *begin* from this base, and, following the strategy that's
documented in Jeff Bush's 2016 paper, assess performance based on
pixels/clock and also, again, following Jeff's work, keep a Seriously Close
Eye on the power consumption.

(we've already added 128 registers, for example, because on GPU workloads,
which are heavily LD-compute-ST on discontiguous memory areas, you
absolutely cannot afford the power penalty of swapping out large numbers of
registers through the L1/L2 cache barrier)

Jeff's strategies we will use as *iterative* guides to making improvements,
just as he did.  he actually went through seven different designs (maybe 8
if you include the ChiselGPU triangle raster engine he wrote).

> If it turns out that using subvectors makes the GPU slower, we can add
> a scalarization pass before the SIMT to vector translation, converting
> everything to using more conventional operations.


yes, exactly.  and that would be one of the kinds of tasks for which the
NLNet funding is available.

so that would be one very good example of something that would be assessed
using Jeff Bush's methodology.

what's nice about this is: it's literally an opportunity for a Software
Engineer working on MESA, instead of saying "damnit these hardware
engineers really messed up, i feel totally powerless to fix it", to say
"this isn't good enough! i need instruction X to get better performance!"
and, instead of hearing "sorry we taped out already, deal with it, derwood",
we go, "okay, great, give us 2 weeks and you can test out a new instruction
X. start writing code to use it!"

i know that there is someone out there who, on reading this, is going to go
"cool! and the actual hardware's libre too, and.. wait... i get money for
this???"

:)

so, jason, i'd like to emphasise again just how grateful i am that
you raised the issue of subvectors, because now we can put it on the list
of things to watch out for and experiment with.

and, just to be clear: we've already had this iterative approach approved
by NLNet: to start from a known-good (highly suboptimal but Vulkan
Compliant) driver and to experiment with designs (hopefully not at the
microarchitectural level) and instructions (a lot) and change the ISA
(hopefully not a lot), to, over time, reach commercially acceptable
performance.

and it's entirely libre.  paid...and libre.  who knew _that_ would ever
happen in the GPU world?

l.




-- 
---
crowd-funded eco-conscious hardware: 

Re: [Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

2020-01-13 Thread Jacob Lifshay
On Mon, Jan 13, 2020 at 9:39 AM Jason Ekstrand  wrote:
>
> On Mon, Jan 13, 2020 at 11:27 AM Luke Kenneth Casson Leighton  
> wrote:
>> jason i'd be interested to hear your thoughts on what jacob wrote, does it 
>> alleviate your concerns, (we're not designing hardware specifically around 
>> vec2/3/4, it simply has that capability).
>
>
> Not at all.  If you just want a SW renderer that runs on RISC-V, feel free to 
> write one.  If you want to vectorize in hardware and actually get serious 
> performance out of it, I highly doubt his plan will work.  That said, I 
> wasn't planning to work on it so none of this is my problem so you're welcome 
> to take or leave anything I say. :-)

So, since it may not have been clearly explained before, the GPU we're
building has masked vectorization like most other GPUs; it's just that
it additionally supports the masked vectors' elements being 1 to 4
element subvectors.
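to illustrate the idea in the paragraph above, here is a minimal Python
sketch (not the actual ISA or any real implementation; the function name,
operand layout and mask encoding are all invented for illustration) of a
masked vector operation whose elements are themselves 1-to-4-element
subvectors, with one predicate bit per subvector:

```python
# Hypothetical sketch: a masked elementwise add where each vector
# "element" is a subvector of 1..4 scalars (e.g. a vec2 or vec4).
# One mask bit controls a whole subvector, not a single scalar.

def masked_subvector_add(a, b, mask, subvec_len):
    """a, b: flat scalar lists of length len(mask) * subvec_len.
    mask: one predicate bit per subvector element.
    subvec_len: 1..4 scalars per element (vec1..vec4)."""
    out = list(a)  # masked-off elements keep their previous values
    for i, active in enumerate(mask):
        if active:
            base = i * subvec_len
            for j in range(subvec_len):
                out[base + j] = a[base + j] + b[base + j]
    return out

# Two vec2 elements; only the first lane's predicate bit is set.
print(masked_subvector_add([1, 2, 3, 4], [10, 20, 30, 40], [1, 0], 2))
# -> [11, 22, 3, 4]
```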

If it turns out that using subvectors makes the GPU slower, we can add
a scalarization pass before the SIMT to vector translation, converting
everything to using more conventional operations.
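the scalarization pass described above could look something like this
minimal Python sketch (the IR tuple shape and opcode names are invented
for illustration, not Kazan's or NIR's actual representation): each
operation on an n-element subvector is expanded into n ordinary scalar
operations, after which only conventional masked-vector ops remain.

```python
# Hypothetical sketch of a scalarization pass: rewrite ops on vec2/3/4
# subvectors into plain scalar ops before SIMT-to-vector translation.
# An op is (opcode, dst, src1, src2, subvec_len).

def scalarize(ops):
    out = []
    for opcode, dst, a, b, n in ops:
        if n == 1:
            out.append((opcode, dst, a, b))       # already scalar
        else:
            for k in range(n):                    # one scalar op per lane
                out.append((opcode, f"{dst}.{k}", f"{a}.{k}", f"{b}.{k}"))
    return out

print(scalarize([("add", "v0", "v1", "v2", 3)]))
# -> [('add', 'v0.0', 'v1.0', 'v2.0'),
#     ('add', 'v0.1', 'v1.1', 'v2.1'),
#     ('add', 'v0.2', 'v1.2', 'v2.2')]
```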

Jacob
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

2020-01-13 Thread Jason Ekstrand
On Mon, Jan 13, 2020 at 11:27 AM Luke Kenneth Casson Leighton 
wrote:

>
>
> On Monday, January 13, 2020, Jacob Lifshay 
> wrote:
>
>> On Thu, Jan 9, 2020 at 3:56 AM Luke Kenneth Casson Leighton
>>  wrote:
>> >
>>
>> > jacob perhaps you could clarify, here?
>>
>> So the major issue with the approach AMDGPU took where the SIMT to
>> predicated vector translation is done by the LLVM backend is that LLVM
>> doesn't really maintain a reducible CFG, which is needed to correctly
>> vectorize the code without devolving to a switch-in-a-loop.
>>
>
Welcome to working on GPUs.  :-)


> Hopefully, that all made sense. :)
>
>
> yes :) as you're actively designing this you have a way better handle on
> it.
>
> also, therefore, to be clear, to anyone interested in receiving funding to
> do this work, you can see that there will be someone else to work with who
> knows what they're doing, technically.
>
> thank you jacob.
>
> jason i'd be interested to hear your thoughts on what jacob wrote, does it
> alleviate your concerns, (we're not designing hardware specifically around
> vec2/3/4, it simply has that capability).
>

Not at all.  If you just want a SW renderer that runs on RISC-V, feel free
to write one.  If you want to vectorize in hardware and actually get
serious performance out of it, I highly doubt his plan will work.  That
said, I wasn't planning to work on it so none of this is my problem so
you're welcome to take or leave anything I say. :-)

--Jason


Re: [Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

2020-01-13 Thread Luke Kenneth Casson Leighton
On Monday, January 13, 2020, Jacob Lifshay  wrote:

> On Thu, Jan 9, 2020 at 3:56 AM Luke Kenneth Casson Leighton
>  wrote:
> >
>
> > jacob perhaps you could clarify, here?
>
> So the major issue with the approach AMDGPU took where the SIMT to
> predicated vector translation is done by the LLVM backend is that LLVM
> doesn't really maintain a reducible CFG, which is needed to correctly
> vectorize the code without devolving to a switch-in-a-loop.




> Hopefully, that all made sense. :)


yes :) as you're actively designing this you have a way better handle on it.

also, therefore, to be clear, to anyone interested in receiving funding to
do this work, you can see that there will be someone else to work with who
knows what they're doing, technically.

thank you jacob.

jason i'd be interested to hear your thoughts on what jacob wrote, does it
alleviate your concerns, (we're not designing hardware specifically around
vec2/3/4, it simply has that capability).

l.



-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


Re: [Mesa-dev] NLNet Funded development of a software/hardware MESA driver for the Libre GPGPU

2020-01-13 Thread Jacob Lifshay
On Thu, Jan 9, 2020 at 3:56 AM Luke Kenneth Casson Leighton
 wrote:
>
> On 1/9/20, Jason Ekstrand  wrote:
> >> 2. as a flexible Vector Processor, soft-programmable, then over time if
> >> the industry moves to dropping vec4, so can we.
> >>
> >
> > That's very nice.  My primary reason for sending the first e-mail was that
> > SwiftShader vs. Mesa is a pretty big decision that's hard to reverse after
> > someone has poured several months into working on a driver and the argument
> > you gave in favor of Mesa was that it supports vec4.
>
> not quite :)  i garbled it (jacob spent some time explaining it, a few
> months back, so it's 3rd hand if you know what i mean).  what i can
> recall of what he said was: it's something to do with the data types,
> particularly predication, being maintained as part of SPIR-V (and
> NIR), which, if you drop that information, you have to use
> auto-vectorisation and other rather awful tricks to get it back when
> you get to the assembly level.
>
> jacob perhaps you could clarify, here?

So the major issue with the approach AMDGPU took, where the SIMT to
predicated vector translation is done by the LLVM backend, is that LLVM
doesn't really maintain a reducible CFG, which is needed to correctly
vectorize the code without devolving to a switch-in-a-loop. This
kinda-sorta works for AMDGPU because the backend can specifically tell
the optimization passes to try to maintain a reducible CFG. However,
that won't work for Libre-RISCV's GPU because we don't have a separate
GPU ISA (it's just RISC-V or Power, we're still deciding), so the
backends don't tell the optimization passes that they need to maintain
a reducible CFG. Additionally, the AMDGPU vectorization is done as
part of the translation from LLVM IR to MIR, which makes it very hard
to adapt to a different ISA.

Because of all of those issues, I decided that it would be better to
vectorize before translating to LLVM IR, since that way, the CFG
reducibility can be easily maintained. This also gives the benefit that
it's much easier to substitute a different backend compiler such as
gccjit or cranelift, since all of the required SIMT-specific
transformations are already completed before the code goes to the
backend.

Both NIR and the IR I'm currently implementing in Kazan (the non-Mesa
Vulkan driver for libre-riscv) maintain a reducible CFG throughout the
optimization process. In fact, the IR I'm implementing can't express
non-reducible CFGs, since it's built as a tree of loops and code blocks
where control transfer operations can only continue a loop or exit a
loop or block. Switches work by having a nested set of blocks, and the
switch instruction picks which block to break out of.
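a tree-of-loops-and-blocks IR of the kind described can be sketched in a
few lines of Python (all class and function names here are invented for
illustration; this is not Kazan's actual IR). Because every Break or
Continue must target an *enclosing* construct, arbitrary gotos, and
therefore irreducible CFGs, simply cannot be written down:

```python
# Hypothetical sketch of a structured IR: a tree of Blocks and Loops
# where the only control transfers are "break out of an enclosing
# Block/Loop" and "continue an enclosing Loop".

from dataclasses import dataclass, field

@dataclass
class Block:                 # break targeting a Block jumps past its end
    body: list = field(default_factory=list)

@dataclass
class Loop:                  # continue re-enters; break exits
    body: list = field(default_factory=list)

@dataclass
class Break:
    target: object           # must be an enclosing Block or Loop

@dataclass
class Continue:
    target: object           # must be an enclosing Loop

def check(node, enclosing=()):
    """True iff every Break/Continue targets an enclosing construct --
    the property that makes irreducible CFGs inexpressible."""
    if isinstance(node, Break):
        return node.target in enclosing
    if isinstance(node, Continue):
        return isinstance(node.target, Loop) and node.target in enclosing
    if isinstance(node, (Block, Loop)):
        return all(check(c, enclosing + (node,)) for c in node.body)
    return True              # plain instructions transfer no control

# A switch as nested blocks: case 1 breaks out of the outer block,
# skipping the case-2 code that follows the inner block.
outer = Block()
case1 = Block(body=["...case-1 code...", Break(target=outer)])
outer.body = [case1, "...case-2 code..."]
print(check(outer))  # -> True
```

a branch into the middle of a sibling block (the classic irreducible
shape) would need a Break whose target is not enclosing, which `check`
rejects.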

Hopefully, that all made sense. :)

Jacob Lifshay