Re: [go-nuts] Float32 math and slice arithmetics using SIMD

2016-10-26 Thread Ian Lance Taylor
On Wed, Oct 26, 2016 at 4:19 AM, Ondrej  wrote:
>
> 1) Should there be more support for float32 mathematics in Go? The math
> package offers a wide variety of functions, but when one operates on float32
> values, one needs to convert back and forth. I would normally use f64, but it
> brings me to my second point:
>
> 2) Should there be some higher level support for SIMD? Some SIMD
> instructions are in the Go assembly and are used in some performance
> critical code. What I propose is that these are exposed to the user through
> arithmetics on slices (meaning element by element operations on slices -
> e.g. add two slices of equal type and length). I started talking about f32
> in the previous point, because halving the bitsize (which is often
> sufficient), you can better utilise your registers in SIMD and thus achieve
> almost double the speed (one could go even further and utilise half
> precision arithmetics for further speed gains). There are three ways I've
> been looking at the issue, let me present them separately:
>
> 2a) The easiest way would be through adopting something akin to gonum's
> internal package (https://github.com/gonum/internal/tree/master/asm or
> perhaps its higher level wrapper https://github.com/gonum/floats), which
> essentially covers BLAS level 1 routines in both assembly for some systems
> and pure Go, so that there's a fallback. I would envisage element-by-element
> operations (addition, subtraction, multiplication, division) on slices of
> equal length and saved either to a newly allocated slice, a third supplied
> slice, or one of the two being operated on. Maybe scalar-vector operations
> (arithmetics and comparisons, e.g. quick slice scans in database-like
> scenarios), but not much beyond that, certainly not anything resembling the
> full BLAS. This whole thing could live in an x/ package.
>
> 2b) A more drastic way could be to allow arithmetics on slices explicitly -
> e.g. a := []float64{...}; b := []float64{...}; c := a + b. This, I think,
> leads to more problems than it solves. There is little control of memory
> allocation, likely runtime issues due to mismatched lengths, difficult
> handling of errors etc. Some of these issues, however, would not be relevant
> if this was allowed on *arrays*, not slices.
>
> 2c) A more aggressive SIMD usage when analysing tight loops might be handy.
> I don't know how often people loop through a numerical slice and do a single
> operation on it. I guess this discussion is more about the compiler and does
> not affect the user like the other proposals. (And I believe gccgo already
> does this?)
>
> ---
>
> I started thinking about this once the multidimensional slice proposal was
> fully developed. Because while it's not impossible to write SIMD-aware slice
> arithmetic routines, it becomes a whole lot tougher if you introduce multiple
> dimensions, subslicing, contiguous memory, strides and other fun stuff. It
> would be great if this was resolved on a language level.
>
> Currently, if one wants fast math in Go, it usually results in linking
> against OpenBLAS or some other implementation, using cgo, sometimes wrapping
> slices in structs, there are different implementations of matrices etc. Not
> to mention that these libraries tend to get compiled natively, which breaks
> portability. I envisaged a call to CPUID and then some bool tests along the
> way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if available. All in a
> static, portable package.
>
> I don't mean to talk about specific implementation, I just want to gauge if
> this is something that's within the language scope, something that would fit
> in the x/ packages, or something that should be left for 3rd party package
> writers.

My opinions.

Different processors have different SIMD operations.  I don't see how
to usefully add SIMD support to the language proper.

Adding SIMD support to the compiler is fine in principle, as long as
there is no real effect on compilation time.  That seems
uncontroversial.

Packages for SIMD support sound appropriate for third party packages.
Since they would be processor-specific by nature, I think it would be
premature to add them to x/.

Ian

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[go-nuts] Float32 math and slice arithmetics using SIMD

2016-10-26 Thread Ondrej
Hi there,
I have two semi-proposals that sit on the border between golang-nuts and
golang-dev. I'd be happy to flesh out either of them into a full-fledged proposal if you guys
see merit in them. I know SIMD has been discussed at least four times here, 
but never to a stage where it would result in a proposal. Here are my two 
main questions:

1) Should there be more support for float32 mathematics in Go? The math 
package offers a wide variety of functions, but when one operates on 
float32 values, one needs to convert back and forth. I would normally use 
f64, but it brings me to my second point:
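
To illustrate the back-and-forth conversion mentioned in point 1, here is a minimal sketch. The wrapper names Sqrt32 and Hypot32 are hypothetical and not part of the standard library; they only show the pattern a float32 user ends up writing around the float64-only math package.

```go
package main

import (
	"fmt"
	"math"
)

// Sqrt32 is a hypothetical float32 wrapper: it widens the argument to
// float64, calls math.Sqrt, and narrows the result back to float32.
func Sqrt32(x float32) float32 {
	return float32(math.Sqrt(float64(x)))
}

// Hypot32 is a hypothetical float32 wrapper around math.Hypot.
func Hypot32(x, y float32) float32 {
	return float32(math.Hypot(float64(x), float64(y)))
}

func main() {
	var a, b float32 = 3, 4
	fmt.Println(Sqrt32(a*a + b*b))
	fmt.Println(Hypot32(a, b))
}
```

Every call pays for two conversions, and the computation itself still runs in double precision, which is the overhead the question is about.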

2) Should there be some higher level support for SIMD? Some SIMD 
instructions are in the Go assembly and are used in some performance 
critical code. What I propose is that these are exposed to the user through 
arithmetics on slices (meaning element by element operations on slices - 
e.g. add two slices of equal type and length). I started talking about f32 
in the previous point, because halving the bitsize (which is often 
sufficient), you can better utilise your registers in SIMD and thus achieve 
almost double the speed (one could go even further and utilise half 
precision arithmetics for further speed gains). There are three ways I've 
been looking at the issue, let me present them separately:

2a) The easiest way would be through adopting something akin to gonum's 
internal package (https://github.com/gonum/internal/tree/master/asm or 
perhaps its higher level wrapper https://github.com/gonum/floats), which 
essentially covers BLAS level 1 routines in both assembly for some systems 
and pure Go, so that there's a fallback. I would envisage 
element-by-element operations (addition, subtraction, multiplication, 
division) on slices of equal length and saved either to a newly allocated 
slice, a third supplied slice, or one of the two being operated on. Maybe 
scalar-vector operations (arithmetics and comparisons, e.g. quick slice 
scans in database-like scenarios), but not much beyond that, certainly not 
anything resembling the full BLAS. This whole thing could live in an x/ 
package.
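
A rough sketch of what the pure Go fallback for 2a could look like. The function names AddTo and ScaleTo, and the choice to panic on mismatched lengths, are assumptions made for illustration only; an assembly version would sit behind the same signatures.

```go
package main

import "fmt"

// AddTo stores a[i] + b[i] in dst[i] and returns dst.
// The name and the panic on mismatched lengths are assumptions.
// dst may be a fresh slice, a third slice, or one of the operands.
func AddTo(dst, a, b []float32) []float32 {
	if len(a) != len(b) || len(dst) != len(a) {
		panic("AddTo: slice lengths do not match")
	}
	for i := range a {
		dst[i] = a[i] + b[i]
	}
	return dst
}

// ScaleTo stores c * a[i] in dst[i] and returns dst (scalar-vector case).
func ScaleTo(dst []float32, c float32, a []float32) []float32 {
	if len(dst) != len(a) {
		panic("ScaleTo: slice lengths do not match")
	}
	for i, v := range a {
		dst[i] = c * v
	}
	return dst
}

func main() {
	a := []float32{1, 2, 3}
	b := []float32{10, 20, 30}

	sum := AddTo(make([]float32, len(a)), a, b) // newly allocated result
	AddTo(a, a, b)                              // result overwrites a
	fmt.Println(sum, a, ScaleTo(b, 2, b))
}
```

Passing the destination explicitly covers all three storage options from the paragraph above with a single function per operation.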

2b) A more drastic way could be to allow arithmetics on slices explicitly - 
e.g. a := []float64{...}; b := []float64{...}; c := a + b. This, I think, 
leads to more problems than it solves. There is little control of memory 
allocation, likely runtime issues due to mismatched lengths, difficult 
handling of errors etc. Some of these issues, however, would not be 
relevant if this was allowed on *arrays*, not slices.
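
The point about arrays in 2b can be sketched with today's Go: an array's length is part of its type, so a length mismatch is rejected at compile time instead of at run time. The type Vec4 and the function Add are hypothetical names; the hypothetical operator form c := a + b would be replaced here by an ordinary function call.

```go
package main

import "fmt"

// Vec4 is a hypothetical fixed-size vector type. Because the length is
// part of the array type, only values of exactly four elements fit.
type Vec4 [4]float64

// Add returns the element-wise sum of a and b. The result is returned
// by value, so no heap allocation is required by the language here.
func Add(a, b Vec4) Vec4 {
	var c Vec4
	for i := range a {
		c[i] = a[i] + b[i]
	}
	return c
}

func main() {
	a := Vec4{1, 2, 3, 4}
	b := Vec4{10, 20, 30, 40}
	fmt.Println(Add(a, b))

	// Add(a, [3]float64{1, 2, 3}) would not compile:
	// a [3]float64 value cannot be used as a Vec4.
}
```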

2c) A more aggressive SIMD usage when analysing tight loops might be handy. 
I don't know how often people loop through a numerical slice and do a 
single operation on it. I guess this discussion is more about the compiler 
and does not affect the user like the other proposals. (And I believe gccgo 
already does this?)
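
For reference, this is the kind of tight loop 2c has in mind: a single arithmetic operation applied to every element of a numerical slice. The function name scaleInPlace is made up for this sketch, and whether a given compiler actually emits SIMD instructions for it is not something the code itself guarantees.

```go
package main

import "fmt"

// scaleInPlace multiplies every element of x by c. Each iteration is
// independent of the others, which is the property an auto-vectorising
// compiler would rely on to process several elements per instruction.
func scaleInPlace(x []float32, c float32) {
	for i := range x {
		x[i] *= c
	}
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	scaleInPlace(x, 0.5)
	fmt.Println(x)
}
```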

---

I started thinking about this once the multidimensional slice proposal was 
fully developed. Because while it's not impossible to write SIMD-aware 
slice arithmetic routines, it become a whole lot tougher if you introduce 
multiple dimensions, subslicing, contiguous memory, strides and other fun 
stuff. It would be great if this was resolved on a language level.

Currently, if one wants fast math in Go, it usually results in linking 
against OpenBLAS or some other implementation, using cgo, sometimes 
wrapping slices in structs, there are different implementations of matrices 
etc. Not to mention that these libraries tend to get compiled natively, 
which breaks portability. I envisaged a call to CPUID and then some bool 
tests along the way to utilise SSE[2-4]/AVX[2] (or NEON on ARM) if 
available. All in a static, portable package.
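
The "CPUID and then some bool tests" idea could be sketched roughly as follows. Everything here is an assumption for illustration: the features struct, detectFeatures, and the commented-out addAVX2/addSSE2 names are hypothetical, and the detection is stubbed out because a real version needs either assembly or a helper package such as golang.org/x/sys/cpu (which exposes flags like cpu.X86.HasAVX2).

```go
package main

import "fmt"

// features holds hypothetical CPU capability flags.
type features struct {
	hasSSE2, hasAVX2 bool
}

// detectFeatures is a stub. A real implementation would issue CPUID
// (or read equivalent information on ARM) and fill in the flags.
func detectFeatures() features {
	return features{}
}

// addGeneric is the portable pure Go fallback.
func addGeneric(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// addImpl and implName record the implementation chosen once at start-up.
var (
	addImpl  = addGeneric
	implName = "generic"
)

func init() {
	f := detectFeatures()
	switch {
	case f.hasAVX2:
		// addImpl, implName = addAVX2, "avx2" (assembly, not shown)
	case f.hasSSE2:
		// addImpl, implName = addSSE2, "sse2" (assembly, not shown)
	}
}

// Add dispatches to whichever implementation was selected in init.
func Add(dst, a, b []float32) { addImpl(dst, a, b) }

func main() {
	a := []float32{1, 2, 3}
	b := []float32{10, 20, 30}
	dst := make([]float32, len(a))
	Add(dst, a, b)
	fmt.Println(implName, dst)
}
```

Selecting the implementation once at start-up keeps the per-call cost to one indirect function call, and the binary stays static and portable because the fallback is always compiled in.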

I don't mean to talk about specific implementation, I just want to gauge if 
this is something that's within the language scope, something that would 
fit in the x/ packages, or something that should be left for 3rd party 
package writers.

Thanks,
Ondrej
