[racket-dev] TR optimizations and Racket overhead [was Re: [plt] Push #28633: master branch updated]

Neil Toronto Mon, 28 Apr 2014 13:39:15 -0700

On 04/28/2014 04:46 AM, Sam Tobin-Hochstadt wrote:


On Apr 28, 2014 12:16 AM, "Neil Toronto" <neil.toro...@gmail.com
<mailto:neil.toro...@gmail.com>> wrote:
 >
 > For anyone wondering what this stuff means: most mixed exact/inexact
math in Racket should be up to 5x or so faster now, depending on how
conversion-heavy it is and how large the numbers are. On my machine,
`real->double-flonum` is 1x or 25x faster on exact rationals (1x if the
numerator and denominator both fit in a flonum; we optimized the slow
path), and `inexact->exact` is 4x-100x faster (depending on the
magnitude of the flonum's exponent). Also, conversions from exact to
inexact now always find the closest approximation, so e.g. `sqrt` is a
little more accurate when given exact rationals.

Sounds great.

 > FWIW, the C implementations aren't much faster than the Typed Racket
prototypes. I find that totally awesome.

Impressive! Can you share the prototype?


Attached.

Also, what ends up slower? What would we have to fix to make them the
same speed?

After looking into it more, I've found that TR's optimizer isn't helpingmuch, probably because big integer operations account for most of thecost. Its largest effect seems to be reducing constant overhead in thetiming loops - which at least makes my analysis more accurate. :)

IOW, I think the prototypes are fast mostly because Racket is fast.There's probably not a lot more you can do to speed them up.


More details, as measured on my machine:

For rational-to-flonum conversions, the TR implementation is about 13%slower than C. It uses a different method to determine whether it canuse the fast algorithm (i.e. divide two flonums), which requires anextra conversion back to big integers, and this apparently accounts forabout 3% of the extra cost. Otherwise, it uses exactly the samealgorithm - it even mutates variables like in the C code.

It's harder to analyze flonum-to-rational conversions, because the TRimplementation has to use `real->floating-point-bytes` and`integer-bytes->integer`, but the C code can get the flonum's bits usinga cast. When I fix the input to get rid of this variable, the TRimplementation is 2% slower on numbers with negative exponents, and 10%*faster* on numbers with positive exponents. I suspect that when it'sfaster, it's because the C code is checking more cases, because the testcompares `inexact->exact` with `flonum->rational`.

In all, in these two programs, I think Racket overhead accounts for upto 10% of running time for TR-optimized code.


****

I think you might be interested to see how TR's optimizer does with codethat does a ridiculous amount of non-homogenous floating point. Therobust geometric primitives I'm working on are a perfect test case: Iuse only one data structure besides short flvectors and flonums (aplane, which contains one of each), and most computations are densesequences of flonum arithmetic that doesn't just map one operation overa vector.



   1 million applications on typical domain, in ms

  Function              #:no-optimize  opt'd   frac
  -------------------------------------------------
  flv3mag^2                      153     97     63%
  flv3mag                        273    192     70%
  flv3dot                        185    112     61%
  flv3cross                      386    273     71%
  flv3normalize                  830    571     69%
  flv3dist^2                     190    132     69%
  flv3dist                       311    220     71%
  flv3polygon-perp              1542   1136     74%
  flv3polygon-normal            2370   1710     72%
  flv3polygon-centroid           805    550     68%
  flplane3-point-dist            334    236     71%
  flplane3-line-isect-time      1102    955     87%
  flplane3-line-isect (seg)      747    533     71%
  flplane3-line-isect (no-seg)  1502   1079     72%

So I can get about 9.8 million robust dot products per second out of TR,and 5.4 million out of Racket.

These functions return results with <= 0.5ulp error by usingdouble-double representation (i.e. using two flonums per number inintermediate results). This makes them about 5x slower thanstraightforward implementations. I could probably get 45 million dirtydot products per second out of TR and 27 million out of Racket.

From these two cases (flonum/rational conversion and geometricprimitives) and other math-y things I've done in TR and Racket, Iconclude that


 1. When dealing with platform data types like flonums and small
    integers, it's not too hard to write TR code that's close to C in
    performance.

 2. When dealing with non-platform data types like big integers and
    exact rationals, even unoptimized Racket is close to C in
    performance.

I can't help but wonder how many of Racket's numeric primitives we couldmove into TR, and what it would take to do so.


Neil ⊥

#lang typed/racket

(require math/flonum)

(define flonum-implicit-one (arithmetic-shift 1 52))
(define flonum-subnormal-exp (arithmetic-shift 1 1074))

(define foo (let ([x  (random 51239)])
              (- x x)))

(: flonum->rational (-> Flonum Exact-Rational))
(define (flonum->rational x)
  ;; bits 63, 52-62, and 0-51
  (define-values (s e m) (flonum->fields x))
  (define r
    (cond [(= e 2047)
           ;; Special numbers aren't rational
           (raise-type-error 'flonum->rational "rational Flonum" x)]
          [(zero? e)
           ;; Subnormal numbers all have the same exponent, 2^-1074 = +min.0
           (/ m flonum-subnormal-exp)]
          [else
           ;; Adjust exponent for bias and mantissa size; add implicit one bit 
to mantissa
           (let ([e  (- e (+ 1023 52))]
                 [m  (bitwise-ior m flonum-implicit-one)])
             ;; Compute m * 2^e
             (if (< e 0)
                 (/ m (arithmetic-shift 1 (- e)))
                 (arithmetic-shift m e)))]))
  (if (zero? s) r (- r)))

;; 
===================================================================================================
;; Tests

(require math/base)

(define +one (flonum->ordinal 1.0))
(define -max (flonum->ordinal -max.0))
(define +inf (flonum->ordinal +inf.0))
(define (random-flonum)
  (ordinal->flonum (random-integer -max +inf)))

(: xs (Listof Flonum))
(define xs (list* -max.0 -1.0 -min.0 -0.0 0.0 +min.0 1.0 +max.0
                  (build-list 1000000 (λ (_) (random-flonum)))))
(define: bx : Any  #f)

(printf "Randomized testing of flonum->rational~n")
(time
 (for ([x  (in-list xs)])
   (unless (= (flonum->rational x) (inexact->exact x))
     (error 'bad "bad: ~v" x))))
(newline)

(printf "Speed test: inexact->exact~n")
(for ([_  (in-range 5)])
  (time (for ([x  (in-list xs)])
          (set! bx (inexact->exact x)))))
(newline)

(printf "Speed test: flonum->rational~n")
(for ([_  (in-range 5)])
  (time (for ([x  (in-list xs)])
          (set! bx (flonum->rational x)))))
(newline)

#lang typed/racket

(require racket/flonum)

(: make-flonum (-> Integer Integer Flonum))
(define (make-flonum e m)
  (fl* (->fl m) (flexpt 2.0 (->fl e))))

(: rounded-div (-> Natural Natural Natural))
;; (rounded-div n d) is equivalent to (round (/ n d)), but a lot faster
(define (rounded-div n d)
  (define-values (i f) (quotient/remainder n d))
  (define d/2 (arithmetic-shift d -1))
  (cond [(< f d/2)  i]
        [(> f d/2)  (+ i 1)]
        [(or (odd? d) (even? i))  i]
        [else  (+ i 1)]))

(define FLOAT_E_MIN -1074)
(define FLOAT_M_BITS 52)

(: rational->flonum (-> Exact-Rational Flonum))
(define (rational->flonum q)
  (cond
    [(negative? q)  (- (rational->flonum (- q)))]
    [else
     (define n (numerator q))
     (define d (denominator q))
     (define fn (->fl n))
     (define fd (->fl d))
     (cond
       [(and (< (abs fn) +inf.0) (= n (fl->exact-integer fn))
             (< (abs fd) +inf.0) (= d (fl->exact-integer fd)))
        ;; Numerator and denominator are exactly represented by flonums, so 
correctly rounded
        ;; division implies this result has no more than 0.5ulp error
        (/ fn fd)]
       [else
        ;; The following code was shamelessly stolen from 
racket/src/racket/src/rational.c
        (define nl (integer-length n))
        (define dl (integer-length d))
        (define p (- nl dl))
        (cond [(< p 0)  (set! n (arithmetic-shift n (- p)))]
              [else     (set! d (arithmetic-shift d p))])
        
        (when (< n d)
          (set! n (arithmetic-shift n 1))
          (set! p (- p 1)))
        
        (define shift (- p FLOAT_E_MIN))
        (when (> shift FLOAT_M_BITS)
          (set! shift FLOAT_M_BITS))
        
        (set! n (arithmetic-shift n shift))
        ;; But this part is new:
        (set! n (rounded-div (abs n) (abs d)))
        
        (make-flonum (- p shift) n)])]))

;; 
===================================================================================================
;; Tests

(require typed/rackunit
         math/base
         math/flonum)

(printf "Testing rounded-div~n")
(for ([n  (in-range 0 16)]
      [d  (in-range 1 16)])
  (check-= (round (/ n d)) (rounded-div (abs n) (abs d)) 0))

(printf "Testing single cases~n")
;; Sanity check
(check-= (rational->flonum 1/7) 0.14285714285714285 0)
(check-= (rational->flonum 9/7) 1.2857142857142858 0)
;; Cases that real->double-flonum gets wrong
(check-= (rational->flonum -13737024017780747/2813507303900) -4882.526517254422 
0)
(check-= (rational->flonum -1656/16910305547451097) -9.792844933246106e-14 0)
(newline)

(: random-rational (-> Exact-Rational))
(define (random-rational)
  (define d (random-bits (+ 1 (random 8192))))
  (cond [(zero? d)  (random-rational)]
        [else
         (* (if (< (random) 0.5) -1 1)
            (/ (random-bits (+ 1 (random 8192))) d))]))

(printf "Randomized testing of rational->flonum~n")
(define es
  (filter
   (λ (v) v)
   (for/list : (Listof Any) ([_  (in-range 10000)])
     (define ry (random-rational))
     (define y (rational->flonum ry))
     (define e (flulp-error y ry))
     (if (e . > . 0.5) (list e ry) #f))))
(check-equal? es empty)
(newline)

(printf "Speed test: 100000 conversions of large rationals~n")
(define ns (build-list 10000 (λ (_) (random-bits 8192))))
(define ds (build-list 10000 (λ (_) (random-bits 8192))))
(define qs (map / ns ds))

(define bx (flvector 0.0))

(printf "real->double-flonum~n")
(for ([_  (in-range 5)])
  (time (for* ([_  (in-range 10)]
               [q  (in-list qs)])
          (flvector-set! bx 0 (real->double-flonum q)))))
(newline)

(printf "rational->flonum~n")
(for ([_  (in-range 5)])
  (time (for* ([_  (in-range 10)]
               [q  (in-list qs)])
          (flvector-set! bx 0 (rational->flonum q)))))
(newline)

_________________________
  Racket Developers list:
  http://lists.racket-lang.org/dev

[racket-dev] TR optimizations and Racket overhead [was Re: [plt] Push #28633: master branch updated]

Reply via email to