Re: [Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread GILLIBERT, Andre
Brodie Gaslam wrote:
> André,

> I'm not an R core member, but happen to have looked a little bit at this
> issue myself.  I've seen similar things on Skylake and Coffee Lake 2
> (9700, one generation past your latest) too.  I think it would make sense
> to have some handling of this, although I would want to show the trade-off
> with performance impacts on CPUs that are not affected by this, and on
> vectors that don't actually have NAs and similar.  I think the performance
> impact is likely to be small so long as branch prediction is active, but
> since branch prediction is involved you might need to check with different
> ratios of NAs (not for your NA bailout branch, but for e.g. interaction
> of what you add and the existing `na.rm=TRUE` logic).

For operators such as '+', randomly placed NAs could slow down AMD processors
because of frequent branch mispredictions. However, the functions using long
doubles are mainly "cumulative" functions such as sum(), mean(), cumsum(), etc.,
which can stop at the first NA found, as sketched below.
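
To make that concrete, here is a minimal C sketch of such a bailout loop for the
na.rm=FALSE case (this is not R's actual summary.c code; the function name and
signature are invented, and infinity handling is deferred to the next point):

    #include <math.h>
    #include <stddef.h>

    /* Minimal sketch (not R's summary.c): the na.rm=FALSE accumulation
     * loop returns as soon as it meets an NA/NaN, so the slow x87
     * special-value arithmetic is never entered. */
    static double sum_no_narm(const double *x, size_t n)
    {
        long double s = 0.0L;          /* extended precision, as R uses */
        for (size_t i = 0; i < n; i++) {
            if (isnan(x[i]))           /* NA_real_ is a NaN at the C level */
                return x[i];           /* propagate the first NA/NaN found */
            s += x[i];
        }
        return (double) s;
    }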

> You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
> which might complicate the logic a fair bit.

When an infinity is found, the rest of the vector must be searched for
infinities of the opposite sign or for NAs/NaNs.
Mixing NaNs and NAs has always been platform-dependent in R, but, on x87, the
first NA/NaN encountered seems to win. Being consistent with that behavior is
easy.
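
As a rough illustration, a hypothetical helper (not taken from R's sources)
could finish that scan once the running sum has already become +Inf or -Inf:

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical helper: the running sum became +Inf or -Inf at index i.
     * Finite values can no longer change the result, so only NaN/NA and
     * infinities of the opposite sign need to be detected. */
    static double finish_after_inf(const double *x, size_t i, size_t n,
                                   double inf)
    {
        for (; i < n; i++) {
            if (isnan(x[i]))
                return x[i];           /* first NA/NaN encountered wins */
            if (isinf(x[i]) && x[i] != inf)
                return NAN;            /* Inf + (-Inf) gives NaN */
        }
        return inf;
    }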

I wrote a first patch for the sum() function, taking into account all those
special +Inf/-Inf/NA/NaN cases. Without any NAs, the sum() function was 1.6%
slower with my first patch on a Celeron J1900. Not a big deal, in my opinion.

However, I could easily compensate for that loss (and much more) by improving
the code.
By splitting the na.rm=TRUE and na.rm=FALSE code paths (sketched below), I could
save a useless FP comparison (for NaN), a useless CMOV (for the 'updated'
variable) and useless SSE->memory->x87 moves (high latency).
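
A simplified sketch of that split, with invented names and without the
micro-architectural details:

    #include <math.h>
    #include <stddef.h>

    /* Sketch of separate code paths for na.rm=TRUE and na.rm=FALSE
     * (invented names); a real patch would also fold the NA bailout and
     * the infinity handling shown above into the na.rm=FALSE loop. */
    static long double sum_split(const double *x, size_t n, int narm)
    {
        long double s = 0.0L;
        if (narm) {
            for (size_t i = 0; i < n; i++)  /* na.rm=TRUE: skip NA/NaN */
                if (!isnan(x[i]))
                    s += x[i];
        } else {
            for (size_t i = 0; i < n; i++)  /* na.rm=FALSE: plain accumulation */
                s += x[i];
        }
        return s;
    }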

My new code is 1.75 times faster for sum(big_vector_without_NAs, na.rm=FALSE)
and 1.35 times faster for sum(big_vector_without_NAs, na.rm=TRUE).

Of course, it is much faster when there are NAs, because it stops at the first
NA found. For infinities on AMD CPUs, it is not necessarily faster.

-- 
Sincerely
André GILLIBERT


Re: [Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread brodie gaslam via R-devel


André,

I'm not an R core member, but happen to have looked a little bit at this
issue myself.  I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too.  I think it would make sense
to have some handling of this, although I would want to show the trade-off
with performance impacts on CPUs that are not affected by this, and on
vectors that don't actually have NAs and similar.  I think the performance
impact is likely to be small so long as branch prediction is active, but
since branch prediction is involved you might need to check with different
ratios of NAs (not for your NA bailout branch, but for e.g. interaction
of what you add and the existing `na.rm=TRUE` logic).

You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
which might complicate the logic a fair bit.

Presumably the x87 FPU will remain common for a long time, but if there
was reason to think otherwise, then the value of this becomes
questionable.

Either way, I would probably wait to see what R Core says.

For reference this 2012 blog post[1] discusses some aspects of the issue,
including that at least "historically" AMD was not affected.

Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:

    example(numToBits)   # for `bitC`
    bitC(NA_real_)
    ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
    bitC(NA_real_ + 0)
    ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling, after any
operation[2].

This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations.  This happens sometimes under some
optimization levels.  I don't know of any benefit of starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used.  If folks are interested I can provide a patch to turn the NA
quiet by default.
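
For concreteness, here is a small standalone C sketch (not R's arithmetic.c) of
the two encodings: NA_real_ is a NaN carrying the payload 1954 in its low word,
and the signaling and quiet variants differ only in the leading significand bit
(high word 0x7FF00000 vs 0x7FF80000).  On a typical x86-64/SSE build the bit
pattern survives plain copies, which is what the bitC() output above shows:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch, not R's arithmetic.c: build NA_real_-like NaNs carrying the
     * payload 1954.  quiet = 0 leaves the leading significand bit at 0
     * (signaling); quiet = 1 sets it (0x7FF80000 in the high word). */
    static double make_na(int quiet)
    {
        uint64_t bits = ((uint64_t)(quiet ? 0x7FF80000u : 0x7FF00000u) << 32)
                        | 1954u;
        double x;
        memcpy(&x, &bits, sizeof x);   /* assumes IEEE 754 binary64 doubles */
        return x;
    }

    int main(void)
    {
        double s_na = make_na(0), q_na = make_na(1);
        uint64_t sb, qb;
        /* copy the raw bits back out; plain loads/stores keep the pattern
         * as long as no arithmetic touches the values */
        memcpy(&sb, &s_na, sizeof sb);
        memcpy(&qb, &q_na, sizeof qb);
        printf("signaling NA: %016" PRIX64 "\n", sb);  /* 7FF00000000007A2 */
        printf("quiet NA:     %016" PRIX64 "\n", qb);  /* 7FF80000000007A2 */
        return 0;
    }

Presumably such a patch would amount to building the quiet form when R
initializes its NA constant.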

Best,

B.

[1]: 
https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
[2]: https://en.wikipedia.org/wiki/NaN#Encoding





> On Thursday, September 30, 2021, 06:52:59 AM EDT, GILLIBERT, Andre 
>  wrote:
>
> Dear R developers,
>
> By default, R uses the "long double" data type to get extra precision for 
> intermediate computations, with a small performance tradeoff.
>
> Unfortunately, on all Intel x86 computers I have ever seen, long doubles 
> (implemented in the x87 FPU) are extremely slow whenever a special 
> representation (NA, NaN or infinities) is used, probably because it triggers
> poorly optimized microcode in the CPU firmware. A function such as sum()
> becomes more than a hundred times slower!
> Test code:
> a=runif(1e7);system.time(for(i in 1:100)sum(a))
> b=a;b[1]=NA;system.time(sum(b))
>
> The slowdown factors are as follows on a few Intel CPUs:
>
> 1)  Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64 bits : 140 
> times slower with NA
>
> 2)  Pentium G4400 (Skylake, 6th generation) with R 64 bits : 150 times 
> slower with NA
>
> 3)  Pentium G3220 (Haswell, 4th generation) with R 64 bits : 130 times 
> slower with NA
>
> 4)  Celeron J1900 (Atom Silvermont) with R 64 bits : 45 times slower with 
> NA
>
> I do not have access to more recent Intel CPUs, but I doubt that it has 
> improved much.
>
> Recent AMD CPUs have no significant slowdown.
> There is no significant slowdown on Intel CPUs (more recent than Sandy 
> Bridge) for 64-bit floating point calculations based on SSE2. Therefore,
> operators using doubles, such as '+', are unaffected.
>
> I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe 
> somebody can test.
>
> Since NAs are not rare in real life, I think that it would be worth an extra
> check in functions based on long doubles, such as sum(). The check for
> special representations does not necessarily have to be done at each iteration
> for cumulative functions.
> If you are interested, I can write a bunch of patches to fix the main 
> functions using long doubles: cumsum, cumprod, sum, prod, rowSums, colSums, 
> matrix multiplication (matprod="internal").
>
> What do you think of that?
>
> --
> Sincerely
> André GILLIBERT
>



[Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread GILLIBERT, Andre
Dear R developers,

By default, R uses the "long double" data type to get extra precision for 
intermediate computations, with a small performance tradeoff.

Unfortunately, on all Intel x86 computers I have ever seen, long doubles 
(implemented in the x87 FPU) are extremely slow whenever a special 
representation (NA, NaN or infinities) is used, probably because it triggers
poorly optimized microcode in the CPU firmware. A function such as sum()
becomes more than a hundred times slower!
Test code:
a=runif(1e7);system.time(for(i in 1:100)sum(a))
b=a;b[1]=NA;system.time(sum(b))

The slowdown factors are as follows on a few Intel CPUs:

1)  Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64 bits : 140 
times slower with NA

2)  Pentium G4400 (Skylake, 6th generation) with R 64 bits : 150 times 
slower with NA

3)  Pentium G3220 (Haswell, 4th generation) with R 64 bits : 130 times 
slower with NA

4)  Celeron J1900 (Atom Silvermont) with R 64 bits : 45 times slower with NA

I do not have access to more recent Intel CPUs, but I doubt that it has 
improved much.

Recent AMD CPUs have no significant slowdown.
There is no significant slowdown on Intel CPUs (more recent than Sandy Bridge) 
for 64-bit floating point calculations based on SSE2. Therefore, operators
using doubles, such as '+', are unaffected.

I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe somebody 
can test.

Since NAs are not rare in real life, I think that it would be worth an extra
check in functions based on long doubles, such as sum(). The check for special
representations does not necessarily have to be done at each iteration for
cumulative functions (a block-wise check is sketched after this paragraph).
If you are interested, I can write a bunch of patches to fix the main functions 
using long doubles: cumsum, cumprod, sum, prod, rowSums, colSums, matrix 
multiplication (matprod="internal").
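
As a hedged sketch of what such a block-wise check could look like (invented
names, arbitrary block size): the hot loop tests nothing per element and only
inspects the accumulator once per block, so at worst one block is summed at the
slow x87 speed before a careful fallback takes over.  Infinities would need the
extra care discussed elsewhere in the thread.

    #include <math.h>
    #include <stddef.h>

    #define BLOCK 1024                 /* arbitrary block size for the sketch */

    /* Hypothetical slow path, entered only when a block went wrong:
     * element-by-element handling of NA/NaN from index i, starting from
     * the clean partial sum s. */
    static double careful_sum(long double s, const double *x,
                              size_t i, size_t n)
    {
        for (; i < n; i++) {
            if (isnan(x[i]))
                return x[i];           /* first NA/NaN encountered wins */
            s += x[i];
        }
        return (double) s;
    }

    /* Sketch of the block-wise check: the hot loop tests nothing per
     * element; the accumulator is inspected once per block. */
    static double sum_blockwise(const double *x, size_t n)
    {
        long double s = 0.0L;
        for (size_t i = 0; i < n; i += BLOCK) {
            size_t end = (i + BLOCK < n) ? i + BLOCK : n;
            long double before = s;    /* clean partial sum before the block */
            for (size_t j = i; j < end; j++)
                s += x[j];
            if (!isfinite(s))          /* NA/NaN/Inf appeared in this block */
                return careful_sum(before, x, i, n);
        }
        return (double) s;
    }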

What do you think of that?

--
Sincerely
André GILLIBERT

