Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-10-06 Thread Tomas Kalibera



On 10/1/21 6:07 PM, Brodie Gaslam via R-devel wrote:

On Thursday, September 30, 2021, 01:25:02 PM EDT,  
wrote:


On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:

André,

I'm not an R core member, but happen to have looked a little bit at this
issue myself.  I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too.  I think it would make sense
to have some handling of this, although I would want to show the trade-off
with performance impacts on CPUs that are not affected by this, and on
vectors that don't actually have NAs and similar.  I think the performance
impact is likely to be small so long as branch prediction is active, but
since branch prediction is involved you might need to check with different
ratios of NAs (not for your NA bailout branch, but for e.g. interaction
of what you add and the existing `na.rm=TRUE` logic).

I would want to see realistic examples where this matters, not
microbenchmarks, before thinking about complicating the code. Not all
but most cases where sum(x) returns NaN/NA would eventually result in
an error; getting to the error faster is not likely to be useful.

That's a very good point, and I'll admit I did not consider it
sufficiently.  There are examples such as `rowSums`/`colSums` where some
rows/columns evaluate to NA, so the result still contains meaningful
data.  By extension, any loop that applies `sum` to list elements where
some might contain NAs, and others not.  `tapply` or any other group-based
aggregation comes to mind.


My understanding is that arm64 does not support proper long doubles
(they are the same as regular doubles).

Mine is the same.


Then there is the issue that "long double" (where it is not equivalent to 
"double") is implemented differently on different platforms, providing 
different properties. We have run into that on Power, where "long 
double" is implemented as a sum of doubles ("double-double"). If 
we could rely just on "double", we would not have to worry about such 
things.



So code using long doubles isn't getting the hoped-for improved
precision. Since that architecture is becoming more common we should
probably be looking at replacing uses of long doubles with better
algorithms that can work with regular doubles, e.g. Kahan summation or
variants for sum.

This is probably the bigger issue.  If the future is ARM/AMD, the value of
Intel x87-only optimizations becomes questionable.

More general is the question of whether to completely replace long
double with algorithmic precision methods, at the cost of performance on
systems that do support hardware long doubles (Intel or otherwise), or
whether both code paths should be kept and selected at compile time.  Or
maybe the aggregation functions could gain a low-precision flag for simple
double-precision accumulation.

I'm curious to look at the performance and precision implications of e.g.
Kahan summation if no one has done that yet.  Some quick poking around
shows people using processor-specific intrinsics to take advantage of
advanced multi-element instructions, but I imagine R would not want to do
that.  Assuming others have not done this already, I will have a look and
report back.


Processor-specific (or even compiler-specific) code is better avoided, 
but sometimes it is possible to write portable code that is tailored to 
run fast on common platforms, while still working correctly with 
acceptable performance on others.


Sometimes one can give hints to the compiler via OpenMP pragmas to 
vectorize code and/or use vectorized instructions, e.g. when it is ok to 
ignore some specific corner cases (R uses this in mayHaveNaNOrInf to 
tell the compiler it is ok to assume associativity of addition in a 
specific loop/variable, hence allowing it to vectorize better).
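As an illustration of the kind of hint meant here (a sketch only, not the code
actually used in R's sources; the function name and pragma placement are mine),
an OpenMP simd reduction clause tells the compiler it may reorder the additions
in this particular loop:

    #include <stddef.h>

    /* Sketch: allow the compiler to vectorize this accumulation by telling
       it (via OpenMP) that the additions in this loop may be treated as
       associative.  Compile with e.g. -fopenmp or -fopenmp-simd; without
       OpenMP support the pragma is simply ignored. */
    double sum_hinted(const double *x, size_t n)
    {
        double s = 0.0;
    #pragma omp simd reduction(+:s)
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }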



Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:

  example(numToBits)   # for `bitC`
  bitC(NA_real_)
  ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
  bitC(NA_real_ + 0)
  ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling, after any
operation[2].

This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations.  This can happen under some
optimization levels.  I don't know of any benefit of starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used.  If folks are interested I can provide a patch to turn the NA
quiet by default.

In principle this might be a good idea, but the current bit pattern is
unfortunately baked into a number of packages and documents on
internals, as well as serialized objects. The work needed to sort that
out is probably not worth the effort.

Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-10-01 Thread GILLIBERT, Andre

> Mildly related (?) to this discussion, if you happen to be in a situation
> where you know something is a C NAN, but need to check if it's a proper R
> NA, the R_IsNA function is surprisingly (to me, at least) expensive to do
> in a tight loop because it calls the (again, surprisingly expensive to me)
> isnan function.  

What is your platform? CPU, OS, compiler?
How much more expensive? 5-10 times slower than the improved code you wrote, 
or 100-200 times slower?

I analyzed the C and assembly source code of R_IsNA on an x86_64 GNU/Linux 
computer (Celeron J1900) with GCC 5.4 and found that it was somewhat expensive, 
but the main problems did not seem to come from isnan.

On x86_64, isnan was responsible only for a ucomisd xmm0, xmm0 instruction 
followed by a conditional jump. This instruction is slower on NAN than on 
normal FP values, but its speed seems acceptable.
On x86_32, isnan is responsible for a fld mem64, fst mem64, fucomip and a 
conditional jump: suboptimal, but things could be worse.

On x86_64, the first problem I noticed is that R_IsNA is not inlined, and the 
register-based x86_64 Linux calling convention is not necessarily well suited 
to this problem, adding loads/stores between memory and registers.
The second problem (the worst part): the write of a 64-bit double followed by 
the read of a 32-bit integer through the ieee_double union confuses the 
compiler, which generates very poor code with unnecessary loads/stores.

The second problem can be solved by using a union with uint64_t and double 
fields, and masking with & 0xFFFFFFFF to extract the low part of the uint64_t. 
This works well on x86_64, but also on x86_32, where GCC avoids useless 
emulation of 64-bit integers and directly reads the 32-bit integer.
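A minimal sketch of that suggestion (not the code currently in R; the type and
function names below are invented, and 1954 is the NA low word documented for
R's NA_real_):

    #include <math.h>
    #include <stdint.h>

    /* Sketch: reinterpret the double as a single uint64_t and mask off the
       low 32 bits, instead of writing a double and reading a 32-bit int
       through separate union members.  The isnan() check is kept so that a
       finite double whose low word happens to be 1954 is not misclassified. */
    typedef union { double value; uint64_t bits; } double_bits;

    static inline int is_R_NA(double x)
    {
        double_bits u;
        u.value = x;
        return isnan(x) && (u.bits & 0xFFFFFFFFu) == 1954;
    }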

-- 
Sincerely
André GILLIBERT
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-10-01 Thread Brodie Gaslam via R-devel
> On Thursday, September 30, 2021, 01:25:02 PM EDT,  
> wrote:
>
>> On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:
>>
>> André,
>>
>> I'm not an R core member, but happen to have looked a little bit at this
>> issue myself.  I've seen similar things on Skylake and Coffee Lake 2
>> (9700, one generation past your latest) too.  I think it would make sense
>> to have some handling of this, although I would want to show the trade-off
>> with performance impacts on CPUs that are not affected by this, and on
>> vectors that don't actually have NAs and similar.  I think the performance
>> impact is likely to be small so long as branch prediction is active, but
>> since branch prediction is involved you might need to check with different
>> ratios of NAs (not for your NA bailout branch, but for e.g. interaction
>> of what you add and the existing `na.rm=TRUE` logic).
>
> I would want to see realistic examples where this matters, not
> microbenchmarks, before thinking about complicating the code. Not all
> but most cases where sum(x) returns NaN/NA would eventually result in
> an error; getting to the error faster is not likely to be useful.

That's a very good point, and I'll admit I did not consider it
sufficiently.  There are examples such as `rowSums`/`colSums` where some
rows/columns evaluate to NA, so the result still contains meaningful
data.  By extension, any loop that applies `sum` to list elements where
some might contain NAs, and others not.  `tapply` or any other group-based
aggregation comes to mind.

> My understanding is that arm64 does not support proper long doubles
> (they are the same as regular doubles).

Mine is the same.

> So code using long doubles isn't getting the hoped-for improved
> precision. Since that architecture is becoming more common we should
> probably be looking at replacing uses of long doubles with better
> algorithms that can work with regular doubles, e.g. Kahan summation or
> variants for sum.

This is probably the bigger issue.  If the future is ARM/AMD, the value of
Intel x87-only optimizations becomes questionable.

More general is the question of whether to completely replace long
double with algorithmic precision methods, at the cost of performance on
systems that do support hardware long doubles (Intel or otherwise), or
whether both code paths should be kept and selected at compile time.  Or
maybe the aggregation functions could gain a low-precision flag for simple
double-precision accumulation.
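For instance (purely a sketch of the compile-time option, with invented names),
the accumulator type could be chosen from the standard float.h macros, so that
platforms whose long double is no wider than double automatically fall back to
plain doubles:

    #include <float.h>
    #include <stddef.h>

    /* Sketch: use a long double accumulator only where it is genuinely wider
       than double (x87 80-bit, double-double on Power, ...); otherwise use a
       plain double accumulator -- or a compensated sum as sketched below. */
    #if LDBL_MANT_DIG > DBL_MANT_DIG
    typedef long double sum_accum_t;
    #else
    typedef double sum_accum_t;
    #endif

    double naive_sum(const double *x, size_t n)
    {
        sum_accum_t s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return (double) s;
    }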

I'm curious to look at the performance and precision implications of e.g.
Kahan summation if no one has done that yet.  Some quick poking around
shows people using processor-specific intrinsics to take advantage of
advanced multi-element instructions, but I imagine R would not want to do
that.  Assuming others have not done this already, I will have a look and
report back.
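For reference, a bare-bones sketch of Kahan (compensated) summation over plain
doubles; NA/NaN handling, na.rm and R's actual accumulation code are
deliberately left out:

    #include <stddef.h>

    /* Sketch of Kahan summation: c carries the low-order bits lost when a
       term is folded into the running sum s.  Must not be built with
       value-unsafe options such as -ffast-math, which may reassociate the
       arithmetic and optimize the compensation away. */
    double kahan_sum(const double *x, size_t n)
    {
        double s = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;   /* apply the correction carried so far */
            double t = s + y;      /* low-order bits of y may be lost here */
            c = (t - s) - y;       /* recover what was lost */
            s = t;
        }
        return s;
    }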

>> Since we're on the topic I want to point out that the default NA in R
>> starts off as a signaling NA:
>>
>> example(numToBits)   # for `bitC`
>> bitC(NA_real_)
>> ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
>> bitC(NA_real_ + 0)
>> ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010
>>
>> Notice the leading bit of the significand starts off as zero, which marks
>> it as a signaling NA, but becomes 1, i.e. non-signaling, after any
>> operation[2].
>>
>> This is meaningful because the mere act of loading a signaling NA into the
>> x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
>> actually used in arithmetic operations.  This can happen under some
>> optimization levels.  I don't know of any benefit of starting off with a
>> signaling NA, especially since the encoding is lost pretty much as soon as
>> it is used.  If folks are interested I can provide a patch to turn the NA
>> quiet by default.
>
> In principle this might be a good idea, but the current bit pattern is
> unfortunately baked into a number of packages and documents on
> internals, as well as serialized objects. The work needed to sort that
> out is probably not worth the effort.

One reason why we might not need to sort this out is precisely the
instability shown above.  Anything that relies on the signaling bit being
set to a particular value will behave differently with `NA_real_` vs
`NA_real_ + x`.  `R_IsNA` only checks the lower word, so it doesn't care
about the signaling bit or the 19 subsequent ones.  Anything that does
likely has unpredictable behavior **currently**.

Similarly, the documentation[1] only specifies the low word:

> On such platforms NA is represented by the NaN value with low-word 0x7a2
> (1954 in decimal).

This is consistent with the semantics of `R_IsNA`.
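A small stand-alone illustration of that instability (a sketch assuming the
IEEE-754 layout shown in the bit dumps above; on typical hardware any
arithmetic quiets the NaN by setting bit 51 of the pattern, while the low
word 0x7A2 survives):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* signaling NA pattern: sign 0, exponent all ones, quiet bit clear,
           low word 0x7A2 (1954) */
        uint64_t before = UINT64_C(0x7FF00000000007A2);
        double na;
        memcpy(&na, &before, sizeof na);

        double quieted = na + 0.0;   /* arithmetic quiets the NaN */
        uint64_t after;
        memcpy(&after, &quieted, sizeof after);

        printf("before: %016" PRIx64 "  low word %" PRIu32 "\n",
               before, (uint32_t) before);
        printf("after : %016" PRIx64 "  low word %" PRIu32 "\n",
               after, (uint32_t) after);
        return 0;
    }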

> It also doesn't seem to affect the performance issue here since
> setting b[1] <- NA_real_ + 0 produces the same slowdown (at least on
> my current Intel machine).

The subtlety here is that the slowdown happens by merely loading the
signaling NaN onto the x87 FPU.

Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread Gabriel Becker
Mildly related (?) to this discussion, if you happen to be in a situation
where you know something is a C NAN, but need to check if it's a proper R
NA, the R_IsNA function is surprisingly (to me, at least) expensive to do
in a tight loop because it calls the (again, surprisingly expensive to me)
isnan function.  This can happen in known-sorted ALTREP REALSXPs, where you
can easily determine the C-NAN status of all elements in the vector with a
binary search for the edge of the NANs, i.e. in O(log n) calls to R_isnan. You
could notably also determine finiteness of all elements this way with a
couple more O(log n) passes if you needed to in the sorted case.

This came up when I was developing the patch for the unique/duplicated
fastpass for known-sorted vectors (thanks to Michael for working with me on
that and putting it in); I ended up writing an NAN_IS_R_NA macro to avoid
that isnan call, since the NaN status is already known. This was necessary
(well, helpful at
least) because unique/duplicated care about the difference between NA and
NaN, while sorting and REAL_NO_NA (because ALTREP metadata/behavior is
closely linked to sort behavior) do not. In the case where you have a lot
of NAN values of solely one type or the other (by far most often because
they are all NAs and none are NaNs), the speedup was significant, as I
recall.  I don't have the numbers handy but I
could run them again if desired.
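A sketch of that idea (the actual NAN_IS_R_NA macro in the patch may look
different; the helper name below is invented): once a value is already known
to be some C NaN, only the documented low word needs to be compared, so the
isnan() call can be skipped entirely:

    #include <stdint.h>
    #include <string.h>

    /* Sketch: x is assumed to already be known to be a NaN (e.g. it lies in
       the NaN block found by binary search in a sorted vector), so checking
       the low 32 bits against 1954 (0x7A2) is enough to tell NA from NaN. */
    static inline uint32_t nan_low_word(double x)
    {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        return (uint32_t) u;
    }

    #define KNOWN_NAN_IS_R_NA(x) (nan_low_word(x) == 1954)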

~G

On Thu, Sep 30, 2021 at 10:25 AM  wrote:

> On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:
>
> >
> > André,
> >
> > I'm not an R core member, but happen to have looked a little bit at this
> > issue myself.  I've seen similar things on Skylake and Coffee Lake 2
> > (9700, one generation past your latest) too.  I think it would make sense
> > to have some handling of this, although I would want to show the
> trade-off
> > with performance impacts on CPUs that are not affected by this, and on
> > vectors that don't actually have NAs and similar.  I think the
> performance
> > impact is likely to be small so long as branch prediction is active, but
> > since branch prediction is involved you might need to check with
> different
> > ratios of NAs (not for your NA bailout branch, but for e.g. interaction
> > of what you add and the existing `na.rm=TRUE` logic).
>
> I would want to see realistic examples where this matters, not
> microbenchmarks, before thinking about complicating the code. Not all
> but most cases where sum(x) returns NaN/NA would eventually result in
> an error; getting to the error faster is not likely to be useful.
>
> My understanding is that arm64 does not support proper long doubles
> (they are the same as regular doubles). So code using long doubles
> isn't getting the hoped-for improved precision. Since that
> architecture is becoming more common we should probably be looking at
> replacing uses of long doubles with better algorithms that can work
> with regular doubles, e.g. Kahan summation or variants for sum.
>
> > You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
> > which might complicate the logic a fair bit.
> >
> > Presumably the x87 FPU will remain common for a long time, but if there
> > was reason to think otherwise, then the value of this becomes
> > questionable.
> >
> > Either way, I would probably wait to see what R Core says.
> >
> > For reference this 2012 blog post[1] discusses some aspects of the issue,
> > including that at least "historically" AMD was not affected.
> >
> > Since we're on the topic I want to point out that the default NA in R
> > starts off as a signaling NA:
> >
> > example(numToBits)   # for `bitC`
> > bitC(NA_real_)
> > ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
> > bitC(NA_real_ + 0)
> > ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010
> >
> > Notice the leading bit of the significand starts off as zero, which marks
> > it as a signaling NA, but becomes 1, i.e. non-signaling, after any
> > operation[2].
> >
> > This is meaningful because the mere act of loading a signaling NA into the
> > x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
> > actually used in arithmetic operations.  This can happen under some
> > optimization levels.  I don't know of any benefit of starting off with a
> > signaling NA, especially since the encoding is lost pretty much as soon as
> > it is used.  If folks are interested I can provide a patch to turn the NA
> > quiet by default.
>
> In principle this might be a good idea, but the current bit pattern is
> unfortunately baked into a number of packages and documents on
> internals, as well as serialized objects. The work needed to sort that
> out is probably not worth the effort.
>
> It also doesn't seem to affect the performance issue here since
> setting b[1] <- NA_real_ + 0 produces the same slowdown (at least on
> my current Intel machine).
>
> Best,
>
> luke
>
> >
> > Best,
> >
> 

Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread luke-tierney

On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:



André,

I'm not an R core member, but happen to have looked a little bit at this
issue myself.  I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too.  I think it would make sense
to have some handling of this, although I would want to show the trade-off
with performance impacts on CPUs that are not affected by this, and on
vectors that don't actually have NAs and similar.  I think the performance
impact is likely to be small so long as branch prediction is active, but
since branch prediction is involved you might need to check with different
ratios of NAs (not for your NA bailout branch, but for e.g. interaction
of what you add and the existing `na.rm=TRUE` logic).


I would want to see realistic examples where this matters, not
microbenchmarks, before thinking about complicating the code. Not all
but most cases where sum(x) returns NaN/NA would eventually result in
an error; getting to the error faster is not likely to be useful.

My understanding is that arm64 does not support proper long doubles
(they are the same as regular doubles). So code using long doubles
isn't getting the hoped-for improved precision. Since that
architecture is becoming more common we should probably be looking at
replacing uses of long doubles with better algorithms that can work
with regular doubles, e.g. Kahan summation or variants for sum.


You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
which might complicate the logic a fair bit.

Presumably the x87 FPU will remain common for a long time, but if there
was reason to think otherwise, then the value of this becomes
questionable.

Either way, I would probably wait to see what R Core says.

For reference this 2012 blog post[1] discusses some aspects of the issue,
including that at least "historically" AMD was not affected.

Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:

    example(numToBits)   # for `bitC`
    bitC(NA_real_)
    ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
    bitC(NA_real_ + 0)
    ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling, after any
operation[2].

This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations.  This can happen under some
optimization levels.  I don't know of any benefit of starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used.  If folks are interested I can provide a patch to turn the NA
quiet by default.


In principle this might be a good idea, but the current bit pattern is
unfortunately baked into a number of packages and documents on
internals, as well as serialized objects. The work needed to sort that
out is probably not worth the effort.

It also doesn't seem to affect the performance issue here since
setting b[1] <- NA_real_ + 0 produces the same slowdown (at least on
my current Intel machine).

Best,

luke



Best,

B.

[1]: https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
[2]: https://en.wikipedia.org/wiki/NaN#Encoding






On Thursday, September 30, 2021, 06:52:59 AM EDT, GILLIBERT, Andre 
 wrote:

Dear R developers,

By default, R uses the "long double" data type to get extra precision for 
intermediate computations, with a small performance tradeoff.

Unfortunately, on all Intel x86 computers I have ever seen, long doubles 
(implemented in the x87 FPU) are extremely slow whenever a special 
representation (NA, NaN or infinities) is used; probably because it triggers 
poorly optimized microcode in the CPU firmware. A function such as sum() 
becomes more than a hundred times slower!
Test code:

    a <- runif(1e7); system.time(for (i in 1:100) sum(a))
    b <- a; b[1] <- NA; system.time(sum(b))

The slowdown factors are as follows on a few Intel CPUs:

1) Pentium Gold G5400 (Coffee Lake, 8th generation) with 64-bit R: 140 times slower with NA

2) Pentium G4400 (Skylake, 6th generation) with 64-bit R: 150 times slower with NA

3) Pentium G3220 (Haswell, 4th generation) with 64-bit R: 130 times slower with NA

4) Celeron J1900 (Atom Silvermont) with 64-bit R: 45 times slower with NA

I do not have access to more recent Intel CPUs, but I doubt that it has 
improved much.

Recent AMD CPUs have no significant slowdown.
There is no significant slowdown on Intel CPUs (more recent than Sandy Bridge) 
for 64-bit floating-point calculations based on SSE2. Therefore, operators 
using doubles, such as '+', are unaffected.

I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe somebody 
can test.

Since NAs are not rare in real-life,