Re: [Rd] Bug in perl=TRUE regexp matching?

2023-07-24 Thread Brodie Gaslam via R-devel




On 7/24/23 4:10 AM, Duncan Murdoch wrote:

On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:



On 7/23/23 4:29 PM, Duncan Murdoch wrote:

The help page for `?gsub` says (in the context of performance
considerations):


"... just one UTF-8 string will force all the matching to be done in
Unicode"


It's been a little while since I looked at the code but IIRC this just
means that strings are converted to UTF-8 before matching.  The problem
here seems to be more about the interpretation of the "\\w+" token by
PCRE.  I think this makes it a little clearer what's going on:

  gsub("\\w", "a", "Γ", perl=TRUE)
  [1] "Γ"

So no match.  The PCRE docs
https://www.pcre.org/original/doc/html/pcrepattern.html (these might be
the old docs, but they work for our purposes here) mention that we can
turn on Unicode property matching with the "(*UCP)" token:

   gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
   [1] "a"

So there are two layers at play here.  The first one is whether R
converts strings to UTF-8, which I think is what the documentation is
about.  The other is whether the PCRE engine is configured to recognize
Unicode properties, which, at least in both of our configurations for
this specific case, it appears it is not.


 From the surrounding context, I think the docs are talking about more 
than just conversion to UTF-8.  The full paragraph reads like this:


"If you are working in a single-byte locale (though not common since R 
4.2) and have marked UTF-8 strings that are representable in that 
locale, convert them first as just one UTF-8 string will force all the 
matching to be done in Unicode, which attracts a penalty of around

3× for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor 
of 3, so it's faster to convert everything to the local encoding.  If it 
were just conversion, I don't think that would be true.


But maybe "for the default POSIX 1003.2 mode" applies to the whole 
paragraph, not just to the penalty, so this is intentional.


Agreed, I don't think this whole issue is just about the conversion. 
What I'm trying to highlight is the distinction between what R does 
(converts input to Unicode - UTF-8 for PCRE[1], wchar_t for 
POSIX/TRE[2]), and what the regular expression engines then do (match 
that Unicode per their own semantics).  This is for the case of any 
UTF-8 in the input.


PCRE is behaving as documented[3]:

> By default, characters whose code points are greater than 127 never 
match \d, \s, or \w, and always match \D, \S, and \W, although this may 
be different for characters in the range 128-255 when locale-specific 
matching is happening. These escape sequences retain their original 
meanings from before Unicode support was available, mainly for 
efficiency reasons. If the PCRE2_UCP option is set, the behaviour is 
changed so that Unicode properties are used to determine character 
types, as follows...


So this doesn't seem like a bug to me.
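
As an aside, base R's `pcre_config()` reports how the PCRE library R is
linked against was built; if "Unicode properties" showed FALSE there,
"(*UCP)" would not be available at all.  A quick check (output is from
my build and will vary across systems):

    pcre_config()
    ##              UTF-8 Unicode properties                JIT              stack
    ##               TRUE               TRUE               TRUE              FALSE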

Does that mean that the following is incorrect?

> one UTF-8 string will force all the matching to be done in Unicode

It depends on how you want to interpret "done in".  Less ambiguous would be:

> one UTF-8 string will force all strings to be converted to Unicode 
prior to matching.
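
And to connect the two layers on the SO example quoted below, a quick
sketch: prepending the "(*UCP)" verb makes the PCRE result agree with
the TRE one:

    strings <- c("89 562", "John Smith", "Jean-Fran\u00e7ois Dupuis")
    gsub("(*UCP)\\B\\w+| +", "", strings, perl = TRUE)
    ## [1] "85"   "JS"   "J-FD"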


Best,

B

[1]: 
https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385
[2]: 
https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378

[3]: https://pcre.org/current/doc/html/pcre2pattern.html



Duncan Murdoch


Best,

B.





However, this thread on SO:  https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`.  Specifically:

  > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
"Jean-François Dupuis")
  > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
  > regex <- "\\B\\w+| +"
  > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

  > gsub(regex, "", strings, perl = TRUE)
[1] "85"  "JS"  "ΓιάννηςΠαπαδόπουλος"
"J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but it looks
like a flag may have been missed in the `perl = TRUE` case.

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bug in perl=TRUE regexp matching?

2023-07-23 Thread Brodie Gaslam via R-devel




On 7/23/23 4:29 PM, Duncan Murdoch wrote:
The help page for `?gsub` says (in the context of performance 
considerations):



"... just one UTF-8 string will force all the matching to be done in 
Unicode"


It's been a little while since I looked at the code but IIRC this just 
means that strings are converted to UTF-8 before matching.  The problem 
here seems to be more about the interpretation of the "\\w+" token by 
PCRE.  I think this makes it a little clearer what's going on:


gsub("\\w", "a", "Γ", perl=TRUE)
[1] "Γ"

So no match.  The PCRE docs 
https://www.pcre.org/original/doc/html/pcrepattern.html (these might be 
the old docs, but they work for our purposes here) mention that we can 
turn on Unicode property matching with the "(*UCP)" token:


 gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
 [1] "a"

So there are two layers at play here.  The first one is whether R 
converts strings to UTF-8, which I think is what the documentation is 
about.  The other is whether the PCRE engine is configured to recognize 
Unicode properties, which, at least in both of our configurations for 
this specific case, it appears it is not.


Best,

B.





However, this thread on SO:  https://stackoverflow.com/q/76749529 gives 
some indication that this is not true for `perl = TRUE`.  Specifically:


 > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", 
"Jean-François Dupuis")

 > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
 > regex <- "\\B\\w+| +"
 > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

 > gsub(regex, "", strings, perl = TRUE)
[1] "85"  "JS"  "ΓιάννηςΠαπαδόπουλος" 
"J-FçoD"


and the website https://regex101.com/r/QDFrOE/1 gives the first answer 
when the regex option /u ("match with full Unicode") is specified, but 
the second answer when it is not.


Now I'm not at all sure that that website is authoritative, but it looks 
like a flag may have been missed in the `perl = TRUE` case.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bug in optim for specific orders of magnitude

2022-12-26 Thread Brodie Gaslam via R-devel
FWIW, I suspect the problem might have something to do with:

> 1/1e-308
[1] 1e+308
> 1/1e-309
[1] Inf
> 1e-309 > 0
[1] TRUE
> 1e-324 > 0
[1] FALSE

That is, 1e-309 starts being treated as zero by division (its reciprocal 
overflows to Inf), but not by comparison (which happens at 1e-324).  This 
is not quite the range of values where you were seeing the anomaly, but it 
is close.

This is on a 6th gen Skylake on MacOS.  
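
A quick illustration of how odd arithmetic gets down there, including the
`x * 1.1 > x` assumption Duncan mentions below (details could vary by
platform):

    .Machine$double.xmin    # ~2.225074e-308, smallest *normal* double
    x <- 5e-324             # smallest subnormal double
    x * 1.1 > x             # FALSE: 5.5e-324 rounds back down to 5e-324
    1 / 1e-309              # Inf: the reciprocal overflows past ~1.8e308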

Best,

B.






On Monday, December 26, 2022 at 06:57:03 AM EST, Collin Erickson 
 wrote: 





Thanks for all of your replies.

I would like to clarify that I am not expecting optim to return a good
result in these cases, as I am aware of the difficulties that come with the
numerical precision for such small numbers. I would be happy for optim to
take zero steps and return the initial value after it sees the gradient as
close enough to 0. What I think is a bug is for optim to return an error
for a specific range of numbers between 1e-309 and 1e-320 while not giving
an error for values on either side. I think the behavior should be the same
as for smaller numbers where it simply returns the starting point after
estimating the gradient to be 0.

To give an idea of where I was seeing the error, consider a function that
is mostly near 0, but has some areas of interest, for example:
f <- function(x) {-exp(-sum(x^2))}

As long as the starting point is near the peak, optim works fine, but if
you give it a bad starting point it will see the function as flat and
return the starting point.

Again, it is fine that it is not able to find the optimum in the scenario
of starting far from the peak, but I believe that it should NOT return an
error when the initial value falls within a specific range. The bug is that
it returns an error, not that it fails to optimize.

Thanks,
Collin

On Fri, Dec 23, 2022 at 1:52 PM Duncan Murdoch 
wrote:

> The optim help page mentions scaling in the discussion of the "control"
> argument.  Specifically under the parscale description:
>
> "Optimization is performed on par/parscale and these should be
> comparable in the sense that a unit change in any element produces about
> a unit change in the scaled value."
>
> In your function a unit change in x[1] makes a change of 1e-317 in the
> function value, and changing x[2] has no effect at all.
>
> It would be nice if violating the rule only led to inefficiencies or
> poor stopping decisions, but the numbers you are working with are close
> to the hardware limits (the smallest positive number with full precision
> is .Machine$double.xmin, about 2e-308), and sometimes that means
> assumptions in the code about how arithmetic works are violated, e.g.
> things like x*1.1 > x may not be true for positive x below
> .Machine$double.xmin .
>
> Duncan Murdoch
>
> On 23/12/2022 12:30 p.m., Collin Erickson wrote:
> > Hello,
> >
> > I've come across what seems to be a bug in optim that has become a
> nuisance
> > for me.
> >
> > To recreate the bug, run:
> >
> > optim(c(0,0), function(x) {x[1]*1e-317}, lower=c(-1,-1), upper=c(1,1),
> > method='L-BFGS-B')
> >
> > The error message says:
> >
> > Error in optim(c(0, 0), function(x) { :
> >    non-finite value supplied by optim
> >
> > What makes this particularly treacherous is that this error only occurs
> for
> > specific powers. By running the following code you will find that the
> error
> > only occurs when the power is between -309 and -320; above and below that
> > work fine.
> >
> > p <- 1:1000
> > giveserror <- rep(NA, length(p))
> > for (i in seq_along(p)) {
> >    tryout <- try({
> >      optim(c(0,0), function(x) {x[1]*10^-p[i]}, lower=c(-1,-1),
> > upper=c(1,1), method='L-BFGS-B')
> >    })
> >    giveserror[i] <- inherits(tryout, "try-error")
> > }
> > p[giveserror]
> >
> > Obviously my function is much more complex than this and usually doesn't
> > fail, but this reprex demonstrates that this is a problem. To avoid the
> > error I may multiply by a factor or take the log, but it seems like a
> > legitimate bug that should be fixed.
> >
> > I tried to look inside of optim to track down the error, but the error
> lies
> > within the external C code:
> >
> > .External2(C_optim, par, fn1, gr1, method, con, lower,
> >          upper)
> >
> > For reference, I am running R 4.2.2, but was also able to recreate this
> bug
> > on another device running R 4.1.2 and another running 4.0.3.
> >
> > Thanks,
> > Collin Erickson
> >
> >      [[alternative HTML version deleted]]
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel

>
>

    [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] parent.frame(n) produces different result from sys.frame(sys.parent(n))

2022-09-28 Thread Brodie Gaslam via R-devel
Hi Taras,

I have not looked in detail at your examples here, but the
use of evalq and sys.parent makes me think these issues:

https://bugs.r-project.org/show_bug.cgi?id=17849
https://bugs.r-project.org/show_bug.cgi?id=15531

are possibly related.

Best,

B.



On Wednesday, September 28, 2022 at 05:29:08 AM EDT, Taras Zakharko 
 wrote: 





Dear all, 

The documentation states that parent.frame(n) is equivalent to 
sys.frame(sys.parent(n)) but I have discovered a case where they produce 
different results. Before I submit a bug report I thought it would be good to 
run it by the R community in case it’s (somehow?) expected behaviour. Consider 
the following MRE (this is R 4.2.1 running on Apple M1):

  f1 <- function() {
    f2()
  }
  
  f2 <- function() {
    f3()
  }
  
  f3 <- function() {
    evalq(check_parents(), parent.frame())
  }
  
  check_parents <- function() {
    print(vctrs::data_frame(
      call = as.list(sys.calls()),
      frame = as.list(sys.frames()),
      parent = as.list(sys.parents())
    ))
  
    print(parent.frame(2L))
    print(sys.frame(sys.parent(2L)))
  }
  
  f1()

This produces

                                     call         frame parent
1                                    f1() <environment>      0
2                                    f2() <environment>      1
3                                    f3() <environment>      2
4 evalq(check_parents(), parent.frame()) <environment>      3
5 evalq(check_parents(), parent.frame()) <environment>      4
6                         check_parents() <environment>      2
<environment>  # parent.frame(2L)
<environment>  # sys.frame(sys.parent(2L))

It seems like parent.frame(2L) resolves to frame 4 which is not part of the 
call stack of frame 6 at all. I haven’t yet looked at the C code. 

Best, 

Taras 

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] gsub() hex character range problems in R-devel?

2022-01-04 Thread Brodie Gaslam via R-devel
> On Tuesday, January 4, 2022, 02:35:50 PM EST, Martin Morgan 
>  wrote:
>
> I'm not very good at character encoding / etc so this might be user
> error. The following code is meant to replace extended ASCII characters,
> in particular a non-breaking space, with "", and it works in
> R-4-1-branch

Martin,

I'm (obviously) not R-Core, so you should take whatever I say with a grain
of salt.  Nonetheless I have run into a similar issue as you, and my
assessment is that the behavior in R-4-1-2 is due to a bug that was fixed
with -r81103 for R-devel only.  It only appears more correct due to
happenstance and "surprising" (at least to me) behavior from the
"corrected" code.

But before I get into the details, I'd be remiss not to add some warnings
about using arbitrary bytes in strings as you do here.  The strings in
your examples are not marked:

    Encoding("fo\xa0o")
    [1] "unknown"

This means internals may interpret them as being in native encoding (UTF-8
in your case, in which your string is invalid).  If you want to use byte
operations you should mark your strings as "bytes" / use the "useBytes"
parameter to the functions in question (and assume all the consequences of
generating invalid encodings), or even better translate the string from its
actual encoding to your encoding.  For your case assuming you have
ISO-8859-1 encoding (I'm just guessing) I would do:

    x <- "fo\xa0o"
    y <- iconv(x, "ISO-8859-1", "UTF-8")
    gsub("\ua0", "", y)
    [1] "foo"

You could also just have marked your string as "latin1" as for 0xA0 it is
the same as ISO-8859-1 and gotten the same result without `iconv`, but the
`iconv` solution is more general.

I'll address the two examples in reverse order as the first one
is more obvious.

> > gsub("[[:alnum:]]", "", "fo\xa0o")  # R-4-1-branch
> [1] "\xfc\xbe\x8c\x86\x84\xbc"
>
> > gsub("[[:alnum:]]", "", "fo\xa0o")  # R-devel
> [1] "<a0>"

The result in the 4-1 contains bytes not present in the input.  Clearly
this cannot be correct.  R-devel is "correct" if you account for the
surprising (to me) behavior that invalid bytes in UTF-8 interpreted
strings may be escaped in pre-processing.  This is roughly what's
happening:

    "fo\xa0o" -> "foo" -> gsub("[[:alnum:]]", "", "foo") -> "<>"

Where "" is the escaped version of the "\xa0".  It's clearer if you do
(R-devel):

    gsub("f", "", "fo\xa0o")
    [1] "oo"

I do think this "correct" behavior would be better as an error or at a
minimum a warning, and hopefully this is something that will change in the
future.

> > R.version.string
> [1] "R version 4.1.2 Patched (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> [1] "foo"
>
> but fails in R-devel
> > R.version.string
> [1] "R Under development (unstable) (2022-01-04 r81445)"
> > gsub("[\x7f-\xff]", "", "fo\xa0o")
> Error in gsub("[\177-\xff]", "", "fo\xa0o") : invalid regular expression 
> '[-�]', reason 'Invalid character range'
> In addition: Warning message:
> In gsub("[\177-\xff]", "", "fo\xa0o") :
>   TRE pattern compilation error 'Invalid character range'

This one is pretty interesting.  The same bug persists, but because it
affects both the pattern and the string to manipulate the bugs cancel out.
If you look at what's happening internally in R-4-1, the range "\x7f-\xff"
is translated to "\u7f-\U{3e3c}", but "fo\xa0o" is also translated to
"fo\U{3e30613c}o", so it happens to work.

Why "\U{3e3c}"?  Well, it's really 3e 66 66 3c, which the code
intended to have interpreted as < f f >.  In ASCII encoding, we have 3e =
<, 66 = f, 3c = >.  So the intent was to write out "", the 4 character
escape for the single byte "\xff".  Instead, the 4 bytes are written into
a single wchar_t (on systems with 32bit wchar_t) and interpreted as that
code point.

In little-endian machines like ours, the double cancellation does not
always work as seen in R-4-1-2:

    gsub("[\x7f-\xab]", "",  "\xab")
    ## [1] ""
    gsub("[\x7f-\xba]", "",  "\xab")  # changed end to be \xba
    ## [1] "\xab"

One would expect the second range to still capture the character, but
because wchar_t is interpreted little endian the order of the "a" and "b"
written into the wchar_t is opposite of what is desired.  So it would not
be possible to leave the bug in (even if it didn't cause other issues) on
the grounds it cancels itself out.

With the patch applied in R-devel, the range "[\x7f-\xff]" becomes
"[\x7f-<ff>]", which is invalid because "<" has a lower code point than
"\x7f".  Here the fix exposes the "surprisingness" of the current
behavior.

Although again, you can currently side-step all this simply by
converting everything into valid encodings and avoiding bytes
manipulation, or doing everything very carefully explicitly with "bytes"
marked strings and "useBytes=TRUE".
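
E.g., a minimal sketch of the bytes route, assuming you really do want
byte semantics:

    x <- "fo\xa0o"
    gsub("\xa0", "", x, useBytes = TRUE)   # match the raw 0xa0 byte
    ## [1] "foo"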

Best,

B.

> The R-devel sessionInfo is
>
> > sessionInfo()
> R Under development (unstable) (2022-01-04 r81445)
> Platform: x86_64-apple-darwin19.6.0 (64-bit)
> Running under: macOS Catalina 10.15.7
>
> Matrix produ

Re: [Rd] [External] Re: Workaround very slow NAN/Infinities arithmetic?

2021-10-01 Thread Brodie Gaslam via R-devel
> On Thursday, September 30, 2021, 01:25:02 PM EDT,  
> wrote:
>
>> On Thu, 30 Sep 2021, brodie gaslam via R-devel wrote:
>>
>> André,
>>
>> I'm not an R core member, but happen to have looked a little bit at this
>> issue myself.  I've seen similar things on Skylake and Coffee Lake 2
>> (9700, one generation past your latest) too.  I think it would make sense
>> to have some handling of this, although I would want to show the trade-off
>> with performance impacts on CPUs that are not affected by this, and on
>> vectors that don't actually have NAs and similar.  I think the performance
>> impact is likely to be small so long as branch prediction is active, but
>> since branch prediction is involved you might need to check with different
>> ratios of NAs (not for your NA bailout branch, but for e.g. interaction
>> of what you add and the existing `na.rm=TRUE` logic).
>
> I would want to see realistic examples where this matters, not
> microbenchmarks, before thinking about complicating the code. Not all
> but most cases where sum(x) returns NaN/NA would eventually result in
> an error; getting to the error faster is not likely to be useful.

That's a very good point, and I'll admit I did not consider it
sufficiently.  There are examples such as `rowSums`/`colSums` where some
rows/columns evaluate to NA but the result still contains meaningful
data.  By extension, any loop that applies `sum` to list elements where
some might contain NAs, and others not.  `tapply` or any other group based
aggregation comes to mind.
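
For example (a small illustration), an NA here is meaningful output
rather than a precursor to an error:

    m <- matrix(c(1, 2, NA, 4), nrow = 2)
    rowSums(m)
    ## [1] NA  6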

> My understanding is that arm64 does not support proper long doubles
> (they are the same as regular doubles).

Mine is the same.

> So code using long doubles isn't getting the hoped-for improved
> precision. Since that architecture is becoming more common we should
> probably be looking at replacing uses of long doubles with better
> algorithms that can work with regular doubles, e.g Kahan summation or
> variants for sum.

This is probably the bigger issue.  If the future is ARM/AMD, the value of
Intel x87-only optimizations becomes questionable.

More generally is the question of whether to completely replace long
double with algorithmic precision methods, at a cost of performance on
systems that do support hardware long doubles (Intel or otherwise), or
whether both code pathways are kept and selected at compile time.  Or
maybe the aggregation functions gain a low-precision flag for simple
double precision accumulation.

I'm curious to look at the performance and precision implications of e.g.
Kahan summation if no one has done that yet.  Some quick poking around
shows people using processor specific intrinsics to take advantage of
advanced multi-element instructions, but I imagine R would not want to do
that.  Assuming others have not done this already, I will have a look and
report back.
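
For reference, a minimal R sketch of the compensated (Kahan) algorithm;
an actual replacement would live in C, but the logic is the same:

    kahan_sum <- function(x) {
      s <- 0      # running sum
      comp <- 0   # error not yet absorbed into `s`
      for (xi in x) {
        y <- xi - comp        # apply the carried correction
        t <- s + y            # low-order bits of `y` may be lost here...
        comp <- (t - s) - y   # ...recover them algebraically
        s <- t
      }
      s
    }
    naive_sum <- function(x) { s <- 0; for (xi in x) s <- s + xi; s }

    x <- c(1, rep(1e-16, 1000))
    naive_sum(x) - 1  # 0: each 1e-16 is absorbed by plain double addition
    kahan_sum(x) - 1  # ~1e-13, agreeing with the long double result of sum()
    sum(x) - 1        # ~1e-13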

>> Since we're on the topic I want to point out that the default NA in R
>> starts off as a signaling NA:
>>
>> example(numToBits)   # for `bitC`
>> bitC(NA_real_)
>> ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
>> bitC(NA_real_ + 0)
>> ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010
>>
>> Notice the leading bit of the significand starts off as zero, which marks
>> it as a signaling NA, but becomes 1, i.e. non-signaling, after any
>> operation[2].
>>
>> This is meaningful because the mere act of loading a signaling NA into the
>> x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
>> actually used in arithmetic operations.  This happens sometimes under some
>> optimization levels.  I don't know of any benefit of starting off with a
>> signaling NA, especially since the encoding is lost pretty much as soon as
>> it is used.  If folks are interested I can provide a patch to turn the NA
>> quiet by default.
>
> In principle this might be a good idea, but the current bit pattern is
> unfortunately baked into a number of packages and documents on
> internals, as well as serialized objects. The work needed to sort that
> out is probably not worth the effort.

One reason why we might not need to sort this out is precisely the
instability shown above.  Anything that relies on the signaling bit set to
a particular value will behave differently with `NA_real_` vs
`NA_real_ + x`.  `R_IsNA` only checks the lower word, so it doesn't care
about the signaling bit or the 19 subsequent ones.  Anything that does
likely has unpredictable behavior **currently**.

Similarly, the documentation[1] only specifies the low word:

> On such platforms NA is represented by the NaN value with low-word 0x7a2
>

Re: [Rd] Workaround very slow NAN/Infinities arithmetic?

2021-09-30 Thread brodie gaslam via R-devel


André,

I'm not an R core member, but happen to have looked a little bit at this
issue myself.  I've seen similar things on Skylake and Coffee Lake 2
(9700, one generation past your latest) too.  I think it would make sense
to have some handling of this, although I would want to show the trade-off
with performance impacts on CPUs that are not affected by this, and on
vectors that don't actually have NAs and similar.  I think the performance
impact is likely to be small so long as branch prediction is active, but
since branch prediction is involved you might need to check with different
ratios of NAs (not for your NA bailout branch, but for e.g. interaction
of what you add and the existing `na.rm=TRUE` logic).

You'll also need to think of cases such as c(Inf, NA), c(NaN, NA), etc.,
which might complicate the logic a fair bit.

Presumably the x87 FPU will remain common for a long time, but if there
was reason to think otherwise, then the value of this becomes
questionable.

Either way, I would probably wait to see what R Core says.

For reference this 2012 blog post[1] discusses some aspects of the issue,
including that at least "historically" AMD was not affected.

Since we're on the topic I want to point out that the default NA in R
starts off as a signaling NA:

    example(numToBits)   # for `bitC`
    bitC(NA_real_)
    ## [1] 0 11111111111 | 0000000000000000000000000000000000000000011110100010
    bitC(NA_real_ + 0)
    ## [1] 0 11111111111 | 1000000000000000000000000000000000000000011110100010

Notice the leading bit of the significand starts off as zero, which marks
it as a signaling NA, but becomes 1, i.e. non-signaling, after any
operation[2].

This is meaningful because the mere act of loading a signaling NA into the
x87 FPU is sufficient to trigger the slowdowns, even if the NA is not
actually used in arithmetic operations.  This happens sometimes under some
optimization levels.  I don't know of any benefit of starting off with a
signaling NA, especially since the encoding is lost pretty much as soon as
it is used.  If folks are interested I can provide a patch to turn the NA
quiet by default.

Best,

B.

[1]: 
https://randomascii.wordpress.com/2012/05/20/thats-not-normalthe-performance-of-odd-floats/
[2]: https://en.wikipedia.org/wiki/NaN#Encoding





> On Thursday, September 30, 2021, 06:52:59 AM EDT, GILLIBERT, Andre 
>  wrote:
>
> Dear R developers,
>
> By default, R uses the "long double" data type to get extra precision for 
> intermediate computations, with a small performance tradeoff.
>
> Unfortunately, on all Intel x86 computers I have ever seen, long doubles 
> (implemented in the x87 FPU) are extremely slow whenever a special 
> representation (NA, NaN or infinities) is used; probably because it triggers 
> poorly optimized microcode in the CPU firmware. A function such as sum() 
> becomes more than hundred times slower!
> Test code:
> a=runif(1e7);system.time(for(i in 1:100)sum(a))
> b=a;b[1]=NA;system.time(sum(b))
>
> The slowdown factors are as follows on a few intel CPU:
>
> 1)  Pentium Gold G5400 (Coffee Lake, 8th generation) with R 64 bits : 140 
> times slower with NA
>
> 2)  Pentium G4400 (Skylake, 6th generation) with R 64 bits : 150 times 
> slower with NA
>
> 3)  Pentium G3220 (Haswell, 4th generation) with R 64 bits : 130 times 
> slower with NA
>
> 4)  Celeron J1900 (Atom Silvermont) with R 64 bits : 45 times slower with 
> NA
>
> I do not have access to more recent Intel CPUs, but I doubt that it has 
> improved much.
>
> Recent AMD CPUs have no significant slowdown.
> There is no significant slowdown on Intel CPUs (more recent than Sandy 
> Bridge) for 64 bits floating point calculations based on SSE2. Therefore, 
> operators using doubles, such as '+' are unaffected.
>
> I do not know whether recent ARM CPUs have slowdowns on FP64... Maybe 
> somebody can test.
>
> Since NAs are not rare in real-life, I think that it would worth an extra 
> check in functions based on long doubles, such as sum(). The check for 
> special representations do not necessarily have to be done at each iteration 
> for cumulative functions.
> If you are interested, I can write a bunch of patches to fix the main 
> functions using long doubles: cumsum, cumprod, sum, prod, rowSums, colSums, 
> matrix multiplication (matprod="internal").
>
> What do you think of that?
>
> --
> Sincerely
> André GILLIBERT
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Spurious warnings in coercion from double/complex/character to raw

2021-09-10 Thread brodie gaslam via R-devel


> On Friday, September 10, 2021, 03:13:54 PM EDT, Hervé Pagès 
>  wrote:
>
> Good catch, thanks!
>
> Replacing
>
>     if(ISNAN(vi) || (tmp = (int) vi) < 0 || tmp > 255) {
>         tmp = 0;
>         warn |= WARN_RAW;
>     }
>     pa[i] = (Rbyte) tmp;
>
> with
>
>     if(ISNAN(vi) || vi <= -1.0 || vi >= 256.0) {
>         tmp = 0;
>         warn |= WARN_RAW;
>     } else {
>         tmp = (int) vi;
>     }
>     pa[i] = (Rbyte) tmp;
>
> should address that.
>
> FWIW IntegerFromReal() has a similar risk of int overflow
> (src/main/coerce.c, lines 128-138):
>
>     int attribute_hidden
>     IntegerFromReal(double x, int *warn)
>     {
>         if (ISNAN(x))
>             return NA_INTEGER;
>         else if (x >= INT_MAX+1. || x <= INT_MIN ) {
>             *warn |= WARN_INT_NA;
>             return NA_INTEGER;
>         }
>         return (int) x;
>     }
>
>
>
> The cast to int will also be an int overflow situation if x is > INT_MAX
> and < INT_MAX+1 so the risk is small!

I might be being dense, but it feels like this isn't a problem?  Quoting
C99 6.3.1.4 again (emph added):

> When a finite value of real floating type is converted to an integer
> type other than _Bool, **the fractional part is discarded** (i.e., the
> value is truncated toward zero). If the value of the integral part
> cannot be represented by the integer type, the behavior is undefined.50)

Does the "fractional part is discarded" not save us here?

Best,

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel
> On Sunday, June 20, 2021, 9:29:28 PM EDT, brodie gaslam via R-devel 
>  wrote:
>
>> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>>  wrote:
>>
>> The max width of a string is .Machine$integer.max-1:
>
> I think the max width is .Machine$integer.max.  What happened below is a
> bug due to buffer overflow in `strrep`:

Sorry, integer overflow.

>> # works
>> x = strrep(" ", .Machine$integer.max-1L)
>> # fails
>> x = strrep(" ", .Machine$integer.max)
>> Error in strrep(" ", .Machine$integer.max) :
>>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)
>> (see also the comment in src/main/character.c: "Character strings in R
>> are less than 2^31-1 bytes, so we use int not size_t.")
>
> FWIW WRE states:
>
>> Note that R character strings are restricted to 2^31 - 1 bytes
>
> This is INT_MAX or .Machine$integer.max, at least on machines for which
> `int` is 32 bits, which I think is typical for machines R builds on.  From
> having looked at the code a while ago I think WRE is right (so maybe the
> comment in the code is wrong), but it was a while ago and I haven't tried
> to allocate an INT_MAX long string.

So I tried it on a machine with more memory, and it works:

    > x <- strrep(" ", .Machine$integer.max-1L)
    > x <- paste0(x, " ")
    > nchar(x)
    [1] 2147483647
    > nchar(x) == .Machine$integer.max
    [1] TRUE

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel


> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>  wrote:
>
> Currently, substring defaults to last=100L, which strongly
> suggests the intent is to default to "nchar(x)" without having to
> compute/allocate that up front.
>
> Unfortunately, this default makes no sense for "very large" strings
> which may exceed 100L in "width".
>
> The max width of a string is .Machine$integer.max-1:

I think the max width is .Machine$integer.max.  What happened below is a
bug due to buffer overflow in `strrep`:

> # works
> x = strrep(" ", .Machine$integer.max-1L)
> # fails
> x = strrep(" ", .Machine$integer.max)
> Error in strrep(" ", .Machine$integer.max) :
>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

Notice the very large number that was tried to be Calloc'ed.  That's
(size_t) -1.

The problem is (src/include/R_ext/RS.h@85):

    #define CallocCharBuf(n) (char *) R_chk_calloc((R_SIZE_T) ((n)+1), 
sizeof(char))

The `((n) + 1)` overflows `int` and produces -1 (well, undefined behavior
so who knows), which when cast to size_t produces that very large number
which can't be allocated.

I think this should be:

    #define CallocCharBuf(n) (char *) R_chk_calloc(((R_SIZE_T)(n))+1, 
sizeof(char))

I can reproduce the failure before the change.  After the change I get:

    > x = strrep(" ", .Machine$integer.max)
    Error in strrep(" ", .Machine$integer.max) :
      'Calloc' could not allocate memory (2147483648 of 1 bytes)

I believe this to be the expected result on a machine that doesn't have
enough memory to allocate INT_MAX + 1 bytes, as happens to be the case on
my R build system (it's a VM that gets 2GB total as the host machine can
barely spare that to begin with).

> (see also the comment in src/main/character.c: "Character strings in R
> are less than 2^31-1 bytes, so we use int not size_t.")

FWIW WRE states:

> Note that R character strings are restricted to 2^31 - 1 bytes

This is INT_MAX or .Machine$integer.max, at least on machines for which
`int` is 32 bits, which I think is typical for machines R builds on.  From
having looked at the code a while ago I think WRE is right (so maybe the
comment in the code is wrong), but it was a while ago and I haven't tried
to allocate an INT_MAX long string.

Sorry this doesn't answer your original question.

Best,

Brodie.

>
>
> So it seems to me either .Machine$integer.max or
> .Machine$integer.max-1L would be a more sensible default. Am I missing
> something?
>
> Mike C
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to get utf8 string using R externals

2021-06-02 Thread brodie gaslam via R-devel


> On Wednesday, June 2, 2021, 7:58:54 PM EDT, xiaoyan yu  
> wrote:
>
> I am using gmail. Not sure of the configuration of plain text.
> The memory pointed to by the char * as the output of Rf_translateChar() is
> actually the string "<U+BD80><U+C2E4>".

Hi Xiaoyan,

Unfortunately I'm not super familiar with R on Windows, but I think
I can provide a simpler reproducible example.  In Rgui, if I type "\UBD80"
at the prompt and hit enter, I see the desired glyph.  In Rterm I see the
unicode escape.

IIRC the capabilities of Rterm and Rgui are different, and UTF8 support
in windows is limited.  Tomas Kalibera discusses this in some detail:

https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html

In terms of `Rf_translateChar()`, presumably the `Riconv` call is failing
on Rterm, but not on Rgui:

https://github.com/r-devel/r-svn/blob/master/src/main/sysutils.c#L924

I'm guessing, but that would explain why the C level string is in that
format.  I don't know why the string would translate in Rgui though.  My
guess is that it did not, as even in Rgui the following:

    enc2native("\uBD80")

Produces the escaped version of the string.

As others have suggested you could try the experimental UCRT Windows release:

https://developer.r-project.org/Blog/public/2021/03/12/windows/utf-8-toolchain-and-cran-package-checks/index.html

Install instructions (focus on Binary installer):

https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/ucrt3/howto.html

If I try UCRT on my system this no longer produces the escape:

    enc2native("\uBD80")

Although all I see is a question mark.  My guess is that my code page or
something similar is not set right.  Examining with `charToRaw` reveals
the string remains in UTF-8 encoding.

Aside: it's not clear to me that you need to translate the string if your
intent is for it to remain UTF-8.  You just don't seem to be set-up to
interpret UTF-8 strings currently.
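
E.g., a quick sketch of that kind of check:

    x <- "\uBD80"
    charToRaw(x)   # eb b6 80: the UTF-8 bytes for U+BD80
    Encoding(x)    # "UTF-8"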

Best,

B

> On Wed, Jun 2, 2021 at 6:09 PM David Winsemius 
> wrote:
>
>> First; you should configure yopu mail client to send plain text.
>>
>> Can you explain what is meant by:
>>
>> the characters are unicodes (<U+BD80><U+C2E4>) instead of
>> utf8 encoding of the korean characters 부실.
>>
>> As far as I can tell those two unicodes _are_ the utf8 encodings of 부실.
>>
>> You may need to consult a couple of R help pages. I suggest:
>>
>> ?Quotes
>> ?points  # has examples of changing fonts used for display on console.
>>
>> Sorry if I've misunderstood. I'm not on a Windows device, so  posting the
>> C++ program won't be helpful, but maybe it would for other prospective
>> respondents.
>>
>> --
>> David.
>>
>> On 6/2/21 1:33 PM, xiaoyan yu wrote:
>> > I have a R Script Predict.R:
>> >  set.seed(42)
>> >  C <- seq(1:1000)
>> >  A <- rep(seq(1:200),5)
>> >  E <- (seq(1:1000) * (0.8 + (0.4*runif(50, 0, 1
>> >  L <- ifelse(runif(1000)>.5,1,0)
>> >  df <- data.frame(cbind(C, A, E, L))
>> > load("C:/Temp/tree.RData")    #  load the model for scoring
>> >
>> >    P <- as.character(predict(tree_model_1,df,type='class'))
>> >
>> > Then in a C++ program
>> > I call eval to evaluate the script and then findVar the P variable.
>> > After get each class label from P using string_elt and then
>> > Rf_translateChar, the characters are unicodes (<U+BD80><U+C2E4>)
>> > instead of utf8 encoding of the korean characters 부실.
>> > Can I know how to get UTF8 by using R externals?
>> >
>> > I also found the same script giving utf8 characters in RGui but unicode
>> in
>> > Rterm.
>> > I tried to attach a screenshot but got message "The message's content
>> type
>> > was not explicitly allowed"
>> > In RGui, I saw the output 부실, while in Rterm, <U+BD80><U+C2E4>.
>> >
>> > Please help.
>> >
>> >  [[alternative HTML version deleted]]
>> >
>> > __
>> > R-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
>>
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
> On Sunday, May 23, 2021, 10:45:22 AM EDT, Adrian Dușa  
> wrote:
>
> On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel 
>  wrote:
> > I should add, I don't know that you can rely on this
> > particular encoding of R's NA.  If I were trying to restore
> > an NA from some external format, I would just generate an
> > R NA via e.g NA_real_ in the R session I'm restoring the
> > external data into, and not try to hand assemble one.
>
> Thanks for your answer, Brodie, especially on Sunday (much appreciated).

Maybe I shouldn't answer on Sunday given I've said several wrong things...

> The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
> referring to an NA_real_ of course), as seen in action here:
> https://github.com/tidyverse/haven/blob/master/src/tagged_na.c
>
> That code:
> - preserves the first part 0x7ff0
> - preserves the last part 1954
> - adds one additional byte to store (tag) a character provided in the SEXP 
> vector
>
> That is precisely my understanding, that doubles starting with 0x7ff are
> all NaNs. My question was related to the additional part 1954 from the
> low bits: why does it need 32 bits?

It probably doesn't need 32 bits.  The code is trying to set all 64 bits.
It seems natural to do the high 32 bit, and then the low.  But I'm not R
Core so don't listen to me too closely.

> The binary value of 1954 is 11110100010, which is represented by 11 bits
> occupying at most 2 bytes... So why does it need 4 bytes?
>
> Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752,
> or the binary 111111111110000.

You are right, I had a moment and wrongly counted hex digits as bytes
instead of half-bytes.

> That is just about enough to fit in the available 16 bits (actually 15
> to leave one for the sign bit), so I don't really understand why it
> would. And in > any case, the union definition uses an unsigned short
> which (if my understanding is correct) should certainly not overflow:
>
> typedef union
> {
> double value;
> unsigned short word[4];
> } ieee_double;
>
> What is gained with this proposal: 16 additional bits to do something
> with. For the moment, only 16 are available (from the lower part of the
> high 32 bits). If the value 1954 would be checked as a short instead of
> an int, the other 16 bits would become available. And those bits could
> be extremely valuable to tag multi-byte characters, for instance, but
> also higher numbers than 32767.

Note that the stability of the payload portion of NaNs is questionable:

https://developer.r-project.org/Blog/public/2020/11/02/will-r-work-on-apple-silicon/#nanan-payload-propagation

Also, if I understand correctly, you would be asking R core to formalize
the internal representation of the R NA, which I don't think is public?
So that you can use those internal bits for your own purposes with a
guarantee that R will not disturb them?  Obviously only they can answer
that.

Apologies for confusing the issue.

B,

PS: the other obviously wrong thing I said was that the NA was 0x7ff0 &
1954 when it is really 0x7ff00000 & 1954.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
I should add, I don't know that you can rely on this
particular encoding of R's NA.  If I were trying to restore
an NA from some external format, I would just generate an
R NA via e.g NA_real_ in the R session I'm restoring the 
external data into, and not try to hand assemble one.

Best,

B.


On Sunday, May 23, 2021, 9:23:54 AM EDT, brodie gaslam via R-devel 
 wrote: 





This is because the NA in question is NA_real_, which
is encoded in double precision IEEE-754, which uses
64 bits.  The "1954" is just part of the NA.  The NA
must also conform to the NaN encoding for double precision
numbers, which requires that the "beginning" portion of
the number be "0x7ff0" (well, I think it should be "0x7ff8"
but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision
value.  They are just accessed this way to make it easy to
set the high bits (63-32) and the low bits (31-0).

So NA is not just 1954, it's 0x7ff0 & 1954 (note I'm
mixing hex and decimals here).

In IEEE 754 double precision encoding numbers that start
with 0x7ff are all NaNs.  The rest of the number except for
the first bit which designates "quiet" vs "signaling" NaNs can
be anything.  R has taken advantage of that to designate the
R NA by setting the lower bits to be 1954.

Note I'm being pretty loose about endianess, etc. here, but
hopefully this conveys the problem.
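
You can see the layout from R itself (a sketch; base R's numToBits()
returns bits least-significant first, hence the rev()):

    bits <- rev(as.integer(numToBits(NA_real_)))  # MSB first
    paste(bits[2:12], collapse = "")   # exponent: all ones, so a NaN
    ## [1] "11111111111"
    paste(bits[54:64], collapse = "")  # low 11 significand bits: 1954
    ## [1] "11110100010"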

In terms of your proposal, I'm not entirely sure what you gain.
You're still attempting to generate a 64 bit representation
in the end.  If all you need is to encode the fact that there
was an NA, and restore it later as a 64 bit NA, then you can do
whatever you want so long as the end result conforms to the
expected encoding.

In terms of using 'short' here (which again, I don't see the
need for as you're using it to generate the final 64 bit encoding),
I see two possible problems.  You're adding the dependency that
short will be 16 bits.  We already have the (implicit) assumption
in R that double is 64 bits, and explicit that int is 32 bits.
But I think you'd be going a bit on a limb assuming that short
is 16 bits (not sure).  More important, if short is indeed 16 bits,
I think in:

    x.word[hw] = 0x7ff0;

You overflow short.

Best,

B.



On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa  
wrote: 





Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
    volatile ieee_double x;
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.

Best wishes,
-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu

    [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] 1954 from NA

2021-05-23 Thread brodie gaslam via R-devel
This is because the NA in question is NA_real_, which
is encoded in double precision IEEE-754, which uses
64 bits.  The "1954" is just part of the NA.  The NA
must also conform to the NaN encoding for double precision
numbers, which requires that the "beginning" portion of
the number be "0x7ff0" (well, I think it should be "0x7ff8"
but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision
value.  They are just accessed this way to make it easy to
set the high bits (63-32) and the low bits (31-0).

So NA is not just 1954, it's 0x7ff0 & 1954 (note I'm
mixing hex and decimals here).

In IEEE 754 double precision encoding numbers that start
with 0x7ff are all NaNs.  The rest of the number except for
the first bit which designates "quiet" vs "signaling" NaNs can
be anything.  R has taken advantage of that to designate the
R NA by setting the lower bits to be 1954.

Note I'm being pretty loose about endianess, etc. here, but
hopefully this conveys the problem.

In terms of your proposal, I'm not entirely sure what you gain.
You're still attempting to generate a 64 bit representation
in the end.  If all you need is to encode the fact that there
was an NA, and restore it later as a 64 bit NA, then you can do
whatever you want so long as the end result conforms to the
expected encoding.

In terms of using 'short' here (which again, I don't see the
need for as you're using it to generate the final 64 bit encoding),
I see two possible problems.  You're adding the dependency that
short will be 16 bits.  We already have the (implicit) assumption
in R that double is 64 bits, and explicit that int is 32 bits.
But I think you'd be going a bit on a limb assuming that short
is 16 bits (not sure).  More important, if short is indeed 16 bits,
I think in:

    x.word[hw] = 0x7ff0;

You overflow short.

Best,

B.



On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa  
wrote: 





Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
    volatile ieee_double x;
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.

Best wishes,
-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu

    [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Difference in NA behavior in R-devel running under valgrind

2021-04-29 Thread brodie gaslam via R-devel
Forgot to mention, my builds were not instrumented for valgrind, and also:

vagrant@vagrant:/vagrant/trunk$ ./bin/R --version
R Under development (unstable) (2021-04-27 r80232) -- "Unsuffered Consequences"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)


 On Thursday, April 29, 2021, 8:40:21 PM EDT, brodie gaslam via R-devel 
 wrote: 


> On Thursday, April 29, 2021, 6:35:16 PM EDT, Winston Chang 
>  wrote:
> Just to be clear, the RD binary that Jon used was NOT compiled with
> Valgrind level 2 instrumentation. In his example, however, he did run it
> with valgrind, as in:
>
> # RD -d valgrind --quiet -e "sum(c(1, NA))"
> ...
> > sum(c(1, NA))
> [1] NaN
>
> `RD` in that Docker image is a standard build of R-devel. The Docker
> image does include a build of R-devel with valgrind level 2
> instrumentation, called `RDvalgrind`. It exhibits the same NaN behavior
> when run with `-d valgrind`, but when run without `-d valgrind` it
> returns NA.
>
> # RDvalgrind -d valgrind --quiet -e "sum(c(1, NA))"
> > sum(c(1, NA))
> [1] NaN
>
> # RDvalgrind --quiet -e "sum(c(1, NA))"
> > sum(c(1, NA))
> [1] NA
>
> In short `RDvalgrind` behaves the same as `RD`. However, adding `-d
> valgrind` to either one causes it to return NaN.
>
> -Winston

Thanks for the clarification.

Using the same revision in the e-mail I get the reported issue under -O0,
but not under -O2.  It looks like r-debug builds with -O0, and presumably
the release debian version is -O2 which is usually the default?  I'm
certainly no expert in this area but it seems like it could be an
optimization level rather than a release vs devel issue?

This is what I see:

gcc version 10.1.0 (Ubuntu 10.1.0-2ubuntu1~18.04)

    CC="gcc-10"
    CFLAGS="-g -O0 -Wall -pedantic -Wextra -std=gnu99"

    vagrant@vagrant:/vagrant/trunk$ ./bin/R -d valgrind --vanilla --quiet -e 
"sum(c(1,NA))"
    ==9020== Memcheck, a memory error detector
    ==9020== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==9020== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==9020== Command: /vagrant/trunk/bin/exec/R --vanilla --quiet -e 
sum(c(1,NA))
    ==9020==
    > sum(c(1,NA))
    [1] NaN

And then:

    CC="gcc-10"
    CFLAGS="-g -O2 -Wall -pedantic -Wextra -std=gnu99"

    vagrant@vagrant:/vagrant/trunk$ ./bin/R -d valgrind --vanilla --quiet -e 
"sum(c(1,NA))"
    ==32751== Memcheck, a memory error detector
    ==32751== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==32751== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==32751== Command: /vagrant/trunk/bin/exec/R --vanilla --quiet -e 
sum(c(1,NA))
    ==32751==
    > sum(c(1,NA))
    [1] NA

Sadly I did not think to run the -O0 version without valgrind.


Best,

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Difference in NA behavior in R-devel running under valgrind

2021-04-29 Thread brodie gaslam via R-devel
 > On Thursday, April 29, 2021, 6:35:16 PM EDT, Winston Chang 
 >  wrote:
> Just to be clear, the RD binary that Jon used was NOT compiled with
> Valgrind level 2 instrumentation. In his example, however, he did run it
> with valgrind, as in:
>
> # RD -d valgrind --quiet -e "sum(c(1, NA))"
> ...
> > sum(c(1, NA))
> [1] NaN
>
> `RD` in that Docker image is a standard build of R-devel. The Docker
> image does include a build of R-devel with valgrind level 2
> instrumentation, called `RDvalgrind`. It exhibits the same NaN behavior
> when run with `-d valgrind`, but when run without `-d valgrind` it
> returns NA.
>
> # RDvalgrind -d valgrind --quiet -e "sum(c(1, NA))"
> > sum(c(1, NA))
> [1] NaN
>
> # RDvalgrind --quiet -e "sum(c(1, NA))"
> > sum(c(1, NA))
> [1] NA
>
> In short `RDvalgrind` behaves the same as `RD`. However, adding `-d
> valgrind` to either one causes it to return NaN.
>
> -Winston

Thanks for the clarification.

Using the same revision in the e-mail I get the reported issue under -O0,
but not under -O2.  It looks like r-debug builds with -O0, and presumably
the release debian version is -O2 which is usually the default?  I'm
certainly no expert in this area but it seems like it could be an
optimization level rather than a release vs devel issue?

This is what I see:

gcc version 10.1.0 (Ubuntu 10.1.0-2ubuntu1~18.04)

    CC="gcc-10"
    CFLAGS="-g -O0 -Wall -pedantic -Wextra -std=gnu99"

    vagrant@vagrant:/vagrant/trunk$ ./bin/R -d valgrind --vanilla --quiet -e 
"sum(c(1,NA))"
    ==9020== Memcheck, a memory error detector
    ==9020== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==9020== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==9020== Command: /vagrant/trunk/bin/exec/R --vanilla --quiet -e 
sum(c(1,NA))
    ==9020==
    > sum(c(1,NA))
    [1] NaN

And then:

    CC="gcc-10"
    CFLAGS="-g -O2 -Wall -pedantic -Wextra -std=gnu99"

    vagrant@vagrant:/vagrant/trunk$ ./bin/R -d valgrind --vanilla --quiet -e 
"sum(c(1,NA))"
    ==32751== Memcheck, a memory error detector
    ==32751== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==32751== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
    ==32751== Command: /vagrant/trunk/bin/exec/R --vanilla --quiet -e 
sum(c(1,NA))
    ==32751==
    > sum(c(1,NA))
    [1] NA

Sadly I did not think to run the -O0 version without valgrind.

Best,

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Difference in NA behavior in R-devel running under valgrind

2021-04-29 Thread brodie gaslam via R-devel
NA propagation is complicated.  I don't know whether what
you observe could be explained by the difference between
a valgrind instrumented vs. not version of R (I gather the
release version you used is not instrumented / possibly
compiled differently too from the github issue?).

Hopefully someone more knowledgeable would chime in, but
I did want to share Tomas's post that touches on the
complexity of NA_real_ propagation:

https://developer.r-project.org/Blog/public/2020/11/02/will-r-work-on-apple-silicon/#nanan-payload-propagation

Best,

B.






On Thursday, April 29, 2021, 6:04:38 PM EDT, Jonathan Keane  
wrote: 





Hello,

I'm debugging some valgrind issues, and noticed some odd behavior with
NA an R-devel under valgrind.

Using Winston Chang's r-debug image (and some of this reproductions form [1]):

r-devel (2021-04-27 r80232) without Valgrind returns NA:
# RD --quiet -e "sum(c(1, NA))"
> sum(c(1, NA))
[1] NA


r-devel with `-d valgrind` returns NaN:
# RD -d valgrind --quiet -e "sum(c(1, NA))"
==8901== Memcheck, a memory error detector
==8901== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==8901== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==8901== Command: /usr/local/RD/lib/R/bin/exec/R --quiet -e sum(c(1,~+~NA))
==8901==
> sum(c(1, NA))
[1] NaN


And finally release R (with valgrind) returns NA just like r-devel
while not under Valgrind:
# R -d valgrind --quiet -e "sum(c(1, NA))"
==8983== Memcheck, a memory error detector
==8983== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==8983== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==8983== Command: /usr/lib/R/bin/exec/R --quiet -e sum(c(1,~+~NA))
==8983==
> sum(c(1, NA))
[1] NA

Thanks

[1] - https://github.com/wch/r-debug/issues/18

-Jon

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.list fails on functions with S3 classes

2021-04-28 Thread brodie gaslam via R-devel


> On Wednesday, April 28, 2021, 5:16:20 PM EDT, Gabriel Becker 
>  wrote:
>
> Hi Antoine,
>
> I would say this is the correct behavior. S3 dispatch is solely (so far as
> I know?) concerned with the "actual classes" on the object. This is because
> S3 classes act as labels that inform dispatch what, and in what order,
> methods should be applied. You took the function class (ie label) off of
> your object, which means that in the S3 sense, that object is no longer a
> function and dispatching to function methods for it would be incorrect.
> This is independent of whether the object is still callable "as a function".
>
> The analogous case for non-closures to what you are describing would be for
> S3 to check mode(x) after striking out with class(x) to find relevant
> methods. I don't think that would be appropriate.

I would think of the general case to be to check `class(unclass(x))` on
strike-out.  This would then include things such as "matrix", etc.
Dispatching on the implicit class as fallback seems like a natural thing
to do in a language that dispatches on implicit class when there is none.
After all, once you've struck out of your explicit classes, you have
none left!
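
To make that concrete, a hypothetical sketch of the fallback order being
suggested (`dispatch_classes` is illustrative only, not how `UseMethod`
actually works):

    dispatch_classes <- function(x) {
      explicit <- attr(x, "class")   # NULL if unclassed
      implicit <- class(unclass(x))  # e.g. "function", "matrix", "integer"
      unique(c(explicit, implicit, "default"))
    }
    add1 <- function(x) x + 1
    class(add1) <- "foo"
    dispatch_classes(add1)
    ## [1] "foo"      "function" "default"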

This does happen naturally in some places (e.g. interacting with a
data.frame as a list), and is quite delightful (usually).  I won't get
into an argument about what the documentation states or whether any changes
should be made, but to me, that dispatch doesn't end with the implicit
class feels like a logical wrinkle.  Yes, I can twist my brain to
see how it can be made to make sense, but I don't like it.

A fun past conversation on this very topic:

https://stat.ethz.ch/pipermail/r-devel/2019-March/077457.html

Best,

B.

> Also, as an aside, if you want your class to override methods that exist
> for function you would want to set the class to c("foo", "function"), not
> c("function", "foo"), as you had it in your example.
>
> Best,
> ~G
>
> On Wed, Apr 28, 2021 at 1:45 PM Antoine Fabri 
> wrote:
>
>> Dear R devel,
>>
>> as.list() can be used on functions, but not if they have a S3 class that
>> doesn't include "function".
>>
>> See below :
>>
>> ```r
>> add1 <- function(x) x+1
>>
>> as.list(add1)
>> #> $x
>> #>
>> #>
>> #> [[2]]
>> #> x + 1
>>
>> class(add1) <- c("function", "foo")
>>
>> as.list(add1)
>> #> $x
>> #>
>> #>
>> #> [[2]]
>> #> x + 1
>>
>> class(add1) <- "foo"
>>
>> as.list(add1)
>> #> Error in as.vector(x, "list"): cannot coerce type 'closure' to vector of
>> type 'list'
>>
>> as.list.function(add1)
>> #> $x
>> #>
>> #>
>> #> [[2]]
>> #> x + 1
>> ```
>>
>> In failing case the argument is dispatched to as.list.default instead of
>> as.list.function.
>>
>> (1) Shouldn't it be dispatched to as.list.function ?
>>
>> (2) Shouldn't all generics when applied on an object of type closure fall
>> back to the `fun.function` method  before falling back to the `fun.default`
>> method ?
>>
>> Best regards,
>>
>> Antoine



Re: [Rd] R crashes when using huge data sets with character string variables

2020-12-12 Thread brodie gaslam via R-devel
 > On Saturday, December 12, 2020, 6:33:33 PM EST, Ben Bolker 
 >  wrote:
>
>  On Windows you can use memory.limit.
>
> https://stackoverflow.com/questions/12582793/limiting-memory-usage-in-r-under-linux
>
>    Not sure how much that helps.
>
>On 12/12/20 6:19 PM, Arne Henningsen wrote:
>> When working with a huge data set with character string variables, I
>> experienced that various commands let R crash. When I run R in a
>> Linux/bash console, R terminates with the message "Killed". When I use
>> RStudio, I get the message "R Session Aborted. R encountered a fatal
>> error. The session was terminated. Start New Session". If an object in
>> the R workspace needs too much memory, I would expect that R would not
>> crash but issue an error message "Error: cannot allocate vector of
>> size ...".  A minimal reproducible example (at least on my computer)
>> is:
>>
>> nObs <- 1e9
>>
>> date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,
>> 1, 12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )
>>
>> Is this a bug or a feature of R?

On OS X I see:

    > nObs <- 1e9
    >  date <- paste( round( runif( nObs, 1981, 2015 ) ), round( runif( nObs,1, 
12 ) ), round( runif( nObs, 1, 31 ) ), sep = "-" )
    Error: vector memory exhausted (limit reached?)
    > sessionInfo()
    R version 4.0.3 (2020-10-10)
    Platform: x86_64-apple-darwin17.0 (64-bit)
    Running under: macOS Catalina 10.15.7

Which is what I would expect.  I don't doubt the error you've seen, just
providing a data point for whoever ends up looking into this further.
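
Presumably the limit behind the message above is the one controlled on
macOS by the R_MAX_VSIZE environment variable (see ?Memory), so it can be
raised or lowered, e.g. via ~/.Renviron:

    ## In ~/.Renviron (not R code), e.g.:
    ##   R_MAX_VSIZE=16Gb
    Sys.getenv("R_MAX_VSIZE")  # query the current setting; "" if unset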

Best,

Brodie.



Re: [Rd] sys.call() 's srcref doesn't match the language

2020-09-02 Thread brodie gaslam via R-devel
>On Wednesday, September 2, 2020, 6:19:20 PM EDT, Lionel Henry 
> wrote:
>
>Hello,
>
>The source references are useful for debugging tools because they
>allow linking to call sites in the source files.
>
>I agree the output can be confusing. Perhaps this could be fixed by
>tweaking the print method for calls. If the deparsed call doesn't
>match the srcref, both could be displayed along with file:line:column.
>
>```
>#> f()
>#> 
>#> 1 + f()
>```
>
>Best,
>Lionel

Why print the mismatched srcref at all?  I find that confusing.
Just omit the srcref from display. Debugging tools can
still retrieve it and use the information, presumably.
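
For instance (a sketch, assuming the call was parsed with srcrefs
retained):

    options(keep.source = TRUE)
    f <- function() sys.call()
    cl <- f()
    attr(cl, "srcref")  # tools can still pull the srcref off the call...
    print(cl)           # ...even if print() were to omit it from display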

Best,

Brodie.



[Rd] eval and Calling Frames

2020-05-31 Thread brodie gaslam via R-devel
I ran into an interesting issue with `evalq` (and also
`eval(quote(...))`):

    f <- function() {
      list(
        sys.parent(1),
        evalq(sys.parent(1)),
        evalq((function() sys.parent(2))()),  # add an anon fun layer
        evalq((function() sys.parent(1))())
      )
    }
    res <- f()
    str(res)
    ## List of 4
    ##  $ : int 0 # sys.parent(1)
    ##  $ : int 2 # evalq(sys.parent(1))
    ##  $ : int 0 # evalq((function() sys.parent(2))())
    ##  $ : int 1 # evalq((function() sys.parent(1))())

In order of least to most surprising:

1. `sys.parent(1)` and `evalq(sys.parent(1))` are not the same
2. `evalq(sys.parent(1))` and `evalq((function() sys.parent(2))())`
   are not the same
3. `evalq((function() sys.parent(1))())` returns a lower frame number
   than `evalq(sys.parent(1))`

The root cause of this is that the `evalq` **closure** sets a context,
but then the C-level `do_eval` it invokes sets another one[1] with the
new `evalq` context as the calling frame (`cptr->sysparent`)[2].  This
then interacts with how `sys.parent` resolves parents when a target
frame appears more than once in the context stack.  `sys.parent`
returns the oldest context that matches[3], and in this case `f`'s
frame appears twice because `evalq` adds it via `do_eval`.

One option is to change what the `sysparent` of the `evalq` `envir` is.
For example, if we set it to be the same as it would be for commands
outside the `evalq`, we get:

 str(res)
 ## List of 4
 ##  $ : int 0 # sys.parent(1)
 ##  $ : int 0 # evalq(sys.parent(1))
 ##  $ : int 0 # evalq((function() sys.parent(2))())
 ##  $ : int 1 # evalq((function() sys.parent(1))())

There is precedent for doing this in S3 generics and their methods
where method `sysparent` is set to be that of the generic.  Now
`evalq` no longer interferes with the resolution of calling frames.
It seems reasonable to set evaluation environments without affecting
what the calling frame is. Indeed that happens when we do something like
`environment(fun) <- blah` as the calling frame is unaffected when `fun` is
invoked.
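
A small illustration of that S3 precedent (a sketch; the frame number
assumes a fresh top-level call):

    gen <- function(x) UseMethod("gen")
    gen.default <- function(x) sys.parent()
    wrapper <- function() gen(1)
    wrapper()
    ## [1] 1  # the method reports wrapper()'s frame, not the generic's,
    ##        # because its sysparent is inherited from the generic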

I attach a patch that implements this change.  The patch is a
hack-job intended solely for illustrative purposes, though it does
pass `make check-all` on a current version of r-devel.  I also ran the
`rlang` tests as those probably push the envelope in this area.  Only
one of them failed, with 2,613 passing.  The failed one is for a
deprecated function that was specifically checking for the repeated
`evalq` contexts[7].

I also attach a document with additional examples and commentary for
those interested.

Best,

Brodie.

PS: for a loosely related issue see #15531[8].

[1]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/eval.c#L3329
[2]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/context.c#L260
[3]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/context.c#L433
[4]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/eval.c#L1815
[5]: https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Contexts
[6]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15531
[7]: 
https://github.com/r-lib/rlang/blob/v0.4.6/tests/testthat/test-retired.R#L437
[8]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15531


Index: src/library/base/R/eval.R
===
--- src/library/base/R/eval.R   (revision 78619)
+++ src/library/base/R/eval.R   (working copy)
@@ -23,7 +23,7 @@
 function(expr, envir = parent.frame(),
 enclos = if(is.list(envir) || is.pairlist(envir))
parent.frame() else baseenv())
-.Internal(eval(expr, envir, enclos))
+.Internal(eval(expr, envir, enclos, parent.frame(2L)))
 
 eval.parent <- function(expr, n = 1) {
 p <- parent.frame(n + 1)
@@ -33,7 +33,7 @@
 evalq <-
 function (expr, envir = parent.frame(), enclos = if (is.list(envir) ||
 is.pairlist(envir)) parent.frame() else baseenv())
-  .Internal(eval(substitute(expr), envir, enclos))
+ .Internal(eval(substitute(expr), envir, enclos, parent.frame(2L)))
 
 new.env <- function (hash = TRUE, parent = parent.frame(), size = 29L)
 .Internal(new.env(hash, parent, size))
Index: src/library/base/baseloader.R
===
--- src/library/base/baseloader.R   (revision 78619)
+++ src/library/base/baseloader.R   (working copy)
@@ -97,7 +97,7 @@
 
 ..lazyLoad(basedb, baseenv())
 
-}), .Internal(new.env(FALSE, baseenv(), 29L)), baseenv()))
+}), .Internal(new.env(FALSE, baseenv(), 29L)), baseenv(), baseenv()))
 
 ## keep in sync with R/zzz.R
 as.numeric <- as.double
Index: src/main/eval.c
===
--- src/main/eval.c (revision 78619)
+++ src/main/eval.c (working copy)
@@ -3267,7 +3267,7 @@
 
 SEXP attribute_hidden do_eva

[Rd] R-ints context documentation

2020-05-26 Thread brodie gaslam via R-devel
In 1.4 Contexts[1], should the following:

> Note that whilst calls to closures and builtins set a context, 
> those to special internal functions never do.

Be something like:

> Note that whilst calls to closures always set a context,
> those to builtins only set a context under profiling
> or if they are of the foreign variety (e.g `.C` and similar),
> and those to special internal functions never do.

Based on the 'eval.c' source[2].

Best,

Brodie

[1]: https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Contexts
[2]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/eval.c#L821



Re: [Rd] paste(character(0), collapse="", recycle0=FALSE) should be ""

2020-05-22 Thread brodie gaslam via R-devel
> On Friday, May 22, 2020, 6:16:45 PM EDT, Hervé Pagès  
> wrote:
>
> Gabe,
>
> It's the current behavior of paste() that is a major source of bugs:
>
>   ## Add "rs" prefix to SNP ids and collapse them in a
>   ## comma-separated string.
>   collapse_snp_ids <- function(snp_ids)
>   paste("rs", snp_ids, sep="", collapse=",")
>
>   snp_groups <- list(
> group1=c(55, 22, 200),
> group2=integer(0),
> group3=c(99, 550)
>   )
>
>   vapply(snp_groups, collapse_snp_ids, character(1))
>   #    group1    group2    group3
>   # "rs55,rs22,rs200"  "rs"  "rs99,rs550"
>
> This has hit me so many times!
>
> Now with 'recycle0=TRUE', we finally have the opportunity to make it do
> the right thing. Let's not miss that opportunity.
>
> Cheers,
> H.

FWIW what convinces me is consistency with other aggregating functions applied
to zero length inputs:

sum(numeric(0))
## [1] 0
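
Other reductions do the same, returning the identity element of their
operation; "" is the analogous identity for string concatenation:

    prod(numeric(0))  # [1] 1     -- identity of `*`
    all(logical(0))   # [1] TRUE  -- identity of `&&`
    any(logical(0))   # [1] FALSE -- identity of `||`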

>
>
> On 5/22/20 11:26, Gabriel Becker wrote:
> > I understand that this is consistent but it also strikes me as an
> > enormous 'gotcha' of a magnitude that 'we' are trying to avoid/smooth
> > over at this point in user-facing R space.
> >
> > For the record I'm not suggesting it should return something other than
> > "", and in particular I'm not arguing that any call to paste /that does
> > not return an error/ with non-NULL collapse should return a character
> > vector of length one.



Re: [Rd] Precision of function mean,bug?

2020-05-20 Thread brodie gaslam via R-devel
 > On Wednesday, May 20, 2020, 7:00:09 AM EDT, peter dalgaard 
 >  wrote:
>
> Expected, see FAQ 7.31.
>
> You just can't trust == on FP operations. Notice also

Additionally, since you're implementing a "mean" function that you are testing
against R's mean, you might want to consider that R uses a two-pass
calculation[1] to reduce floating point precision error.
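
A sketch of the two-pass idea in R (the actual implementation is in C and
accumulates in long double, so this is illustrative only):

    mean2pass <- function(x) {
      n <- length(x)
      m <- sum(x) / n      # first pass: naive mean
      m + sum(x - m) / n   # second pass: fold the residual error back in
    }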

Best,

Brodie.

[1] https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/summary.c#L482

> > a2=(z[idx]+x[idx]+y[idx])/3
> > a2==a
> [1] FALSE
> > a2==b
> [1] TRUE
>
> -pd
>
> > On 20 May 2020, at 12:40 , Morgan Morgan  wrote:
> >
> > Hello R-dev,
> >
> > Yesterday, while I was testing the newly implemented function pmean in
> > package kit, I noticed a mismatch in the output of the below R expressions.
> >
> > set.seed(123)
> > n=1e3L
> > idx=5
> > x=rnorm(n)
> > y=rnorm(n)
> > z=rnorm(n)
> > a=(x[idx]+y[idx]+z[idx])/3
> > b=mean(c(x[idx],y[idx],z[idx]))
> > a==b
> > # [1] FALSE
> >
> > For idx= 1, 2, 3, 4 the last line is equal to TRUE. For 5, 6 and many
> > others the difference is small but still.
> > Is that expected or is it a bug?



Re: [Rd] edit() doubles backslashes when keep.source=TRUE

2020-05-15 Thread brodie gaslam via R-devel


> On Friday, May 15, 2020, 12:13:04 PM EDT, Dirk Eddelbuettel  
> wrote:
> On 15 May 2020 at 15:41, Martin Maechler wrote:
> | 
> |
> |    Why does nobody anymore  help R development by working with
> |    "R-devel", or at least then the alpha, beta and the "RC"
> |    (Release Candidate) versions that we release daily for about one
> |    month before the final release?
> |
> |    Notably a highly staffed enterprise such as Rstudio (viz the bug
> |    report 17800 above), but also others could really help by
> |    starting to use the "next version" of R on a routine basis ...
> |
> | 
>
> Seconded. Without testing we can never know. R Core does their part.
>
> I provided weekly Debian binaries. One each for the two alphas releases, for
> the beta release, for the release candidate.  It is easy to use these, for
> example in a Docker container.
>
> It is also easy to use this on a normal machine as they are standard (Debian)
> packages: install, try some tests, uninstall, revert to previous version by
> installing that.
>
> Dirk

This is a very reasonable request, and all useRs who benefit from the
tireless work of R-core should consider doing it.  I have considered
it, but compiling R from sources on OS X has been my stumbling block.
At least the last time I tried I got stuck at the Fortran step.  It doesn't
help that I have very limited experience compiling software of the
complexity of R.  Really, I've only done it within the warm welcoming
confines of the vagrant image Tomas Kalibera set up for `rchk`.

I also use r-devel on docker, but that isn't very practical for
day-to-day usage, which is what I think we need.

What would it take to generate pre-release binaries for OS X (and Windows)?
I imagine that if such binaries were available the volume of testers would
increase dramatically (at least, I haven't seen them if they exist).
Maybe this is something the R Consortium would consider funding?

Best,

B.



Re: [Rd] Minor Infelicity in Printing of Objects Nested in Lists

2020-05-10 Thread brodie gaslam via R-devel
 > On Sunday, May 10, 2020, 12:24:17 PM EDT, Lionel Henry  
 > wrote:
>
> The main reason for resetting the tagbuf in `print.default()` and
> other entry points to the print routine is that it is currently not
> reset on exit. Creating a context to reset it on exit to its last
> value might work. This should be done in the entry points rather than
> in print-value-rec though, since callers of the latter might write to
> the tagbuf.
>
> Another solution to this problem is proposed with the first patch in
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17610. Maybe I
> should extract this patch into its own entry so the discussion can be
> separate from the other patches.

Ah, that looks like the exact same issue, at least the first patch.  I'm
not wedded to any particular solution, but I am happy to help test/work on
this if there is interest in addressing it.

Best,

B.

>On 5/10/20, brodie gaslam via R-devel  wrote:
>> Currently S3 objects nested in generic vectors cause the tag buffer to be
>> reset.  This feels sub-optimal for those objects that don't have a print
>> method:
>>
>>> list(a=list(b='hello'))
>> $a
>> $a$b ### <<<< notice "$a$b"
>> [1] "hello"
>>
>>
>>> list(a=structure(list(b='hello'), class='world'))
>> $a
>> $b   ### <<<< notice "$b", not "$a$b"
>> [1] "hello"
>>
>> attr(,"class")
>> [1] "world"



[Rd] Minor Infelicity in Printing of Objects Nested in Lists

2020-05-10 Thread brodie gaslam via R-devel
Currently S3 objects nested in generic vectors cause the tag buffer to be 
reset.  This feels sub-optimal for those objects that don't have a print method:

> list(a=list(b='hello'))
$a
$a$b ### <<<< notice "$a$b"
[1] "hello"


> list(a=structure(list(b='hello'), class='world'))
$a
$b   ### <<<< notice "$b", not "$a$b"
[1] "hello"

attr(,"class")
[1] "world"

This happens because the default print method resets the tag buffer anytime 
it's called[1], whether by a custom print method, or by internal C code as part 
of the recursion into objects in `printValueRec`[2].

One possible way to "fix" this is to make it the responsibility of 
`printValueRec` to reset the tag buffer on exit from the top level, but that 
would mean always having an active context to catch errors instead of just for 
win32 as is the case now.  Additionally, print method authors may themselves 
intentionally rely on `print.default` for some parts of their output, in which 
case it may be desirable to reset the tag buffer.

Obviously not a big deal.  In my use case it can make it difficult to identify 
what part of a complex object caused a difference in a diff with a few 
neighboring context lines.  Just mentioning it in case there is interest in 
doing something about it.  If so I'll be happy to create a bugzilla ticket and 
assist with patches/testing.

Best,

B.

[1]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/print.c#L309
[2]: https://github.com/wch/r-source/blob/tags/R-4-0-0/src/main/print.c#L570



[Rd] Character Column-width Calculations

2020-05-03 Thread brodie gaslam via R-devel
I just submitted a patch on bugzilla[1] to update the internal column-width 
tables to a more recent version of Unicode.  The most obvious way this shows up 
is with emoji now having a computed width of 2 columns instead of 1 once the 
patch is applied:

> nchar('\U0001F600', type='width')  # grinning face emoji
[1] 2 

This wasn't really an issue for a long time as many terminals didn't even do 
this properly either, but over the last couple of years it seems as if 
terminals have started to update their width tables.

I don't personally care much for emoji, but do like my wrapped-text columns to 
be nice and even, which is why I'm proposing the patch.  Posting here in case 
others are interested in the issue.

This doesn't do any grapheme computations so it won't work with combining 
emoji, but should still be a step in the right direction.
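
For example, a ZWJ family sequence is measured as the sum of its component
widths rather than the two columns a grapheme-aware terminal renders it in
(the exact numbers are my expectation given the tables, so treat this as
illustrative):

    ## man + ZWJ + woman + ZWJ + boy, one glyph on many terminals:
    nchar('\U0001F468\u200D\U0001F469\u200D\U0001F466', type='width')
    ## likely 6 with the updated tables (2 + 0 + 2 + 0 + 2), not 2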

Best,

Brodie

[1]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17781



[Rd] Possible Bug In Validation of UTF-8 Sequences

2020-04-04 Thread brodie gaslam via R-devel
As per `?intToUtf8`, and in the comments to `valid_utf8`[1], R 
intends to prevent illegal UTF-8 such as UTF-8 encoded
UTF-16 surrogate pairs.  `R_nchar`, invoked via `base::nchar`,
explicitly validates UTF-8 strings[2], but allows the surrogate:

    > Encoding('\ud800')
    [1] "UTF-8"
    > nchar('\ud800')  // should be an error
    [1] 1

The problem manifests on systems where `char` is signed.  The logic
used to test for the forbidden sequences implicitly assumes that
`char` is unsigned[3]:

    if (c == 0xe0 && (d & 0x20) == 0) return 1;
    if (c == 0xed && d >= 0xa0) return 1;

Notice the `d >= 0xa0`.  On a system where `char` is signed this can
only ever be true if a byte has more than 8 bits, as otherwise the
maximum value of `d` is 0x7f because `d` is retrieved from a plain
`char` pointer[4] (d is `int`):

    if (((d = *(++p)) & 0xc0) != 0x80) return 1;

Where `p` is defined as[5]:

 const char *p;

In contrast `c` above is correctly cast to `unsigned char` prior to
use[8]:

 c = (unsigned char)*p;
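
A standalone demonstration of the pitfall (assuming an 8-bit signed
`char`, as is typical on x86):

    #include <stdio.h>

    int main(void) {
        char p = (char) 0xA0;       /* -96 when char is signed      */
        int d = p;                  /* sign-extends to -96, not 160 */
        printf("%d\n", d >= 0xa0);  /* 0: the check can never fire  */
        d = (unsigned char) p;      /* 160: the intended value      */
        printf("%d\n", d >= 0xa0);  /* 1                            */
        return 0;
    }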

I attach a simple patch to address this.

I also include a patch to remove the handling of surrogates from
`R_nchar` as that should no longer be necessary, and additionally the
current handling appears incorrect.  AFAICT, the current handling
attempts to decode a surrogate pair by combining the high surrogate
with the same high surrogate, instead of the high surrogate with the
subsequent character that hopefully is the low surrogate[7].

Here is some code that could be added to regression tests:

    surr_good <- '\ud840\udc00'    # auto-converts to normal
    surr_bad <- paste0('\ud840', '\udc00') # surrogates remain
    good <- c('hello', 'world', surr_good, '\ud7ff', '\ue000', '\U0010')
    bad <- c(surr_bad, '\ud800', '\udfff', '\U0011')

On R3.6.3:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] 2 1 1 1

On R-devel (2020-03-31 r78116) w/ patch:

    nchar(good, allowNA=TRUE)
    [1] 5 5 1 1 1 1
    nchar(bad, allowNA=TRUE)
    [1] NA NA NA NA

I ran `make check-devel` successfully, although I did have to suppress
one PCRE test[9] that segfaults on my particular set-up, though that
segfaulted prior to my patch as well.

The patch does not prevent the creation of illegal UTF-8 strings,
although I imagine it would be straightforward to add checks to the
entry points into CHARSXPs if that were desired.

Finally, this signed char business hints at a potential broader issue.
If I understand correctly importing byte sequences with values greater
than 0x7f overflows the `char` buffers on systems with signed chars
and octet (or lesser) bytes, e.g. as in `do_readLines`[6] where an
integer procured via `fgetc` is subsequently stored in a `char`
buffer.  Obviously this hasn't mattered much to date, presumably
because the implementations R runs on define the `unsigned char` to
`signed char` conversion in such a way that the `signed char` to
`unsigned char` conversion restores the original value.  I don't know
if this is something explicitly checked for like the `int` == 32bits
assumption.

[1]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L61
[2]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L148
[3]: 
https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L106
[4]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L84
[5]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L69
[6]: 
https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/connections.c#L3935
[7]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/character.c#L184
[8]: https://github.com/wch/r-source/blob/tags/R-3-6-3/src/main/valid_utf8.h#L73
[9]: https://github.com/wch/r-source/blob/tags/R-3-6-3/tests/PCRE.R#L16

Index: src/main/valid_utf8.h
===
--- src/main/valid_utf8.h   (revision 78116)
+++ src/main/valid_utf8.h   (working copy)
@@ -66,11 +66,11 @@
 static int
 valid_utf8(const char *string, size_t length) // R change int->size_t
 {
-const char *p;
+const unsigned char *p;
 
 for (p = string; length-- > 0; p++) {
-   int ab, c, d;
-   c = (unsigned char)*p;
+   unsigned int ab, c, d;
+   c = *p;
if (c < 128) continue;/* ASCII character */
if (c < 0xc0) return 1;   /* Isolated 10xx  byte */
if (c >= 0xfe) return 1; /* Invalid 0xfe or 0xff bytes */
Index: src/main/character.c
===
--- src/main/character.c(revision 78116)
+++ src/main/character.c(working copy)
@@ -180,10 +180,7 @@
int nc = 0;
for( ; *p; p += utf8clen(*p)) {
utf8toucs(&wc1, p);
-   if (IS_HIGH_SURROGATE(wc1))
-   ucs = utf8toucs32(wc1, p);
-   

Re: [Rd] findInterval Documentation Suggestion

2020-03-06 Thread brodie gaslam via R-devel
 > On Friday, March 6, 2020, 8:56:54 AM EST, Martin Maechler 
 >  wrote: 

> Note that the * -> LaTeX -> PDF rendered version looks a bit nicer.

Ah yes, that does indeed look quite a bit nicer.

> I wrote the function and that help page originally.

And thank you for doing so. It is a wonderful function.
(0 sarcasm here).

> For that reason, replacing the well defined precise
> inequality-based definition by *much* less precise English prosa
> is out of the question.

I figured that might be an issue.  Would you be open to 
providing a prose translation, but putting that in the 
details? If so, it would be useful to get feedback on 
what parts of the prose I proposed are imprecise enough 
to be incorrect/incomplete for some corner case.

Finally, would it make sense to move this discussion to
bugzilla?

Best,

Brodie.



Re: [Rd] findInterval Documentation Suggestion

2020-03-05 Thread brodie gaslam via R-devel
Trying the attachment as .txt instead of Rd.


On Thursday, March 5, 2020, 5:20:25 PM EST, brodie gaslam via R-devel 
 wrote: 


% File src/library/base/man/findInterval.Rd
% Part of the R package, https://www.R-project.org
% Copyright 1995-2020 R Core Team
% Distributed under GPL 2 or later

\name{findInterval}
\alias{findInterval}
\title{Find Interval Numbers or Indices}
\usage{
findInterval(x, vec, rightmost.closed = FALSE, all.inside = FALSE,
 left.open = FALSE)
}
\arguments{
  \item{x}{numeric.}
  \item{vec}{numeric, sorted (weakly) increasingly, of length \code{N},
say.}
  \item{rightmost.closed}{logical; if true, the rightmost bounded
interval, \code{vec[N-1] .. vec[N]} is treated as \emph{closed},
see below.}
  \item{all.inside}{logical; if true, the returned indices are coerced
into \code{1,\dots,N-1}, i.e., \code{0} is mapped to \code{1}
and \code{N} to \code{N-1}.}
  \item{left.open}{logical; if true all the intervals are open at left
and closed at right; in the description \sQuote{less than or equal to}
becomes \sQuote{strictly less than}, and
\code{rightmost.closed} means \sQuote{leftmost is closed}.  This may
be useful, e.g., in survival analysis computations.}
}
\description{
  Given a vector of non-decreasing values \code{vec}, for each value in
  \code{x} return the highest position in \code{vec} that corresponds
  to a value less than or equal to that \code{x} value, or zero if none
  are.  Equivalently, if the values in \code{vec} are taken to be the
  closed left-bounds of contiguous half-open intervals, return which of
  those intervals each value of \code{x} lies in.
}
\details{
  Under the default parameter values the bounds in \code{vec} define
  intervals that are closed on the left and open on the right:

  \preformatted{
Bound:   -Inf   vec[1]   vec[2] ... vec[N-1]   vec[N]   +Inf
Interval: 0   )[   1   )[   ...   )[   N-1   )[   N
  }

  Intervals \eqn{0} and \eqn{N} are half-bounded, and interval \eqn{0}
  is implicitly defined.  Interval \eqn{0} does not exist if \code{vec}
  includes \eqn{-\infty}, and interval \eqn{N} does not exist if
  \code{vec} includes \eqn{+\infty}.

  \code{left.open=TRUE} reverses which side of the intervals is open,
  \code{rightmost.closed=TRUE} closes interval \eqn{N-1} on both sides
  (or interval \eqn{1} if \code{left.open=TRUE}), and \code{all.inside=TRUE}
  drops bounds \eqn{1} and \eqn{N}, which merges interval \eqn{0} into
  \eqn{1} and interval \eqn{N} into \eqn{N-1}.

  The internal algorithm uses interval search
  ensuring \eqn{O(n \log N)}{O(n * log(N))} complexity where
  \code{n <- length(x)} (and \code{N <- length(vec)}).  For (almost)
  sorted \code{x}, it will be even faster, basically \eqn{O(n)}.
}
\value{
  vector of length \code{length(x)} with values in \code{0:N} (and
  \code{NA}) where \code{N <- length(vec)}, or values coerced to
  \code{1:(N-1)} if and only if \code{all.inside = TRUE} (equivalently
  coercing all x values \emph{inside} the intervals).  Note that
  \code{\link{NA}}s are
  propagated from \code{x}, and \code{\link{Inf}} values are allowed in
  both \code{x} and \code{vec}.
}
\author{Martin Maechler}
\seealso{\code{\link{approx}(*, method = "constant")} which is a
  generalization of \code{findInterval()}, \code{\link{ecdf}} for
  computing the empirical distribution function which is (up to a factor
  of \eqn{n}) also basically the same as \code{findInterval(.)}.
}
\examples{
v <- c(5, 10, 15) # create bins [5,10), [10,15), and [15,+Inf)
x <- c(2, 5, 8, 10, 12, 15, 17)
intervals <- rbind(
  'match(x, v)'=match(x, v),  # `x` values that are on bounds
  x,
  default=  findInterval(x, v),
  rightmost.cl= findInterval(x, v, rightmost.closed=TRUE),
  left.open=findInterval(x, v, left.open=TRUE),
  all.in=   findInterval(x, v, all.inside=TRUE)
)
v
intervals

N <- 100
X <- sort(round(stats::rt(N, df = 2), 2))
tt <- c(-100, seq(-2, 2, length.out = 201), +100)
it <- findInterval(tt, X)
tt[it < 1 | it >= N] # only first and last are outside range(X)

##  'left.open = TRUE' means  "mirroring" :
N <- length(v)
stopifnot(identical(
  findInterval( x,  v,  left.open=TRUE) ,
  N - findInterval(-x, -v[N:1])))
}
\keyword{arith}
\keyword{utilities}


[Rd] findInterval Documentation Suggestion

2020-03-05 Thread brodie gaslam via R-devel
I've found over time that R documentation that comes off as terse at
first blush is usually revealed to be precise, concise, and complete
on close reading.  I'm sure this is also true of `?findInterval`, but
for whatever reason my brain simply refuses to extract meaning from it.

Part of the problem may be that we interact with the function via a
compressed form of the bounds of the intervals (only specify left bounds
for 2nd interval onwards), but the semantics are described mostly in
terms of the intervals themselves.  This requires indirections to map
the parameters to the concepts.

An alternative is to first describe what the function does directly in
terms of its inputs, and subsequently relate that to the intervals.  If I
understand correctly (in default mode) the function can be described as:

 Given a vector of non-decreasing values 'vec', for each value in
 'x' return the highest position in 'vec' that corresponds to a
 value less than or equal to that 'x' value, or zero if none are.
 Equivalently, if the values in 'vec' are taken to be the closed
 left-bounds of contiguous half-open intervals, return which of
 those intervals each value of 'x' lies in.

Compared to the original:

 Given a vector of non-decreasing breakpoints in ‘vec’, find the
 interval containing each element of ‘x’; i.e., if ‘i <-
 findInterval(x,v)’, for each index ‘j’ in ‘x’ v[i[j]] <= x[j] <
 v[i[j] + 1] where v[0] := - Inf, v[N+1] := + Inf, and ‘N <-
 length(v)’.  At the two boundaries, the returned index may differ
 by 1, depending on the optional arguments ‘rightmost.closed’ and
 ‘all.inside’.
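
For concreteness, the behavior both descriptions are getting at:

    v <- c(5, 10, 15)
    findInterval(c(2, 5, 12, 15), v)
    ## [1] 0 1 2 3
    ## 2 precedes v[1] so it maps to 0; 5 and 12 fall in intervals 1 and 2;
    ## 15 opens the half-bounded interval 3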

Obviously you would be right to question whether someone who claims not
to understand the documentation should venture to re-write it.
Nonetheless I attach a proposed alternate version in the hopes that
someone who clearly understand the original might use or adapt parts of it to
make `?findInterval` more accessible to those comprehension-challenged
like me.


Best,

Brodie
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] survival bug? - solved

2020-03-05 Thread brodie gaslam via R-devel
I _think_ the relevant section of the C standard is 6.5.6 Additive Operators 
Par 8, excerpted here:

> If both the pointer operand and the result point to elements 
> of the same array object, or one past the last element of the 
> array object, the evaluation shall not produce an overflow; 
> otherwise, **the behavior is undefined**.

This is from the [C11 draft][1], though I imagine it has been part of the
standard for a while.  So for id[-1], the pointer operand is id and the
result points one element _before_ the array object, and thus we have
undefined behavior, which is bad news.
I'm not an expert in these matters though.

[1]: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf


On Thursday, March 5, 2020, 11:39:38 AM EST, Therneau, Terry M., Ph.D. via 
R-devel  wrote: 





 I ended up finding the issue by a focused code review.

Once in the past, I had a version that would fail under one architecture but
not another; in that case some help from Brian Ripley pointed me to the
offending line of C code.  That line read, but did not write, at an invalid
memory location.  Starting with the question of "what C routines have I added
or modified most recently", along with where the fault appeared to occur in my
test suite, I started reading C code and found one.  Revised code passes tests
on the winbuilder site.

For the curious, I had a line asking "is this patient id different than the
last patient id" in the C routine underneath survcheck(); I'm making sure
that patients don't go backwards in time.  Essentially

     for (i=0; i < n; i++) {
         if (id[i] != id[i-1])  { ...}

It is still a surprise to me that just LOOKING at this out-of-range element
would cause a failure, as [i-1] never appears on the left hand side of any
expression in the ... chunk above.  Nevertheless, it was an error.  Que sera
sera.
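
For reference, the standard shape of the fix is to start the loop at the
second element so that `id[i-1]` is always in bounds (a sketch, with `id`
and `n` as above):

    /* sketch: act on any position where the patient id changes */
    static void check_ids(const int *id, int n)
    {
        for (int i = 1; i < n; i++) {  /* start at 1: id[i-1] in bounds */
            if (id[i] != id[i-1]) { /* patient changed */ }
        }
    }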

A strong thanks to those who gave solid suggestions for bringing up a local 
copy of Windows.

Terry T

>>> My latest submission of survival3.1-10 to CRAN fails  a check, but only on
>>> windows, which
>>> I don't use.
>>> How do I track this down?
>>> The test in question works fine on my Linux box.
>>>
>>> Terry




Re: [Rd] Another wish (?) for R 4.0.0: print(*, width = )

2020-01-07 Thread brodie gaslam via R-devel
For whatever my 2c are worth, I think this would be nice.  I'm still
uncomfortable at having to call `options` in my package `diffobj` to set
output width.
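
That is, today the only lever is the global option, since a `width`
argument is accepted but silently ignored (as Martin notes below), e.g.:

    x <- runif(20)
    print(x, width = 40)         # accepted but silently ignored
    old <- options(width = 40)   # the workaround: temporarily set the option
    print(x)
    options(old)                 # restore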

And since the topic is here, what about `show`?  Feels like it should accept
`...` so that it too could be given some set of standard or non-standard
parameters, including `width`.
Happy new decade.
B.

On Tuesday, January 7, 2020, 6:58:32 AM EST, Martin Maechler 
 wrote:  
 
 One of the things I often wish R would work with:

When calling print() explicitly --- as I do not so rarely, e.g.,
specifying  digits =  ---
it sometimes seems awkward that from the printing options() ,
one can specify 'digits' and it has default  digits = NULL which is
documented to be equivalent to  digits = getOption("digits"),
but one cannot specify 'width'
... well "even worse": one *can* specify 'width = .' but it is
silently ignored - as well documented  on  ?print.default

Before considering to add this for R 4.0.0, doing the work
myself, I'd quickly wanted to hear opinions / caveats / .. about this.

wishing you all a  Happy New Year,
Martin

Martin Maechler
ETH Zurich and R Core Team



Re: [Rd] [External] Mitigating Stalls Caused by Call Deparse on Error

2019-07-16 Thread brodie gaslam via R-devel
As per Luke's instructions I've updated the [wishlist item][1]
 to include the deparse-on-error issue, and also renamed it to
something more appropriate for its broader scope.

I do share Lionel's concern that the deparse-on-error issue
could get unnecessarily delayed in the hopes of finding a more
comprehensive solution for all the problems.  I would really
appreciate it if, when you all have a chance to look at this issue,
you could consider the possibility of resolving the deparse-on-error
one separately should a global solution seem out of reach.

Thank you for all your work making R better and more
performant.

Best,

Brodie.


[1]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17580


On Tuesday, July 16, 2019, 4:53:37 AM EDT, Lionel Henry  
wrote: 





We also have a few other suggestions and wishes about backtrace
storage and display on the one hand, and display of constructed calls
on the other hand. Perhaps it would be better to open a different
wishlist item for traceback() to keep the discussions focused?

FWIW I think deparsing backtraces lazily is a great idea. Displaying 1
line per call by default in interactive sessions, while being able to
get a more exhaustive output by calling `traceback(max.lines = Inf)`,
offers the best of both worlds. Understanding the structure of the
backtrace quickly is often more helpful than having immediate access
to all the information.

Best,
Lionel


> On 15 Jul 2019, at 16:04, Tierney, Luke  wrote:
> 
> Better to add this to the wishlist item. This all needs to be looked
> at together, and nothing is likely to happen until after
> vacation/conference season.  It will disappear from everyone's radar
> if it is just in R_devel.
> 
> Best,
> 
> luke
> 
> On Sun, 14 Jul 2019, brodie gaslam wrote:
> 
>> Luke, thanks for considering the issue.  I would like to
>> try to separate the problem into two parts, as I _think_
>> your comments address primarily part 2 below:
>> 
>> 1. How can we avoid significant and possibly crippling
>>    stalls on error with these non-standard calls.
>> 2. What is the best way to view these non-standard calls.
>> 
>> I agree that issue 2. requires further thought and
>> discussion under a wishlist issue ([on bugzilla now][1]). 
>> While I did raise issue 2., the patch itself makes no
>> attempt to resolve it.
>> 
>> The proposed patch resolves issue 1., which is a big
>> usability problem.  Right now if you have the misfortune of
>> using `do.call` with a big object and trigger an error, you
>> have the choice of waiting a possibly long time for
>> the deparse to complete, or killing your entire R session
>> externally.
>> 
>> It seems a shame to allow a big usability issue for `do.call`
>> to remain when there is a simple solution at hand, especially
>> since the complete deparse of large objects likely serves no
>> purpose in this case. Obviously, if storing the actual calls
>> instead of their deparsed equivalents in .Traceback causes
>> problems I'm not anticipating, then that's different. 
>> Is that the case?
>> 
>> Best,
>> 
>> Brodie.
>> 
>> [1]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17580
>> 
>> On Sunday, July 14, 2019, 8:52:45 AM EDT, Tierney, Luke 
>>  wrote:
>> 
>> 
>> 
>> 
>> 
>> This is probably best viewed in the context of other issue with
>> displaying calls, such as issues arising from calls constructed in
>> non-standard evaluation contexts. Might be good to move to a wishlist
>> item in bugzilla.
>> 
>> Best,
>> 
>> luke
>> 
>> On Sat, 13 Jul 2019, brodie gaslam via R-devel wrote:
>> 
>>> When large calls cause errors R may stall for extended periods.  This
>>> is particularly likely to happen with `do.call`, as in this example
>>> with a 24 second stall:
>>> 
>>>    x <- runif(1e7)
>>>    system.time(do.call(paste0, list(abs, x)))  # intentional error
>>>    ## Error in (function (..., collapse = NULL)  :
>>>    ##  cannot coerce type 'builtin' to vector of type 'character'
>>>    ## Calls: system.time -> do.call -> 
>>>    ## Timing stopped at: 23.81 0.149 24.04
>>> 
>>>    str(.Traceback)
>>>    ## Dotted pair list of 3
>>>    ##  $ : chr [1:2500488] "(function (..., collapse = NULL) " 
>>>".Internal(paste0(list(...), collapse)))(.Primitive(\"abs\"), 
>>>c(0.718117154669017, " "0.494785501621664, 0.1453434410505, 
>>>0.635028422810137, 0.0353180423844606, " "0.688418723642826, 
>>>0

Re: [Rd] [External] Mitigating Stalls Caused by Call Deparse on Error

2019-07-14 Thread brodie gaslam via R-devel
Luke, thanks for considering the issue.  I would like to try to separate
the problem into two parts, as I _think_ your comments address primarily
part 2 below:

1. How can we avoid significant and possibly crippling
   stalls on error with these non-standard calls.
2. What is the best way to view these non-standard calls.

I agree that issue 2. requires further thought and discussion under a
wishlist issue ([on bugzilla now][1]).  While I did raise issue 2., the
patch itself makes no attempt to resolve it.

The proposed patch resolves issue 1., which is a big usability problem.
Right now if you have the misfortune of using `do.call` with a big object
and trigger an error, you have the choice of waiting a possibly long time
for the deparse to complete, or killing your entire R session externally.

It seems a shame to allow a big usability issue for `do.call` to remain
when there is a simple solution at hand, especially since the complete
deparse of large objects likely serves no purpose in this case.
Obviously, if storing the actual calls instead of their deparsed
equivalents in .Traceback causes problems I'm not anticipating, then
that's different.  Is that the case?

Best,

Brodie.

[1]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17580

On Sunday, July 14, 2019, 8:52:45 AM EDT, Tierney, Luke 
 wrote: 





This is probably best viewed in the context of other issue with
displaying calls, such as issues arising from calls constructed in
non-standard evaluation contexts. Might be good to move to a wishlist
item in bugzilla.

Best,

luke

On Sat, 13 Jul 2019, brodie gaslam via R-devel wrote:

> When large calls cause errors R may stall for extended periods.  This
> is particularly likely to happen with `do.call`, as in this example
> with a 24 second stall:
>
>     x <- runif(1e7)
>     system.time(do.call(paste0, list(abs, x)))  # intentional error
>     ## Error in (function (..., collapse = NULL)  :
>     ##   cannot coerce type 'builtin' to vector of type 'character'
>     ## Calls: system.time -> do.call -> 
>     ## Timing stopped at: 23.81 0.149 24.04
>
>     str(.Traceback)
>     ## Dotted pair list of 3
>     ##  $ : chr [1:2500488] "(function (..., collapse = NULL) " 
> ".Internal(paste0(list(...), collapse)))(.Primitive(\"abs\"), 
> c(0.718117154669017, " "0.494785501621664, 0.1453434410505, 
> 0.635028422810137, 0.0353180423844606, " "0.688418723642826, 
> 0.889682895969599, 0.728154224809259, 0.292572240810841, " ...
>     ##  $ : chr "do.call(paste0, list(abs, x))"
>     ##  $ : chr "system.time(do.call(paste0, list(abs, x)))"
>
> The first time I noticed this I thought my session had frozen/crashed
> as the standard interrupt ^C does not work during the deparse.  The
> stall happens when on error the call stack is deparsed prior to being
> saved to `.Traceback`.  The deparsing is done by `deparse1m` in native
> code, with the value of `getOption('deparse.max.lines')` which
> defaults to all lines.
>
> Since there is little value to seeing millions of lines of deparsed
> objects in `traceback()`, a simple work-around is to change the
> `deparse.max.lines` value:
>
>     options(deparse.max.lines=1)
>     system.time(do.call(paste0, list(abs, x)))
>     ## Error in (function (..., collapse = NULL)  :
>     ##   cannot coerce type 'builtin' to vector of type 'character'
>     ## Calls: system.time -> do.call -> 
>     ## Timing stopped at: 0 0 0
>
> Unfortunately this will affect all `deparse` calls, and it seems
> undesirable to pre-emptively enable it just for calls that might cause
> large deparses on error.
>
> An alternative is to store the actual calls instead of their deparsed
> character equivalents in `.Traceback`.  This defers the deparsing to
> when `traceback()` is used.  As per `?traceback`, it should be
> relatively safe to modify `.Traceback` in this way:
>
>> It is undocumented where .Traceback is stored nor that it is
>> visible, and this is subject to change.
>
> Deferring the deparsing to `traceback()` will give us the 
> opportunity to use a different `max.lines` setting as we do here
> with the patch applied:
>
>     system.time(do.call(paste0, list(abs, x)))
>     ## Error in (function (..., collapse = NULL)  :
>     ##   cannot coerce type 'builtin' to vector of type 'character'
>     ## Timing stopped at: 0.028 0 0.029
>
>     system.time(traceback(max.lines=3))
>     ## 3: (function (..., collapse = NULL)
>     ##    .Internal(paste0(list(...), collapse)))(.Primitive("abs"), 
> c(0.535468587651849,
>     ##    0.0540027911774814, 0.732930393889546, 0.565360915614292, 
>

Re: [Rd] strange increase in the reference number

2019-07-13 Thread brodie gaslam via R-devel
Re ENSURE_NAMEDMAX, I am unsure but think this happens in (src/eval.c@492):
    static SEXP forcePromise(SEXP e)
    {
        if (PRVALUE(e) == R_UnboundValue) {
            /* ... SNIP ...*/
            val = eval(PRCODE(e), PRENV(e));
            /* ... SNIP ...*/
            SET_PRSEEN(e, 0);
            SET_PRVALUE(e, val);
            ENSURE_NAMEDMAX(val);   /* <<< HERE */
            SET_PRENV(e, R_NilValue);
        }
        return PRVALUE(e);
    }

as part of the evaluation of the closure.  `forcePromise` is called in eval
(src/eval.c@656).  It's been a while since I've looked at the mechanics of
how the native version of `eval` works, so I could be completely wrong.

B.

PS: line references are in r-devel@76287.


On Friday, July 12, 2019, 4:38:06 PM EDT, Gabriel Becker 
 wrote: 





Hi Jiefei and Duncan,

I suspect what is likely happening is that one of  ENSURE_NAMEDMAX or
MARK_NOT_MUTABLE are being hit for x. These used to set named to 3, but now
set it to 7 (ie the previous and current NAMEDMAX  value, respectively).

Because these are macros rather than C functions, it's not easy to figure
out why one of them is being invoked from do_isvector  (a cursory
exploration didn't reveal what was going on, at least to me) and I don't
have the time to dig super deeply into this right now,  but perhaps Luke or
Tomas know why this is happening of the top of their head.

Sorry I can't be of more help.

~G



On Fri, Jul 12, 2019 at 11:47 AM Duncan Murdoch 
wrote:

> On 12/07/2019 1:22 p.m., King Jiefei wrote:
> > Hi,
> >
> > I just found a strange increase in the reference number and I'm wondering
> > if there is any reason for it, here is the code.
> >
> >> a=c(1,2,3)
> >> .Internal(inspect(a))
> > @0x1bf0b9b0 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 1,2,3
> >> is.vector(a)
> > [1] TRUE
> >> .Internal(inspect(a))
> > @0x1bf0b9b0 14 REALSXP g0c3 [NAM(7)] (len=3, tl=0) 1,2,3
> >
> > The variable *a* initially has one reference number, after calling
> > *is.vector* function, the reference number goes to 7, which I believe is
> > the highest number that is allowed in R.  I also tried the other R
> > functions, *is.atomic, is.integer* and *is.numeric* do not increase the
> > reference number, but *typeof *will do. Is it intentional?
>
> is.vector() is a closure that calls .Internal.  is.atomic(),
> is.integer() and is.numeric() are all primitives.
>
> Generally speaking closures that call .Internal are easier to implement
> (e.g. is.vector can use the regular mechanism to set a default for its
> second argument), but less efficient in CPU time.  From its help page,
> it appears that the logic for is.vector() is a lot more complex than for
> the others, so that implementation does make sense.
>
> So why does NAMED go to 7?  Initially, the vector is bound to a.  Within
> is.vector, it is bound to the local variable x.  At this point there are
> two names bound to the same object, so it has to be considered
> immutable.  There's really no difference between any of the values of 2
> or more in the memory manager.  (But see
> http://developer.r-project.org/Refcnt.html for some plans.  That
> document is from about 5 years ago; I don't know the current state.)
>
> Duncan Murdoch


Re: [Rd] strange increase in the reference number

2019-07-13 Thread brodie gaslam via R-devel
 If you would like more details I wrote about this recently:
https://www.brodieg.com/2019/02/18/an-unofficial-reference-for-internal-inspect/

Basically, as soon as you hit a closure R assumes that the argument might
still have a surviving reference to it even after the closure evaluation
ends, because the closure environment still exists.  This is why the named
count is set to the maximum value.

Note that since the time I wrote the above and now, the maximum reference
count changed to 7 from 3.  At some point I believe there was a plan to add
true(er?) reference counting in 3.6 but that was not quite ready in time.
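
A minimal illustration of the closure effect under the NAMED regime
discussed here (the counts differ under the newer reference-counting
scheme):

    f <- function(x) x     # closure: forcing x hits ENSURE_NAMEDMAX
    a <- c(1, 2, 3)
    .Internal(inspect(a))  # NAM(1)
    invisible(f(a))
    .Internal(inspect(a))  # NAM(7): a reference may survive in f's frame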

On Friday, July 12, 2019, 2:47:41 PM EDT, Duncan Murdoch 
 wrote:  
 
 On 12/07/2019 1:22 p.m., King Jiefei wrote:
> Hi,
> 
> I just found a strange increase in the reference number and I'm wondering
> if there is any reason for it, here is the code.
> 
>> a=c(1,2,3)
>> .Internal(inspect(a))
> @0x1bf0b9b0 14 REALSXP g0c3 [NAM(1)] (len=3, tl=0) 1,2,3
>> is.vector(a)
> [1] TRUE
>> .Internal(inspect(a))
> @0x1bf0b9b0 14 REALSXP g0c3 [NAM(7)] (len=3, tl=0) 1,2,3
> 
> The variable *a* initially has one reference number, after calling
> *is.vector* function, the reference number goes to 7, which I believe is
> the highest number that is allowed in R.  I also tried the other R
> functions, *is.atomic, is.integer* and *is.numeric* do not increase the
> reference number, but *typeof *will do. Is it intentional?

is.vector() is a closure that calls .Internal.  is.atomic(), 
is.integer() and is.numeric() are all primitives.

Generally speaking closures that call .Internal are easier to implement 
(e.g. is.vector can use the regular mechanism to set a default for its 
second argument), but less efficient in CPU time.  From its help page, 
it appears that the logic for is.vector() is a lot more complex than for 
the others, so that implementation does make sense.

So why does NAMED go to 7?  Initially, the vector is bound to a.  Within 
is.vector, it is bound to the local variable x.  At this point there are 
two names bound to the same object, so it has to be considered 
immutable.  There's really no difference between any of the values of 2 
or more in the memory manager.  (But see 
http://developer.r-project.org/Refcnt.html for some plans.  That 
document is from about 5 years ago; I don't know the current state.)

Duncan Murdoch



[Rd] Mitigating Stalls Caused by Call Deparse on Error

2019-07-13 Thread brodie gaslam via R-devel
When large calls cause errors R may stall for extended periods.  This
is particularly likely to happen with `do.call`, as in this example 
with a 24 second stall:

    x <- runif(1e7)
    system.time(do.call(paste0, list(abs, x)))  # intentional error
    ## Error in (function (..., collapse = NULL)  : 
    ##   cannot coerce type 'builtin' to vector of type 'character'
    ## Calls: system.time -> do.call -> 
    ## Timing stopped at: 23.81 0.149 24.04

    str(.Traceback)
    ## Dotted pair list of 3
    ##  $ : chr [1:2500488] "(function (..., collapse = NULL) " 
".Internal(paste0(list(...), collapse)))(.Primitive(\"abs\"), 
c(0.718117154669017, " "0.494785501621664, 0.1453434410505, 0.635028422810137, 
0.0353180423844606, " "0.688418723642826, 0.889682895969599, 0.728154224809259, 
0.292572240810841, " ...
    ##  $ : chr "do.call(paste0, list(abs, x))"
    ##  $ : chr "system.time(do.call(paste0, list(abs, x)))"

The first time I noticed this I thought my session had frozen/crashed
as the standard interrupt ^C does not work during the deparse.  The
stall happens when on error the call stack is deparsed prior to being
saved to `.Traceback`.  The deparsing is done by `deparse1m` in native
code, with the value of `getOption('deparse.max.lines')` which
defaults to all lines.

Since there is little value to seeing millions of lines of deparsed
objects in `traceback()`, a simple work-around is to change the
`deparse.max.lines` value:

    options(deparse.max.lines=1)
    system.time(do.call(paste0, list(abs, x)))
    ## Error in (function (..., collapse = NULL)  : 
    ##   cannot coerce type 'builtin' to vector of type 'character'
    ## Calls: system.time -> do.call -> 
    ## Timing stopped at: 0 0 0

Unfortunately this will affect all `deparse` calls, and it seems
undesirable to pre-emptively enable it just for calls that might cause
large deparses on error.

An alternative is to store the actual calls instead of their deparsed
character equivalents in `.Traceback`.  This defers the deparsing to 
when `traceback()` is used.  As per `?traceback`, it should be
relatively safe to modify `.Traceback` in this way:

> It is undocumented where .Traceback is stored nor that it is
> visible, and this is subject to change.

Deferring the deparsing to `traceback()` will give us the 
opportunity to use a different `max.lines` setting as we do here 
with the patch applied:

    system.time(do.call(paste0, list(abs, x)))
    ## Error in (function (..., collapse = NULL)  : 
    ##   cannot coerce type 'builtin' to vector of type 'character'
    ## Timing stopped at: 0.028 0 0.029

    system.time(traceback(max.lines=3))
    ## 3: (function (..., collapse = NULL) 
    ##    .Internal(paste0(list(...), collapse)))(.Primitive("abs"), 
c(0.535468587651849, 
    ##    0.0540027911774814, 0.732930393889546, 0.565360915614292, 
0.544816034380347, 
    ## ...
    ## 2: do.call(paste0, list(abs, x))
    ## 1: system.time(do.call(paste0, list(abs, x)))
    ##    user  system elapsed 
    ##   0.000   0.000   0.003 


More generally, it might be better to have a different, smaller default
value for the lines to deparse when calls are _displayed_ as parts of
lists, as is the case with `traceback()`, or in `print(sys.calls())` and
similar.

I attach a patch that does this.  I have run some basic tests 
and `make check-devel` passes. I can file an issue on bugzilla 
if that is a better place to have this conversation (assuming there 
is interest in it).

Best,

Brodie

PS: for some reason my mail client is refusing to attach the patch so I paste it
starting on the next line.
Index: src/gnuwin32/Rdll.hide
===
--- src/gnuwin32/Rdll.hide    (revision 76827)
+++ src/gnuwin32/Rdll.hide    (working copy)
@@ -94,6 +94,7 @@
  R_GetMaxNSize
  R_GetMaxVSize
  R_GetTraceback
+ R_GetTracebackParsed
  R_GetVarLocSymbol
  R_GetVarLocValue
  R_HandlerStack
Index: src/include/Defn.h
===
--- src/include/Defn.h    (revision 76827)
+++ src/include/Defn.h    (working copy)
@@ -1296,6 +1296,7 @@
 void NORET ErrorMessage(SEXP, int, ...);
 void WarningMessage(SEXP, R_WARNING, ...);
 SEXP R_GetTraceback(int);
+SEXP R_GetTracebackParsed(int);
 
 R_size_t R_GetMaxVSize(void);
 void R_SetMaxVSize(R_size_t);
Index: src/library/base/R/traceback.R
===
--- src/library/base/R/traceback.R    (revision 76827)
+++ src/library/base/R/traceback.R    (working copy)
@@ -16,9 +16,19 @@
 #  A copy of the GNU General Public License is available at
 #  https://www.R-project.org/Licenses/
 
-.traceback <- function(x = NULL) {
-    if(is.null(x) && !is.null(x <- get0(".Traceback", envir = baseenv(
-    {}
+.traceback <- function(x = NULL, max.lines=getOption("deparse.max.lines")) {
+    if(!(is.numeric(max.lines) && !is.na(max.lines) &&
+ as.integer(max.lines) > 0L)
+   

[Rd] Documentation tweak for ?traceback

2019-07-10 Thread brodie gaslam via R-devel
The addition of `.traceback` in r70207 adds one more function to the call stack 
when invoking `traceback()`.  This changes the output of one of the examples to 
include the error handler call:


> options(error = function() traceback(2))
> foo(2)
[1] 1
Error in bar(2) : object 'a.variable.which.does.not.exist' not found
3: (function () 
   traceback(2))() at #1
2: bar(2) at #1
1: foo(2)

The attached trivial patch cleans up the example so that the above looks like 
it would have under r<70207.

Index: src/library/base/man/traceback.Rd
===
--- src/library/base/man/traceback.Rd   (revision 76775)
+++ src/library/base/man/traceback.Rd   (working copy)
@@ -81,6 +81,6 @@
 ## Ah, this is the culprit ...
 
 ## This will print the stack trace at the time of the error.
-options(error = function() traceback(2))
+options(error = function() traceback(3))
 }
 \keyword{programming}
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Use of C++ in Packages

2019-03-30 Thread Brodie Gaslam via R-devel
It's great to see the community mobilize to try to resolve this issue. 
Obviously C++ has become a big part of R extensions, so it would be nice 
to have clear guidelines and tools to be able to use C++ safely with the 
R API.


Unfortunately doing this will probably require a fair bit of work.  If 
R-core were to do this it would take away from other valuable 
improvements they could be making on R itself.  Given there is already a 
supported and documented extension mechanism with access to the R API 
via C, I can see why R-core might be reluctant to divert resources from 
R development to add the same level of support for C++.


Obviously it would be impossible to try to provide better documentation 
and/or mechanisms for C++ extensions without some R-core involvement, 
but it seems like much of the grunt work could be done by others.  I 
unfortunately have no C++ experience so cannot help here, but hopefully 
there are others that have the experience and the recognition in the 
community to offer to help and have their offer accepted.  Perhaps the 
R Consortium can even fund this, although given the level of expertise 
required here the funding may need to be meaningful.


That seems like the natural step here.  Someone with the qualifications 
to do so either volunteers or is funded to do this, and hopefully R-core 
agrees to provide input and a final stamp of approval.  The documentation 
is probably more straightforward, whereas tools will need more work from 
R-core to integrate.  It is possible R-core may decline to do this, but 
absent someone actually offering to put in the hard work it's all 
theoretical.


Respectfully,

Brodie.

On 3/30/19 3:59 AM, Romain Francois wrote:

tl;dr: we need better C++ tools and documentation.

We collectively know more now with the rise of tools like rchk and improved 
documentation such as Tomas’s post. That’s a start, but it appears that there 
still is a lot of knowledge that would deserve to be promoted to actual 
documentation of best practices.

I think it is important not to equate C++ as a language with Rcpp.

Also, C++ is not just RAII.

RAII is an important part of how Rcpp was conceived for sure, but it’s not the 
only thing C++ can bring as a language. Templates, lambdas, the stl are 
examples of things that can be used for expressiveness when just accessing data 
without interfering with R, calling R api functions ...

It would be nice if the usual « you should do that only if you know what 
you're doing » were transformed into precise documentation, and maybe became 
part of some better tool. If precautions have to be taken before calling such 
and such functions, that's OK. What are they? Can we embed that in some tool?

  It is easy enough to enscope code that uses potentially jumpy code into a C++ 
lambda. This could go together with recommendations such as: the body of the 
lambda shall only use POD (plain old data) structures.

This is similar to precautions you’d take when writing concurrent code.

Romain


On 30 March 2019 at 00:58, Simon Urbanek  wrote:

Kevin,



On Mar 29, 2019, at 17:01, Kevin Ushey  wrote:

I think it's also worth saying that some of these issues affect C code
as well; e.g. this is not safe:

   FILE* f = fopen(...);
   Rf_eval(...);
   fclose(f);


I fully agree, but developers using C are well aware of the necessity of 
handling lifespan of objects explicitly, so at least there are no surprises.



whereas the C++ equivalent would likely handle closing of the file in the 
destructor. In other words, I think many users just may not be cognizant of the 
fact that most R APIs can longjmp, and what that implies for cleanup of 
allocated resources. R_alloc() may help solve the issue specifically for memory 
allocations, but for any library interface that has a 'open' and 'close' step, 
the same sort of issue will arise.


Well, I hope that anyone writing native code in package is well aware of that 
and will use an external pointer with finalizer to clean up native objects in 
any 3rd party library that are created during the call.



What I believe we should do, and what Rcpp has made steps towards, is make it 
possible to interact with some subset of the R API safely from C++ contexts. 
This has always been possible with e.g. R_ToplevelExec() and 
R_ExecWithCleanup(), and now things are even better with R_UnwindProtect(). In 
theory, as a prototype, an R package could provide a 'safe' C++ interface to 
the R API using R_UnwindProtect() and friends as appropriate, and client 
packages could import and link to that package to gain access to the interface. 
Code generators (as Rcpp Attributes does) can handle some of the pain in these 
interfaces, so that users are mostly insulated from the nitty gritty details.


I agree that we should strive to provide tools that make it safer, but note 
that it still requires participation of the users - they have to use such 
facilities or else they hit the same problem. So we can only fix this for the 
future, but

[Rd] Possible Update to R-internals Manual

2019-02-26 Thread brodie gaslam via R-devel
According to R-ints, the only current use of the `truelength` metadatum is 
for environment hash tables.  Jim Hester just made me aware that R 3.4.0 
introduced a new use case: growable vectors.

I attach a patch to the R-ints manual that reflects this change.  The wording 
is obviously just a suggestion.  Additionally, it may be worth moving the 
footnote into the main body of the document now that there are more use cases.
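
For reference, the new use can be observed directly.  A sketch, noting that
`.Internal(inspect())` output is undocumented and varies by build:

    x <- c(1L, 2L, 3L)
    x[5L] <- 5L            # assigning past the end grows the vector
    length(x)              # 5: the in-use length
    .Internal(inspect(x))  # should show len=5 with a larger tl (allocated)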

Best,

Brodie.

Index: doc/manual/R-ints.texi
===
--- doc/manual/R-ints.texi  (revision 76152)
+++ doc/manual/R-ints.texi  (working copy)
@@ -366,6 +366,9 @@
 
 Bit 4 is turned on to mark S4 objects.
 
+Bit 5 for vectors is used to indicate that the vector is overallocated
+and thus may be growable without a new allocation.
+
 Bits 1, 2, 3, 5 and 6 are used for a @code{CHARSXP} to denote its
 encoding.  Bit 1 indicates that the @code{CHARSXP} should be treated as
 a set of bytes, not necessarily representing a character in any known
@@ -406,16 +409,19 @@
 types are a @code{VECTOR_SEXPREC}, which again consists of the header
 and the same three pointers, but followed by two integers giving the
 length and `true length'@footnote{This is almost unused.  The only
-current use is for hash tables of environments (@code{VECSXP}s), where
+current uses are for hash tables of environments (@code{VECSXP}s), where
 @code{length} is the size of the table and @code{truelength} is the
-number of primary slots in use, and for the reference hash tables in
+number of primary slots in use, for the reference hash tables in
 serialization (@code{VECSXP}s), where @code{truelength} is the number of
-slots in use.} of the vector, and then followed by the data (aligned as
-required: on most 32-bit systems with a 24-byte @code{VECTOR_SEXPREC}
-node the data can follow immediately after the node).  The data are a
-block of memory of the appropriate length to store `true length'
-elements (rounded up to a multiple of 8 bytes, with the 8-byte blocks
-being the `Vcells' referred in the documentation for @code{gc()}).
+slots in use, and for vectors that are over-allocated due to assignment
+past the original length, where @code{length} is the in-use length and
+@code{truelength} is the allocated length.} of the vector, and then
+followed by the data (aligned as required: on most 32-bit systems with a
+24-byte @code{VECTOR_SEXPREC} node the data can follow immediately after
+the node).  The data are a block of memory of the appropriate length to
+store `true length' elements (rounded up to a multiple of 8 bytes, with
+the 8-byte blocks being the `Vcells' referred in the documentation for
+@code{gc()}).
 
 The `data' for the various types are given in the table below.  A lot of
 this is interpretation, i.e.@: the types are not checked.
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Possible setClassUnion Unloading Issue

2019-01-24 Thread Brodie Gaslam via R-devel

I'm encountering a problem caused by class unions from unloaded
packages that are referenced by the still-loaded subclasses.  In
essence, when the class union is loaded initially, it registers itself
as a super class of the component classes in the S4 cache.  When the
package the class union lives in is unloaded, it does not undo that
registration.  Here is an MRE of the problems it causes:

## Create a test package defining a class union of
## the character and NULL basic types, and install it
testpkg <- tempfile()
dir.create(file.path(testpkg, 'R'), recursive=TRUE)
writeLines('', file.path(testpkg, 'NAMESPACE'))
description <- 'Package: testpkg\nTitle: Textpkg\nVersion: 0.1\n'
writeLines(description, file.path(testpkg, 'DESCRIPTION'))
code <- 'setClassUnion("chrOrNULL", c("character", "NULL"))'
writeLines(code, file.path(testpkg, 'R', 'union.R'))
install.packages(testpkg, repos=NULL, type='src')

## Load the package and check validity of a character object
library('testpkg')
validObject('hello')
## [1] TRUE

## Detach, unload, and confirm that superclass is still attached
## to character
detach('package:testpkg', unload=TRUE)
getClassDef('character')
## Class "character" [package "methods"]
##
## No Slots, prototype of class "character"
##
## Extends: "vector", "data.frameRowLabels", "SuperClassMethod", "chrOrNULL"
##
## Known Subclasses:
## Class "signature", from data part
## Class "className", from data part
## Class "ObjectsWithPackage", from data part

## `validObject` now fails:
validObject('hello')
## Loading required package: testpkg
## Error in validObject("hello") :
##   invalid class "character" object: superclass "chrOrNULL" not defined
##   in the environment of the object's class

## But if we run it again it works
validObject('hello')
## [1] TRUE

There are two issues:

1. Whether the dangling superclass association in the `character` (and
   `NULL`) classes after the detach/unload is desirable.
2. That the attempt to reload the package to retrieve the class
   definition fails after the first call to `validObject`, though not
   the second.

I can see how it may be desirable to keep true superclasses associated
with objects, even if somehow the package that contains them is
unloaded.  The case for class unions _seems_ much weaker. I should not
need to reload my package to check validity of something like
'character'.  In my actual use case I had a class that contained
'character' among other things that I was checking with
`validObject(..., complete=TRUE)`.

I attach a patch that adds a cleanup step to `cacheMetaData` which is
kicked off by the package unload process. I grant I'm not an expert in
these matters and there could very well be an important use case where
removing the references is undesirable.

This brings us to the second issue: even if the first patch is
undesirable, `validObject` fails while attempting to reload the class
definition.  I believe the error that occurs when we try to run
`validObject` the first time happens in the `validObject` ->
`getClassDef` -> `.requirePackage` set of calls.  When
`.requirePackage` does not find our package already loaded, it loads,
attaches it, and returns the _package_ environment.  This is different
to what happens when the package namespace is already loaded, in which
case the _namespace_ environment is returned.

The problem with the _package_ environment is that any unexported
class definitions will not be found there.  This is why `validObject`
fails the first time, but works the second.  The first time the
package is loaded and `getClassDef` is given the package env to look
in, but it can't find 'chrOrNULL' because it is not exported.  The
second time `.requirePackage` finds the namespace from the prior load
and returns that, and this does have the class definition.
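
A sketch of the package vs. namespace environment distinction, reusing the
`testpkg` from above; the `.__C__<name>` metadata naming is an internal
convention, shown for illustration only:

    library(testpkg)
    pkg.env <- as.environment("package:testpkg")  # attached env: exports only
    ns.env  <- asNamespace("testpkg")             # namespace env
    exists(".__C__chrOrNULL", envir = pkg.env, inherits = FALSE)
    ## [1] FALSE  (class def not exported, absent from the package env)
    exists(".__C__chrOrNULL", envir = ns.env, inherits = FALSE)
    ## [1] TRUE   (but present in the namespace)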

Since `.requirePackage` is used much more broadly than
`cacheMetaData`, I am reluctant to submit a patch without feedback
from those more knowledgeable of the inner workings of S4.

I ran `make check-all` on a recent version of R-devel with the patch
applied, and everything came back OK.  However, I'm guessing there
isn't much natural testing of unloading packages, so this probably
requires extra scrutiny.

Best,

Brodie

Index: src/library/methods/R/RMethodUtils.R
===
--- src/library/methods/R/RMethodUtils.R(revision 76003)
+++ src/library/methods/R/RMethodUtils.R(working copy)
@@ -830,6 +830,14 @@
identical(cldef@package, pkg)) {
 .uncacheClass(cl, cldef)
 .removeSuperclassBackRefs(cl, cldef, searchWhere)
+# Adapted from `removeClass` we need to de-register references
+# to union superclasses from the package being detached
+if(is(cldef, 'ClassUnionRepresentation') &&
+   length(cld

Re: [Rd] setClass accepts slot-mismatch between slots and prototype arguments

2019-01-10 Thread brodie gaslam via R-devel
Indeed that was an oversight on my part.  It is surprising that things like 
this work:

> setClass('test', slots=c(a='ANY'), prototype=list(a=NULL, b='hello'))
> new('test')@b
[1] "hello"
> slotNames(new('test'))
[1] "a"

I'm planning a release of diffobj right now so I will fix this, but I agree 
that R should probably throw an error here.

Best,

B.




On Thursday, January 10, 2019, 11:30:13 AM EST, William Dunlap 
 wrote: 





I was installing the 'diffobj' package into TERR and got an error from the call
StyleSummary <- setClass("StyleSummary",
  slots=c(container="ANY", body="ANY", map="ANY"),
  prototype=list(
    container=function(x) sprintf("\n%s\n", paste0(x, collapse="")),
    body=identity,
    detail=function(x) sprintf("\n%s\n", paste0("  ", x, collapse="")),
    map=function(x) sprintf("\n%s", paste0("  ", x, collapse="\n"))
  ))
because the prototype contained components not in the slots list.  R does not 
complain about the mismatch, but new("StyleSummary") does make something 
with a 'detail' slot.  Should this be an error?

I suspect that the package writer intended to include 'detail' in the slots 
argument.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] grDevices::convertColor and colorRamp(space='Lab') Performance Improvements

2018-10-02 Thread Brodie Gaslam via R-devel
`grDevices::convertColor` performance can be improved by 30-300x with 
small changes to the code.  `colorRamp(space='Lab')` uses `convertColor` 
so it too benefits from substantial performance gains.


`convertColor` loops explicitly over the rows of the input color 
matrix using `apply`.  The level-1 patch [1] illustrates a possible 
minimal set of changes to vectorize this with just R matrix and vector 
operations.  The changes consist primarily of switching `if/else` to 
`ifelse`, `c` to `cbind`, `sum` to `rowSums`, etc.  This results in 
speedups of 30-100x as shown in table 1:


            to
from        Apple RGB   sRGB  CIE RGB    XYZ   Lab   Luv
  Apple RGB        NA   38.3     55.8   30.3  60.2  56.3
  sRGB           38.7     NA     55.7   36.5  62.9  52.7
  CIE RGB        45.2   44.4       NA   30.6  51.5  43.1
  XYZ            73.4   57.5     69.1     NA  92.2  69.0
  Lab            46.6   56.6     65.4   72.0    NA  61.3
  Luv            73.2  107.3     67.3  105.8  97.8    NA

## Table 1:
## Ratios of `grDevices` to 'level-1' patch speeds for 8000 colors
## from each supported color space to all other supported color spaces.

A few incremental changes as detailed in the level-2 patch [2] yield 
additional performance improvements:


            to
from        Apple RGB   sRGB  CIE RGB    XYZ  Lab    Luv
  Apple RGB        NA   97.1    106.2   89.0  117   83.4
  sRGB           92.5     NA     99.4   86.4  120   76.0
  CIE RGB       119.2  184.2       NA   82.2  135   83.4
  XYZ           122.3  209.8    140.9     NA  171  148.8
  Lab           166.4  168.2    255.4  288.5   NA  265.1
  Luv           141.7  173.6    279.6  310.1  195     NA

## Table 2:
## Ratios of `grDevices` to level-2 patch speeds for 8000 colors
## from each supported color space to all other supported color spaces.

Not shown here is that the patched versions of `convertColor` are faster 
with small inputs as well, though the difference is less marked.
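
As a toy illustration of where the gains come from (not the patch itself):
replace a row-at-a-time `apply` with whole-matrix operations.  The patch
applies this same style to the actual conversion math:

    clr <- matrix(runif(3 * 8000), ncol = 3)
    # row-at-a-time, in the style of the current code
    f_rows <- function(m)
      t(apply(m, 1, function(x) if (sum(x) > 1.5) x * 2 else x))
    # whole-matrix equivalent: one rowSums() call, one multiply
    f_vec <- function(m) m * ifelse(rowSums(m) > 1.5, 2, 1)
    stopifnot(all.equal(f_rows(clr), f_vec(clr)))
    system.time(for (i in 1:50) f_rows(clr))
    system.time(for (i in 1:50) f_vec(clr))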


I have posted tests on github [3], along with the results of running 
them on my system [4].  While these tests are not proof that the patches 
do not change the function other than to make it faster, the test 
results combined with the narrow scope of the changes hopefully provide 
sufficient evidence that this is the case.  For those wanting to run the 
tests, installation and testing instructions are on the github landing 
page for this project [5].


There are some minor (in my opinion) changes in behavior that need to be 
discussed:


* Inputs that would previously stop with errors or work inconsistently 
now work consistently (e.g. zero-row inputs or inputs containing 
NA/NaN/Inf).
* Column names are consistently set to the color space initials; these 
were previously inconsistently set / mangled by `c`.


These are discussed at length with examples in [6].

It would be possible to preserve the existing behavior, but doing so 
would require some contortions that serve no other purposes than that. 
Additionally, in many cases the existing behavior is inconsistent across 
different input/output color spaces.  Finally, most of the differences 
involve corner case inputs such as those containing NA/NaN/Inf, or zero 
rows.


I am entirely amenable to modify the patches to preserve existing 
behavior in these cases if that is considered desirable.


These patches should be coordinated with #17473 [7], which I found while 
working on this issue.


---

'level-1' patch:
[1]: 
https://raw.githubusercontent.com/brodieG/grDevices2/level-2/extra/level-1.txt


'level-2' patch:
[2]: 
https://raw.githubusercontent.com/brodieG/grDevices2/level-2/extra/level-2.txt


Tests on github, and the result of running them:
[3]: https://github.com/brodieG/grDevices2/blob/level-2/tests/convertColor.R
[4]: 
https://raw.githubusercontent.com/brodieG/grDevices2/level-2/tests/convertColor.Rout


Github landing page for this project:
[5]: https://github.com/brodieG/grDevices2

Discussion of differences introduced by patches:
[6]: 
https://htmlpreview.github.io/?https://raw.githubusercontent.com/brodieG/grDevices2/level-2/extra/differences.html


Indirectly related bugzilla issue #17473:
[7]: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17473

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Fwd: Bug report: cbind with numeric and raw gives incorrect result

2018-09-25 Thread brodie gaslam via R-devel



For what it's worth, the following patch fixes that particular problem on my 
system.  I have not checked very carefully to make sure this does not cause 
other problems, but at a high level it seems to make sense.  In this particular 
part of the code I believe `mode` is taken to be the highest type of "column" 
encountered by `ctype`, and based on the conditionals it can (I think) be up to 
REALSXP here.  This leads to `INTEGER()` being called on a REALSXP result, 
which presumably messes up the underlying double bit representation.
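
To see why that garbles values, here is a hedged illustration of the
reinterpretation, assuming little-endian layout: an integer's 4 bytes
dropped into an 8-byte double slot read back as a denormal, much like the
values in the report below:

    # 4 bytes of the integer 6, zero-padded to the 8 bytes of one double
    bytes <- c(writeBin(6L, raw()), as.raw(rep(0L, 4)))
    readBin(bytes, "double")
    ## [1] 2.964394e-323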

Again, I looked at this very quickly so I could be completely wrong, but I did 
at least build R with this patch and then no longer observed the odd behavior 
reported by mikefc.

Index: src/main/bind.c
===
--- src/main/bind.c    (revision 75340)
+++ src/main/bind.c    (working copy)
@@ -1381,11 +1381,16 @@
             MOD_ITERATE1(idx, k, i, i1, {
             LOGICAL(result)[n++] = RAW(u)[i1] ? TRUE : FALSE;
             });
-        } else {
+        } else if (mode == INTSXP) {
             R_xlen_t i, i1;
             MOD_ITERATE1(idx, k, i, i1, {
             INTEGER(result)[n++] = (unsigned char) RAW(u)[i1];
             });
+        } else {
+            R_xlen_t i, i1;
+            MOD_ITERATE1(idx, k, i, i1, {
+            REAL(result)[n++] = (unsigned char) RAW(u)[i1];
+            });
         }
         }
     }






On Tuesday, September 25, 2018, 7:58:31 AM EDT, mikefc 
 wrote: 





Hi there,

using cbind with a numeric and raw argument produces an incorrect result.

I've posted some details below,

kind regards,
Mike.



e.g.
> cbind(0, as.raw(0))
    [,1]          [,2]
[1,]    0 6.950136e-310



A longer example shows that the result is not a rounding error, is not
consistent, and repeated applications get different results.

> cbind(0, as.raw(1:10))
              [,1]          [,2]
[1,]  0.00e+00  0.00e+00
[2,]  0.00e+00  0.00e+00
[3,]  0.00e+00  0.00e+00
[4,]  0.00e+00  0.00e+00
[5,]  0.00e+00  6.950135e-310
[6,] 4.243992e-314  6.950135e-310
[7,] 8.487983e-314  6.324040e-322
[8,] 1.273197e-313  0.00e+00
[9,] 1.697597e-313 -4.343725e-311
[10,] 2.121996e-313  1.812216e-308


This bug occurs on
* mac os (with R 3.5.1)
* linux (with R 3.4.4)
* Windows (with R 3.5.0)




My Session Info
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/
A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/
Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats    graphics  grDevices utils    datasets  methods  base

other attached packages:
[1] memoise_1.1.0  ggplot2_3.0.0  nonogramp_0.1.0 purrr_0.2.5
dplyr_0.7.6

loaded via a namespace (and not attached):
[1] Rcpp_0.12.18    rstudioapi_0.7  bindr_0.1.1      magrittr_1.5
tidyselect_0.2.4 munsell_0.5.0    colorspace_1.3-2 R6_2.2.2
rlang_0.2.1.9000 stringr_1.3.1    plyr_1.8.4      tools_3.5.1
grid_3.5.1
[14] packrat_0.4.9-3  gtable_0.2.0    withr_2.1.2      digest_0.6.15
lazyeval_0.2.1  assertthat_0.2.0 tibble_1.4.2    crayon_1.3.4
bindrcpp_0.2.2  pryr_0.1.4      codetools_0.2-15 glue_1.3.0
labeling_0.3
[27] stringi_1.2.4    compiler_3.5.1  pillar_1.3.0    scales_0.5.0
pkgconfig_2.0.1


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Possible bug with chromatic adaptation in grDevices::convertColor

2018-09-13 Thread brodie gaslam via R-devel
It appears that the chromatic adaptation feature of `grDevices::convertColor` 
is broken, and likely has been for many years.  While a little surprising, it 
is an obscure enough feature that there is some possibility it is actually 
broken, as opposed to user error on my part.  If it turns out to be the 
latter, I apologize in advance for spamming this list.

Consider:

    rgb.in <- c("#CCCCCC", "#EEEEEE")
    clr <- t(col2rgb(rgb.in)) / 255
    clr.out <- convertColor(clr, "sRGB", "sRGB")
    rgb(clr.out)
    ## [1] "#CCCCCC" "#EEEEEE"

    convertColor(clr, "sRGB", "sRGB", "D65", "D50")
    ## Error in match.arg(from, nWhite) :
    ##   'arg' must be NULL or a character vector

This appears to be because `grDevices:::chromaticAdaptation` expects the 
whitepoints to be provided in the character format (e.g. "D65"), but they 
are already converted by `convertColor` into the tristimulus values.  After 
applying the patch at the end of this e-mail, we get:

    clr.out <- convertColor(clr, "sRGB", "sRGB", "D65", "D50")
    rgb(clr.out)
    ## [1] "#DACAB0" "#FEECCE"
I do not have a great way of confirming that the conversion is correct with 
my changes, but I did verify that the `M` matrix computed within 
`grDevices:::chromaticAdaptation` for the "D65" -> "D50" conversion 
(approximately) matches the corresponding matrix from the brucelindbloom.com 
chromatic adaptation page:

http://www.brucelindbloom.com/Eqn_ChromAdapt.html

Additionally, visual inspection via

    scales::show_col(c(rgb.in, rgb(clr.out)))

is consistent with a shift from bluer indirect daylight ("D65") to yellower 
direct daylight ("D50") illuminant.
It is worth noting that the adaptation method currently in 
`grDevices:::chromaticAdaptation` appears to be the "Von Kries" method, not 
the "Bradford" method as documented in `?convertColor` and in the comments 
of the sources.  I base this on comparing the cone response domain matrices 
on the aforementioned brucelindbloom.com page to the `Ma` matrix defined in 
`grDevices:::chromaticAdaptation`.

Given that brucelindbloom.com appears to recommend "Bradford", that the 
sources suggest that was the intent, that `chromaticAdaptation` is only used 
by `convertColor` in the R sources, that `chromaticAdaptation` is not 
exported, and that the feature appears currently inaccessible via 
`convertColor`, it may be worth using this opportunity to change the 
adaptation method to "Bradford".
A suggested patch follows.  It is intended to minimize the required changes, 
although doing so requires a double transposition.  The transpositions could 
be easily avoided, but that would require reformulating the calculations in 
`chromaticAdaptation`.
Best,
Brodie.

Index: src/library/grDevices/R/convertColor.R
===
--- src/library/grDevices/R/convertColor.R    (revision 75298)
+++ src/library/grDevices/R/convertColor.R    (working copy)
@@ -81,7 +81,7 @@
 }
 
 chromaticAdaptation <- function(xyz, from, to) {
-    ## bradford scaling algorithm
+    ## Von Kries scaling algorithm
 Ma <- matrix(c( 0.40024, -0.22630, 0.,
 0.70760,  1.16532, 0.,
    -0.08081,  0.04570, 0.91822), nrow = 3L, byrow = TRUE)
@@ -242,8 +242,8 @@
   if (is.null(from.ref.white))
   from.ref.white <- to.ref.white
 
-  from.ref.white <- c2to3(white.points[, from.ref.white])
-  to.ref.white   <- c2to3(white.points[, to.ref.white])
+  from.ref.white.3 <- c2to3(white.points[, from.ref.white])
+  to.ref.white.3   <- c2to3(white.points[, to.ref.white])
 
   if (is.null(nrow(color)))
 color <- matrix(color, nrow = 1L)
@@ -262,19 +262,19 @@
   rgb
   }
 
-  xyz <- apply(color, 1L, from$toXYZ, from.ref.white)
+  xyz <- apply(color, 1L, from$toXYZ, from.ref.white.3)
 
   if (is.null(nrow(xyz)))
 xyz <- matrix(xyz, nrow = 1L)
 
-  if (!isTRUE(all.equal(from.ref.white, to.ref.white))) {
+  if (!isTRUE(all.equal(from.ref.white.3, to.ref.white.3))) {
   mc <- match.call()
   if (is.null(mc$from.ref.white) || is.null(mc$to.ref.white))
   warning("color spaces use different reference whites")
-  xyz <- chromaticAdaptation(xyz, from.ref.white, to.ref.white)
+  xyz <- t(chromaticAdaptation(t(xyz), from.ref.white, to.ref.white))
   }
 
-  rval <- apply(xyz, 2L, to$fromXYZ, to.ref.white)
+  rval <- apply(xyz, 2L, to$fromXYZ, to.ref.white.3)
 
   if (inherits(to,"RGBcolorConverter"))
   rval <- trim(rval)





__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Possible `substr` bug in UTF-8 Corner Case

2018-03-29 Thread brodie gaslam via R-devel
Thank you for the quick response and for the quick fix (and for the rchk 
vagrant image I used to build and test the below!).
One thing I'll note about the fix is that it may start breaking things that 
used to "work".  I think it is fair to say that character count is not well 
defined with illegal UTF-8 sequences (and noteworthy that `nchar` does actually 
stop when it encounters them), but there may be a bit of code out there that 
relied on being able to successfully complete (albeit while potentially 
corrupting memory) that will now produce errors.  It may be worth highlighting 
this in the release notes.

Best,
Brodie.


On Thursday, March 29, 2018, 9:11:15 AM EDT, Tomas Kalibera 
 wrote:  
 
 Thanks, fixed in R-devel (by checking validity of UTF-8 strings for 
substr/substring).
Tomas

On 03/29/2018 03:53 AM, brodie gaslam via R-devel wrote:
> I think there is a memory bug in `substr` that is triggered by a UTF-8 corner 
> case: an incomplete UTF-8 byte sequence at the end of a string.  With a 
> valgrind level 2 instrumented build of R-devel I get:
>
>> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
>> Encoding(string) <- "UTF-8"
>> substr(string, 1, 10)
> ==15375== Invalid read of size 1
> ==15375==    at 0x45B3F0: substr (character.c:286)
> ==15375==    by 0x45B3F0: do_substr (character.c:342)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x4D9561: Rf_eval (eval.c:747)
> ==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
> ==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
> ==15375==    by 0x507494: run_Rmainloop (main.c:1082)
> ==15375==    by 0x41A8E6: main (Rmain.c:29)
> ==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 
> alloc'd
> ==15375==    at 0x4C2DB8F: malloc (in 
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==15375==    by 0x51033E: GetNewPage (memory.c:888)
> ==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
> ==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
> ==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
> ==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
> ==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
> ==15375==    by 0x4660EB: ascommon (coerce.c:1369)
> ==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
> ==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
> ==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
> ==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
> ==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
> ==15375==    by 0x516379: Rf_usemethod (objects.c:458)
> ==15375==    by 0x516694: do_usemethod (objects.c:543)
> ==15375==
> [1] "abc"
>
> Here is a patch for the native version of `substr` that highlights the 
> problem and a possible fix.  Basically `substr` computes the byte width of a 
> UTF-8 character based on the leading byte ("\xEE" here, which implies 3 
> bytes) and reads/writes that entire byte width irrespective of whether the 
> string actually ends before the theoretical end of the UTF-8 "character".
>
> Index: src/main/character.c
> ===
> --- src/main/character.c(revision 74482)
> +++ src/main/character.c(working copy)
> @@ -283,7 +283,7 @@
>  for (i = 0; i < so && str < end; i++) {
>  int used = utf8clen(*str);
>  if (i < sa - 1) { str += used; continue; }
> -for (j = 0; j < used; j++) *buf++ = *str++;
> +for (j = 0; j < used && str < end; j++) *buf++ = *str++;
>  }
>   } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
>  for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
>
> The change above removed the valgrind error for me.  I re-built R with the 
> change and ran "make check" which seemed to work fine. I also ran some simple 
> checks on UTF-8 strings and things seem to work okay.
>
> I have very limited experience making changes to R (this is my first attempt 
> at a patch) so please take all of the above with extreme skepticism.
>
> Apologies in advance if this turns out to be a false alarm caused by an error 
> on my part.
>
> Best,
>
> Brodie.
>
> PS: apologies also if the formatting of this e-mail is bad.  I have not 
> figured out how to get plaintext working properly with yahoo.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


  

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Possible `substr` bug in UTF-8 Corner Case

2018-03-28 Thread brodie gaslam via R-devel
I think there is a memory bug in `substr` that is triggered by a UTF-8 corner 
case: an incomplete UTF-8 byte sequence at the end of a string.  With a 
valgrind level 2 instrumented build of R-devel I get:

> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==15375== Invalid read of size 1
==15375==    at 0x45B3F0: substr (character.c:286)
==15375==    by 0x45B3F0: do_substr (character.c:342)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x4D9561: Rf_eval (eval.c:747)
==15375==    by 0x507008: Rf_ReplIteration (main.c:258)
==15375==    by 0x5073E7: R_ReplConsole (main.c:308)
==15375==    by 0x507494: run_Rmainloop (main.c:1082)
==15375==    by 0x41A8E6: main (Rmain.c:29)
==15375==  Address 0xb9e518d is 3,869 bytes inside a block of size 7,960 alloc'd
==15375==    at 0x4C2DB8F: malloc (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15375==    by 0x51033E: GetNewPage (memory.c:888)
==15375==    by 0x511FC0: Rf_allocVector3 (memory.c:2691)
==15375==    by 0x4657AC: Rf_allocVector (Rinlinedfuns.h:577)
==15375==    by 0x4657AC: Rf_ScalarString (Rinlinedfuns.h:1007)
==15375==    by 0x4657AC: coerceToVectorList (coerce.c:892)
==15375==    by 0x4657AC: Rf_coerceVector (coerce.c:1293)
==15375==    by 0x4660EB: ascommon (coerce.c:1369)
==15375==    by 0x4667C0: do_asvector (coerce.c:1544)
==15375==    by 0x4CFCB9: bcEval (eval.c:6775)
==15375==    by 0x4D95AF: Rf_eval (eval.c:624)
==15375==    by 0x4DAD12: R_execClosure (eval.c:1764)
==15375==    by 0x515EF7: dispatchMethod (objects.c:408)
==15375==    by 0x516379: Rf_usemethod (objects.c:458)
==15375==    by 0x516694: do_usemethod (objects.c:543)
==15375== 
[1] "abc"

Here is a patch for the native version of `substr` that highlights the problem 
and a possible fix.  Basically `substr` computes the byte width of a UTF-8 
character based on the leading byte ("\xEE" here, which implies 3 bytes) and 
reads/writes that entire byte width irrespective of whether the string actually 
ends before the theoretical end of the UTF-8 "character".
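
For reference, an R analogue of the leading-byte logic at issue (a sketch of 
the idea behind `utf8clen`, not R's actual implementation):

    utf8_width <- function(byte) {
      b <- as.integer(byte)
      if (b < 0x80) 1L         # ASCII
      else if (b < 0xC0) 1L    # continuation byte, invalid as a lead byte
      else if (b < 0xE0) 2L
      else if (b < 0xF0) 3L
      else 4L
    }
    utf8_width(as.raw(0xEE))   # 3: announces 3 bytes even if the string ends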

Index: src/main/character.c
===
--- src/main/character.c(revision 74482)
+++ src/main/character.c(working copy)
@@ -283,7 +283,7 @@
for (i = 0; i < so && str < end; i++) {
int used = utf8clen(*str);
if (i < sa - 1) { str += used; continue; }
-for (j = 0; j < used; j++) *buf++ = *str++;
+for (j = 0; j < used && str < end; j++) *buf++ = *str++;
}
 } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;

The change above removed the valgrind error for me.  I re-built R with the 
change and ran "make check" which seemed to work fine. I also ran some simple 
checks on UTF-8 strings and things seem to work okay.

I have very limited experience making changes to R (this is my first attempt at 
a patch) so please take all of the above with extreme skepticism.

Apologies in advance if this turns out to be a false alarm caused by an error 
on my part.

Best,

Brodie.

PS: apologies also if the formatting of this e-mail is bad.  I have not figured 
out how to get plaintext working properly with yahoo.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Illegal Logical Values

2017-10-20 Thread brodie gaslam via R-devel
I'm wondering if WRE Section 5.2 should be a little more explicit about misuse 
of integer values other than NA, 0, and 1 in LGLSXPs.  I'm thinking of this 
passage:

> Logical values are sent as 0 (FALSE), 1 (TRUE) or INT_MIN = -2147483648 (NA, 
> but only if NAOK is true), and the compiled code should return one of these 
> three values. (Non-zero values other than INT_MIN are mapped to TRUE.) 

The parenthetical seems to suggest that something like 'LOGICAL(x)[0] = 2;' 
will be treated as TRUE, which it sometimes is, and sometimes isn't:

not.true <- inline::cfunction(body='
  SEXP res = allocVector(LGLSXP, 1);
  LOGICAL(res)[0] = 2;
  return res;'
)()
not.true
## [1] TRUE
not.true == TRUE
## [1] FALSE
not.true[1] == TRUE  # due to scalar subset handling
## [1] TRUE
not.true == 2L
## [1] TRUE


Perhaps a more explicit warning that using anything other than 0, 1, or NA is 
undefined behavior is warranted?  Obviously people should know better than to 
expect correct behavior, but the fact that the behavior is correct in some 
cases (e.g. printing, scalar subsetting) might be confusing.
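
For contrast, a sketch of the well-defined pattern, using the same
`inline::cfunction` setup as above: explicitly normalize to 0/1 (or
`NA_LOGICAL`) before writing:

    is.true <- inline::cfunction(body='
      SEXP res = allocVector(LGLSXP, 1);
      int val = 2;                   /* stand-in for some C computation */
      LOGICAL(res)[0] = (val != 0);  /* normalize to 0/1 */
      return res;'
    )()
    is.true == TRUE
    ## [1] TRUE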

This is based off of Drew Schmidt's accidental discovery yesterday.
Best,
Brodie.


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] `c` with lists with "bytes" names encoding

2017-08-04 Thread brodie gaslam via R-devel
I'm not entirely sure this even qualifies as a bug given how unusual a case it 
is:
> x <- list('a')
> name.x <- '\x81'
> Encoding(name.x) <- 'bytes'
> names(x) <- name.x
> x
$`\\x81`
[1] "a"
> c(x)
Error: translating strings with "bytes" encoding is not allowed
> unlist(x)
Error in unlist(x) :
  translating strings with "bytes" encoding is not allowed
> letters[name.x]
[1] NA
> letters[[name.x]]
Error: translating strings with "bytes" encoding is not allowed
Presumably the error is coming from `translateCharsUTF8` or similar.  I imagine 
the "fix" would be to not translate byte encoded names, or not allow them.

This is not really a problem for me as I can easily work around it.  Just an 
fyi in case it is of interest to you.

Best regards,

Brodie.
P.S.: SessionInfo():
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: 
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: 
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] diffobj_0.1.6.9002  unitizer_1.4.2.9003

loaded via a namespace (and not attached):
[1] vetr_0.1.0.9000 compiler_3.4.0  tools_3.4.0 withr_1.0.2
[5] crayon_1.3.2memoise_1.1.0   digest_0.6.12   devtools_1.12.0


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] `match.call` and dots substitution

2017-04-30 Thread brodie gaslam via R-devel
I'm noticing some interesting behavior in `match.call` in some corner-ish cases 
that arise when you try to use `match.call` to match a "grandparent" function 
and there are dots involved:


fun0 <- function(a, ...) fun1(...)
fun1 <- function(b, ...) fun2()
fun2 <- function()
  match.call(
    fun1, sys.call(sys.parent()), expand.dots=FALSE,
    envir=parent.frame(2)
  )
fun0(1, 2, 3)

## fun1(b = 2, ... = pairlist(3))

fun0((1), (2), (3))   # the parens make the substituted args language

## fun1(b = ..1, ... = pairlist(..2))


When the values in dots that need to be substituted are language, they get 
substituted by `..n` where `n` is the position in dots.  When they are scalars 
they are substituted with the scalar.  It appears this is done in 
'R-3.4.0:src/main/unique.c@1319' in `subDots`:

while (TYPEOF(t) == PROMSXP)
t = PREXPR(t);
if( isSymbol(t) || isLanguage(t) )
  SETCAR(b, installDDVAL(i));
else
  SETCAR(b, t);


I'm not sure why it is necessary to use `installDDVAL`, which creates the `..n` 
symbols, instead of duplicating the language object and attaching that to the 
matched call.  Certainly from a user perspective I would prefer to see:

## fun1(b = (2), ... = pairlist((3)))

instead of:

   ## fun1(b = ..1, ... = pairlist(..2))


Do others agree?  Is there a structural reason why this cannot be done?  Note 
this behavior has been this way since at least R3.0 and probably earlier.


Best regards,

B.

PS: sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: 
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: 
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

loaded via a namespace (and not attached):
[1] compiler_3.4.0

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Changes in error reporting in r-devel

2016-11-27 Thread brodie gaslam via R-devel
I'm running off of the rocker/drd docker image.  As soon as that updates to a 
fresher version I'll run again, although given what both of you see I must have 
caught some fleeting state in the particular revision I'm on.

Best regards,

B.




- Original Message -
From: Dirk Eddelbuettel 
To: Duncan Murdoch 
Cc: brodie gaslam ; "r-devel@r-project.org" 

Sent: Sunday, November 27, 2016 2:44 PM
Subject: Re: [Rd] Changes in error reporting in r-devel


On 27 November 2016 at 13:20, Duncan Murdoch wrote:
| On 27/11/2016 11:34 AM, brodie gaslam via R-devel wrote:
| > Minor issue, but the following changed as of R3.3.2 from:
| >
| > > a <- function() b()
| > > a()
| > Error in a() : could not find function "b"
| >
| > To (at least in R Under development (unstable) (2016-11-20 r71670)):
| >
| > Error in b() : could not find function "b"
| >
| > Notice the "Error in **b**() :" part.  The original error message seems 
more correct to me, although I can see how you can argue for either one.
| 
| I'm not seeing that in R-devel r71694.  Could you update and try again?

I am at r71690 and I don't see it either:

edd@max:~/src/debian RD

R Under development (unstable) (2016-11-24 r71690) -- "Unsuffered Consequences"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

R> a <- function() b()
R> a()

Error in a() : could not find function "b"
R> 

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Changes in error reporting in r-devel

2016-11-27 Thread brodie gaslam via R-devel
Minor issue, but the following changed as of R3.3.2 from:

> a <- function() b()
> a()
Error in a() : could not find function "b"

To (at least in R Under development (unstable) (2016-11-20 r71670)):

Error in b() : could not find function "b"

Notice the "Error in **b**() :" part.  The original error message seems more 
correct to me, although I can see how you can argue for either one.

Best regards,

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Frames in compiled functions

2016-11-11 Thread brodie gaslam via R-devel
I noticed some problems that cropped in the latest versions of R-devel 
(2016-11-08 r71639 in my case) for one of my packages.  I _think_ I have 
narrowed it down to the changes to what gets byte-compiled by default.  The 
following example run illustrates the problem I'm having:

  compiler::enableJIT(0)
  fun <- function(x) local(as.list(parent.frame(2)))
  fun(1)
  ## $x
  ## [1] 1
  ## 

  compiler::cmpfun(fun)(1)
  ## 


Is this considered problematic at all?  If so, might it make sense to broaden 
the list of functions that disable JIT compilation beyond `browser`?  I'm 
pretty sure I can work around this issue in my specific use case, but figured 
it is worth mentioning here since it is a change in behavior.


Regards,

B.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] A package that traces base functions

2016-09-29 Thread brodie gaslam via R-devel
I'm working on a package that implements a REPL.  A typical interaction with 
the package might look like:

    > launch_REPL()
    REPL> 1 + 1
    [1] 2
    REPL> Q
    Do you wish to save results? [y/n]
    REPL> y
    goodbye ...
    >

This is very similar to what happens in `browser()`: the REPL evaluates 
arbitrary R user expressions and offers some additional commands.
In order to implement functionality required for the REPL I must trace some 
functions in the base package.  The trace is removed `on.exit()` from the REPL, 
so the functions are only modified while the `launch_REPL` function is 
evaluating.  Unfortunately this is against the letter of the law (as per CRAN 
policy):
> A package must not tamper with the code already loaded into R: anyattempt to 
> change code in the standard and recommended packages whichship with R is 
> prohibited.
Is there any chance that this very limited (only during my function evaluation) 
modification of base functions with `trace` could be considered to meet the 
spirit of the law, if not the letter?  Package users would be duly notified 
this is happening.

Regards,
Brodie Gaslam.
PS: More details for those who care: the REPL among other things implements an 
environment whose parent is `as.environment(2)`, so that objects in the 
global environment are not visible while in the REPL, but the rest of the 
search path is.  Anytime the search path changes I need to update the REPL 
environment to re-point to `as.environment(2)`, which means I need to know when 
the search path changes.  I do this by tracing `library`/`attach`/`detach` and 
triggering a side effect that updates the REPL environment parent any time 
those are called.  The search path itself is untouched.  I cannot just parse 
user expressions searching for those functions as the user can use any 
arbitrary expressions, including sourcing files that contain the `library`, 
etc. calls.  
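
For the curious, a minimal sketch of the mechanism with hypothetical names;
the real package also restores everything via `on.exit()`:

    repl.env <- new.env(parent = as.environment(2))
    update_parent <- function() {
      # environments are references, so this mutates repl.env in place
      # (`parent.env<-` is discouraged, but it is what this trick needs)
      parent.env(repl.env) <- as.environment(2)
    }
    trace("library", exit = update_parent, print = FALSE)
    trace("attach",  exit = update_parent, print = FALSE)
    trace("detach",  exit = update_parent, print = FALSE)
    ## ... REPL evaluates user expressions in repl.env ...
    untrace("library"); untrace("attach"); untrace("detach")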



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] match.call enhancements in 3.2.2

2015-10-30 Thread brodie gaslam via R-devel
I was thrilled to see this in the change log:
> match.call() gains an envir argument for specifying the environment from 
> which to retrieve the ... in the call, if any; this environment was wrong (or 
> at least undesirable) when the definition argument was a function.
This was an issue I ran into a few years ago.  I ended up writing my own 
version that allows specification of the environment, but being able to rely on 
the built in R version seems much better.  That said, one issue remains as far 
as I can tell: if the `...` values are expressions they are substituted by 
`..1`, `..2`, etc.  For example:
> fun0 <- function(...) {
>   fun_gpar <- function(b, ...) {
>     fun_par <- function(a)
>       match.call(fun_gpar, sys.call(sys.parent()), envir=parent.frame(2L))
>     fun_par()
>   }
>   fun_gpar(...)
> }
> fun0(999, 2 + 2, 3 * pi(), 4)

Returns:
> fun_gpar(b = 999, ..2, ..3, 4)
Instead of:

> fun_gpar(b = 999, 2 + 2, 3 * pi(), 4)
This is better than what used to happen before we had access to the `envir` 
parameter:

> fun_gpar(b = ..1, ..2, 4)
The dot substitution happens in `subDots` in `src/main/unique.c@1228` (in 3.2.2 
sources available as of today in CRAN).  Here is a snippet from that function:
>    for(a = dots, b = rval, i = 1; i <= len; a = CDR(a), b = CDR(b), i++) {
>        SET_TAG(b, TAG(a));
>        t = CAR(a);
>        while (TYPEOF(t) == PROMSXP)
>            t = PREXPR(t);
>        if( isSymbol(t) || isLanguage(t) )
>            SETCAR(b, installDDVAL(i));   // <-- HERE
>        else
>            SETCAR(b, t);
>    }

The `installDDVAL(i)` generates the `..X` symbols.  I'm sure there is a very 
good reason why this is done, but then it does lead to the issues above.
Am I just doing something wrong?
My workaround (https://github.com/brodieG/matchcall) addresses both the `..X` 
issue as well as the environment issue, but I'd prefer to deprecate it in favor 
of a base R solution if possible.
B.


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Inconsistent Parse Behavior

2014-12-25 Thread brodie gaslam via R-devel
Under some specific conditions, `parse` seems to produce inconsistent and 
potentially incorrect results the first time it is run in a fresh clean R 
session.  Consider this code where we parse the same text twice in a row, and 
get one value in the parse data that is mismatched:
```
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> txt <- 'c("", {
+   c(integer(3L), 1:3)
+   c(integer(), 1:3, 1L) # TRUE
+   c(integer(), c(1, 2, 3), 1L)  # TRUE
+ } )
+ c("", {
+   lst <- list(list( 1,  2), list( 3, list( 4, list( 5, list(6, 6.1, 6.2)
+ } )
+ c("", {
+   TRUE
+ } )'
> prs1 <- parse(text=txt, keep.source=TRUE)
> prs2 <- parse(text=txt, keep.source=TRUE)
> which(attr(prs1, "srcfile")$parseData != attr(prs2, "srcfile")$parseData)
[1] 1176
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 
```

This discrepancy does not happen if I simplify the code to parse in any way. 
The code as it is is a much simplified version of the code that first produced 
the error for me.  I cannot reduce it further without also eliminating the 
error.
Unfortunately, the discrepancy is meaningful.  The problem is the first parse.  
Looking at `getParseData` output:

```
> subset(getParseData(prs1), id %in% c(226, 234))
    line1 col1 line2 col2  id parent token terminal text
226     6    1     8    3 226    234  expr    FALSE     
234     9    5     9    5 234    251   ','     TRUE    ,
```

Notice how item 226 has for parent item 234, which starts on line 9, col 5, 
after item 226 ends.  I'm not sure how this is possible.
In the second parse, the parse data is as one would expect:

```
> subset(getParseData(prs2), id == 226)
    line1 col1 line2 col2  id parent token terminal text
226     6    1     8    3 226      0  expr    FALSE     
```

The parent here is the top level (0), as would be expected looking at the 
source code in `txt` (226 represents the second `c(...)` block).
I suspect the problem is caused by the use of `{}` inside of `f()`, but again, 
it is not that simple since any further simplification of my code above seems 
to resolve the problem.  I also don't know why it would work fine the second 
time, though there must be some state initialization inside the parser going on.
Any help appreciated.
Best,
Brodie



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel