Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-22 Thread Tomas Kalibera



On 6/21/21 9:25 PM, Bill Dunlap wrote:

NULL cannot be in an integer or numeric vector so it would not be a good
fit for substring's 'first' or 'last' argument (or substr's 'start' and
'stop').


Yes, that would only work if used as a scalar, such as in the default 
for 'last' where 1000000L is used now.


In other cases, users already had to provide their own values for 'last' 
explicitly, and hence they would know if they provided a value too small 
given their data.



  Also, it is conceivable that string lengths may be 64 bit
integers in the future, so why not use Inf as the default?  Then the
following would give 4 identical results with no warning:


Yes, that would work also in vector use, but integers over 2^53 won't be 
representable as doubles exactly, so we would  have to revisit/change 
the interface when moving to 64 bit integers.
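
To make the 2^53 limit concrete, here is a minimal C sketch (not R source; the helper name round_trips_through_double is made up for illustration, and IEEE-754 doubles are assumed). It checks whether a 64-bit integer survives a conversion to double and back:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if v is exactly representable as a
   double, i.e. converting to double and back yields v again.
   Restricted to |v| <= 2^62 so the conversion back cannot overflow. */
static int round_trips_through_double(int64_t v)
{
    const int64_t limit = (int64_t)1 << 62;
    assert(v >= -limit && v <= limit);
    double d = (double)v;      /* rounds to the nearest representable double */
    return (int64_t)d == v;
}
```

Under those assumptions, every 32-bit index and 2^53 itself round-trip, while 2^53 + 1 does not, so a double-valued 'last' could not name every position in a string indexed by 64-bit integers.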


Yet another option would be say using -1, that would also work with 
vector use and integers. But, negative indexes (and zero) are now 
treated as start of the string (1), and while not documented, perhaps 
this is good/intuitive behavior.
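
A rough C sketch of the clamping described above, for 1-based inclusive indices (the function name clamp_range is hypothetical and this is not the actual R implementation; it works on abstract "units" and ignores encodings):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: clamp a 1-based inclusive [first, last] range to a string of
   `len` units. Writes the 0-based start offset and the number of units
   to copy. */
static void clamp_range(int len, int first, int last,
                        size_t *start, size_t *count)
{
    assert(len >= 0);
    if (first < 1) first = 1;      /* zero and negatives act as position 1 */
    if (last > len) last = len;    /* an oversized 'last' means end of string */
    if (first > last) {            /* nothing selected */
        *start = 0;
        *count = 0;
        return;
    }
    *start = (size_t)(first - 1);
    *count = (size_t)(last - first + 1);
}
```

With this kind of clamping, any sufficiently large 'last' already behaves as "end of string", whereas last = -1 selects nothing, so using -1 as a sentinel would change what a negative 'last' means today.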


Tomas


substring("abcde", 3, c(10, 2^31-1, 2^31, Inf))

[1] "cde" "cde" NA    NA
Warning message:
In substring("abcde", 3, c(10, 2^31 - 1, 2^31, Inf)) :
   NAs introduced by coercion to integer range

-Bill

On Mon, Jun 21, 2021 at 10:22 AM Michael Chirico 
wrote:


Thanks all, great points well taken. Indeed it seems the default of
1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of
encoding makes sense and avoids the overheads of a $ call and a much
heavier nchar() calculation.

Mike C

On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler
 wrote:

Tomas Kalibera
 on Mon, 21 Jun 2021 10:08:37 +0200 writes:

 > On 6/21/21 9:35 AM, Martin Maechler wrote:
 >>> Michael Chirico
 >>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
 >> > Currently, substring defaults to last=1000000L, which
 >> > strongly suggests the intent is to default to "nchar(x)"
 >> > without having to compute/allocate that up front.
 >>
 >> > Unfortunately, this default makes no sense for "very
 >> > large" strings which may exceed 1000000L in "width".
 >>
 >> Yes;  and I tend to agree with you that this default is outdated
 >> (Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
 >> student lab  Macs in Auckland in ca 1994).
 >>
 >> > The max width of a string is .Machine$integer.max-1:
 >>
 >> (which Brodie showed was only almost true)
 >>
 >> > So it seems to me either .Machine$integer.max or
 >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
 >> > something?
 >>
 >> The "drawback" is of course that .Machine$integer.max  is still
 >> a function call (as R beginners may forget) contrary to L,
 >> but that may even be inlined by the byte compiler (? how would we check ?)
 >> and even if it's not, it does more clearly convey the concept
 >> and idea  *and* would probably even port automatically if ever
 >> integer would be increased in R.

 > We still have the problem that we need to count characters, not bytes,
 > if we want the default semantics of "until the end of the string".

 > I think we would have to fix this either by really using
 > "nchar(type="c"))" or by using e.g. NULL and then treating this as a
 > special case, that would be probably faster.

 > Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change
the default there.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-21 Thread Bill Dunlap
NULL cannot be in an integer or numeric vector so it would not be a good
fit for substring's 'first' or 'last' argument (or substr's 'start' and
'stop').  Also, it is conceivable that string lengths may be 64 bit
integers in the future, so why not use Inf as the default?  Then the
following would give 4 identical results with no warning:

> substring("abcde", 3, c(10, 2^31-1, 2^31, Inf))
[1] "cde" "cde" NA    NA
Warning message:
In substring("abcde", 3, c(10, 2^31 - 1, 2^31, Inf)) :
  NAs introduced by coercion to integer range
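
One way the coercion could avoid the NAs is a saturating conversion. The C sketch below is only an illustration of that idea (saturate_last is a hypothetical name, not an R internal, and NaN/NA handling is left out):

```c
#include <assert.h>
#include <limits.h>
#include <math.h>

/* Hypothetical saturating conversion for a numeric 'last' argument:
   values at or beyond INT_MAX (including +Inf) become INT_MAX instead
   of NA. NaN must be handled by the caller in this sketch. */
static int saturate_last(double last)
{
    assert(!isnan(last));
    if (last >= (double)INT_MAX) return INT_MAX;
    if (last <= (double)INT_MIN) return INT_MIN;
    return (int)last;          /* in range now; truncates toward zero */
}
```

Under this assumption 10, 2^31-1, 2^31 and Inf all map to a 'last' that is at or past the end of "abcde", so the four results would be identical.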

-Bill

On Mon, Jun 21, 2021 at 10:22 AM Michael Chirico 
wrote:

> Thanks all, great points well taken. Indeed it seems the default of
> 1000000 predates SVN tracking in 1997.
>
> I think a NULL default behaving as "end of string" regardless of
> encoding makes sense and avoids the overheads of a $ call and a much
> heavier nchar() calculation.
>
> Mike C
>
> On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler
>  wrote:
> >
> > > Tomas Kalibera
> > > on Mon, 21 Jun 2021 10:08:37 +0200 writes:
> >
> > > On 6/21/21 9:35 AM, Martin Maechler wrote:
> > >>> Michael Chirico
> > >>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
> > >> > Currently, substring defaults to last=1000000L, which
> > >> > strongly suggests the intent is to default to "nchar(x)"
> > >> > without having to compute/allocate that up front.
> > >>
> > >> > Unfortunately, this default makes no sense for "very
> > >> > large" strings which may exceed 1000000L in "width".
> > >>
> > >> Yes;  and I tend to agree with you that this default is outdated
> > >> (Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
> > >> student lab  Macs in Auckland in ca 1994).
> > >>
> > >> > The max width of a string is .Machine$integer.max-1:
> > >>
> > >> (which Brodie showed was only almost true)
> > >>
> > >> > So it seems to me either .Machine$integer.max or
> > >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
> > >> > something?
> > >>
> > >> The "drawback" is of course that .Machine$integer.max  is still
> > >> a function call (as R beginners may forget) contrary to L,
> > >> but that may even be inlined by the byte compiler (? how would we check ?)
> > >> and even if it's not, it does more clearly convey the concept
> > >> and idea  *and* would probably even port automatically if ever
> > >> integer would be increased in R.
> >
> > > We still have the problem that we need to count characters, not bytes,
> > > if we want the default semantics of "until the end of the string".
> >
> > > I think we would have to fix this either by really using
> > > "nchar(type="c"))" or by using e.g. NULL and then treating this as a
> > > special case, that would be probably faster.
> >
> > > Tomas
> >
> > You are right, as always, Tomas.
> > I agree that would be better and we should do it if/when we change
> > the default there.
> >
> > Martin


Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-21 Thread Michael Chirico
Thanks all, great points well taken. Indeed it seems the default of
1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of
encoding makes sense and avoids the overheads of a $ call and a much
heavier nchar() calculation.

Mike C

On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler
 wrote:
>
> > Tomas Kalibera
> > on Mon, 21 Jun 2021 10:08:37 +0200 writes:
>
> > On 6/21/21 9:35 AM, Martin Maechler wrote:
> >>> Michael Chirico
> >>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
> >> > Currently, substring defaults to last=1000000L, which
> >> > strongly suggests the intent is to default to "nchar(x)"
> >> > without having to compute/allocate that up front.
> >>
> >> > Unfortunately, this default makes no sense for "very
> >> > large" strings which may exceed 1000000L in "width".
> >>
> >> Yes;  and I tend to agree with you that this default is outdated
> >> (Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
> >> student lab  Macs in Auckland in ca 1994).
> >>
> >> > The max width of a string is .Machine$integer.max-1:
> >>
> >> (which Brodie showed was only almost true)
> >>
> >> > So it seems to me either .Machine$integer.max or
> >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
> >> > something?
> >>
> >> The "drawback" is of course that .Machine$integer.max  is still
> >> a function call (as R beginners may forget) contrary to L,
> >> but that may even be inlined by the byte compiler (? how would we check ?)
> >> and even if it's not, it does more clearly convey the concept
> >> and idea  *and* would probably even port automatically if ever
> >> integer would be increased in R.
>
> > We still have the problem that we need to count characters, not bytes,
> > if we want the default semantics of "until the end of the string".
>
> > I think we would have to fix this either by really using
> > "nchar(type="c"))" or by using e.g. NULL and then treating this as a
> > special case, that would be probably faster.
>
> > Tomas
>
> You are right, as always, Tomas.
> I agree that would be better and we should do it if/when we change
> the default there.
>
> Martin



Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-21 Thread Martin Maechler
> Tomas Kalibera 
> on Mon, 21 Jun 2021 10:08:37 +0200 writes:

> On 6/21/21 9:35 AM, Martin Maechler wrote:
>>> Michael Chirico
>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
>> > Currently, substring defaults to last=1000000L, which
>> > strongly suggests the intent is to default to "nchar(x)"
>> > without having to compute/allocate that up front.
>>
>> > Unfortunately, this default makes no sense for "very
>> > large" strings which may exceed 1000000L in "width".
>> 
>> Yes;  and I tend to agree with you that this default is outdated
>> (Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
>> student lab  Macs in Auckland in ca 1994).
>> 
>> > The max width of a string is .Machine$integer.max-1:
>> 
>> (which Brodie showed was only almost true)
>> 
>> > So it seems to me either .Machine$integer.max or
>> > .Machine$integer.max-1L would be a more sensible default. Am I missing
>> > something?
>> 
>> The "drawback" is of course that .Machine$integer.max  is still
>> a function call (as R beginners may forget) contrary to L,
>> but that may even be inlined by the byte compiler (? how would we check ?)
>> and even if it's not, it does more clearly convey the concept
>> and idea  *and* would probably even port automatically if ever
>> integer would be increased in R.

> We still have the problem that we need to count characters, not bytes, 
> if we want the default semantics of "until the end of the string".

> I think we would have to fix this either by really using 
> "nchar(type="c"))" or by using e.g. NULL and then treating this as a 
> special case, that would be probably faster.

> Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change
the default there.

Martin



Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-21 Thread Tomas Kalibera



On 6/21/21 9:35 AM, Martin Maechler wrote:

Michael Chirico
 on Sun, 20 Jun 2021 15:20:26 -0700 writes:

 > Currently, substring defaults to last=1000000L, which
 > strongly suggests the intent is to default to "nchar(x)"
 > without having to compute/allocate that up front.

 > Unfortunately, this default makes no sense for "very
 > large" strings which may exceed 1000000L in "width".

Yes;  and I tend to agree with you that this default is outdated
(Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
  student lab  Macs in Auckland in ca 1994).

 > The max width of a string is .Machine$integer.max-1:

   (which Brodie showed was only almost true)

 > So it seems to me either .Machine$integer.max or
 > .Machine$integer.max-1L would be a more sensible default. Am I missing
 > something?

The "drawback" is of course that .Machine$integer.max  is still
a function call (as R beginners may forget) contrary to L,
but that may even be inlined by the byte compiler (? how would we check ?)
and even if it's not, it does more clearly convey the concept
and idea  *and* would probably even port automatically if ever
integer would be increased in R.


We still have the problem that we need to count characters, not bytes, 
if we want the default semantics of "until the end of the string".


I think we would have to fix this either by really using 
"nchar(type="c"))" or by using e.g. NULL and then treating this as a 
special case, that would be probably faster.
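
To illustrate why bytes and characters differ, here is a minimal C sketch that counts code points in a string assumed to be valid UTF-8 (utf8_char_count is a hypothetical helper; R's own code has to deal with several encodings and with invalid input, which this ignores):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated string that is assumed to be
   valid UTF-8, by skipping continuation bytes (bit pattern 10xxxxxx). */
static size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;               /* lead byte or plain ASCII byte */
    }
    return n;
}
```

For the UTF-8 bytes of "héllo", strlen() reports 6 while this count is 5. Finding the count means scanning the whole string, which is the cost a special-cased "until the end" default would avoid.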


Tomas



Martin






Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-21 Thread Martin Maechler
> Michael Chirico 
> on Sun, 20 Jun 2021 15:20:26 -0700 writes:

> Currently, substring defaults to last=1000000L, which
> strongly suggests the intent is to default to "nchar(x)"
> without having to compute/allocate that up front.

> Unfortunately, this default makes no sense for "very
> large" strings which may exceed 1000000L in "width".

Yes;  and I tend to agree with you that this default is outdated
(Remember :  R was written to work and run on 2 (or 4?) MB of RAM on the
 student lab  Macs in Auckland in ca 1994).

> The max width of a string is .Machine$integer.max-1:

  (which Brodie showed was only almost true)

> So it seems to me either .Machine$integer.max or
> .Machine$integer.max-1L would be a more sensible default. Am I missing
> something?

The "drawback" is of course that .Machine$integer.max  is still
a function call (as R beginners may forget) contrary to L,
but that may even be inlined by the byte compiler (? how would we check ?)
and even if it's not, it does more clearly convey the concept
and idea  *and* would probably even port automatically if ever
integer would be increased in R.

Martin



Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel
> On Sunday, June 20, 2021, 9:29:28 PM EDT, brodie gaslam via R-devel 
>  wrote:
>
>> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>>  wrote:
>>
>> The max width of a string is .Machine$integer.max-1:
>
> I think the max width is .Machine$integer.max.  What happened below is a
> bug due to buffer overflow in `strrep`:

Sorry, integer overflow.

>> # works
>> x = strrep(" ", .Machine$integer.max-1L)
>> # fails
>> x = strrep(" ", .Machine$integer.max)
>> Error in strrep(" ", .Machine$integer.max) :
>>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)
>> (see also the comment in src/main/character.c: "Character strings in R
>> are less than 2^31-1 bytes, so we use int not size_t.")
>
> FWIW WRE states:
>
>> Note that R character strings are restricted to 2^31 - 1 bytes
>
> This is INT_MAX or .Machine$integer.max, at least on machines for which
> `int` is 32 bits, which I think typical for machines R builds on.   From
> having looked at the code a while ago I think WRE is right (so maybe the
> comment in the code is wrong), but it was a while ago and I haven't tried
> to allocate an INT_MAX long string.

So I tried it on a machine with more memory, and it works:

    > x <- strrep(" ", .Machine$integer.max-1L)
    > x <- paste0(x, " ")
    > nchar(x)
    [1] 2147483647
    > nchar(x) == .Machine$integer.max
    [1] TRUE

B.



Re: [Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread brodie gaslam via R-devel


> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico 
>  wrote:
>
> Currently, substring defaults to last=1000000L, which strongly
> suggests the intent is to default to "nchar(x)" without having to
> compute/allocate that up front.
>
> Unfortunately, this default makes no sense for "very large" strings
> which may exceed 1000000L in "width".
>
> The max width of a string is .Machine$integer.max-1:

I think the max width is .Machine$integer.max.  What happened below is a
bug due to buffer overflow in `strrep`:

> # works
> x = strrep(" ", .Machine$integer.max-1L)
> # fails
> x = strrep(" ", .Machine$integer.max)
> Error in strrep(" ", .Machine$integer.max) :
>   'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

Notice the very large number that Calloc was asked to allocate.  That's
(size_t) INT_MIN, i.e. 2^64 - 2^31.

The problem is (src/include/R_ext/RS.h@85):

    #define CallocCharBuf(n) (char *) R_chk_calloc((R_SIZE_T) ((n)+1), sizeof(char))

The `((n) + 1)` overflows `int` and wraps to INT_MIN (well, undefined
behavior so who knows), which when cast to size_t produces that very large
number which can't be allocated.

I think this should be:

    #define CallocCharBuf(n) (char *) R_chk_calloc(((R_SIZE_T)(n))+1, sizeof(char))

I can reproduce the failure before the change.  After the change I get:

    > x = strrep(" ", .Machine$integer.max)
    Error in strrep(" ", .Machine$integer.max) :
      'Calloc' could not allocate memory (2147483648 of 1 bytes)

I believe this to be the expected result on a machine that doesn't have
enough memory to allocate INT_MAX + 1 bytes, as happens to be the case on
my R build system (it's a VM that gets 2GB total as the host machine can
barely spare that to begin with).
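
The difference between the two macro forms can be sketched in standalone C. The function names below are made up for illustration and this is not R's code. Because signed overflow is undefined behaviour, the wrap is emulated with unsigned arithmetic, which assumes the usual two's-complement wrapping when converting back to int:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Models "(size_t)((n) + 1)": the addition happens in int first.
   The wrap is emulated with unsigned arithmetic; converting the result
   back to int is implementation-defined before C23 and wraps on
   common platforms. */
static size_t size_add_then_cast(int n)
{
    int wrapped = (int)((unsigned int)n + 1u);
    return (size_t)wrapped;
}

/* Models "((size_t)(n)) + 1": widen first, then add. */
static size_t size_cast_then_add(int n)
{
    assert(n >= 0);
    return (size_t)n + 1;
}
```

On a platform with 32-bit int and 64-bit size_t, the first form gives 18446744071562067968 for n = INT_MAX and the second gives 2147483648, matching the two error messages above.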

> (see also the comment in src/main/character.c: "Character strings in R
> are less than 2^31-1 bytes, so we use int not size_t.")

FWIW WRE states:

> Note that R character strings are restricted to 2^31 - 1 bytes

This is INT_MAX or .Machine$integer.max, at least on machines for which
`int` is 32 bits, which I think typical for machines R builds on.   From
having looked at the code a while ago I think WRE is right (so maybe the
comment in the code is wrong), but it was a while ago and I haven't tried
to allocate an INT_MAX long string.

Sorry this doesn't answer your original question.

Best,

Brodie.

>
>
> So it seems to me either .Machine$integer.max or
> .Machine$integer.max-1L would be a more sensible default. Am I missing
> something?
>
> Mike C


[Rd] Should last default to .Machine$integer.max-1 for substring()

2021-06-20 Thread Michael Chirico
Currently, substring defaults to last=1000000L, which strongly
suggests the intent is to default to "nchar(x)" without having to
compute/allocate that up front.

Unfortunately, this default makes no sense for "very large" strings
which may exceed 1000000L in "width".

The max width of a string is .Machine$integer.max-1:

# works
x = strrep(" ", .Machine$integer.max-1L)
# fails
x = strrep(" ", .Machine$integer.max)
Error in strrep(" ", .Machine$integer.max) :
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

(see also the comment in src/main/character.c: "Character strings in R
are less than 2^31-1 bytes, so we use int not size_t.")

So it seems to me either .Machine$integer.max or
.Machine$integer.max-1L would be a more sensible default. Am I missing
something?

Mike C
