Re: [Rd] max on numeric_version with long components

2024-04-27 Thread Kurt Hornik
> Ivan Krylov via R-devel writes:

Indeed, apparently using which.min/which.max on the string encoding is
not good enough.  ? which.min says that x can also be

  an R object for which the internal coercion to ‘double’ works 

and I guess we found a case where it does not work.

I'll look into fixing this, but perhaps we should re-open
 ?

-k

> On Sat, 27 Apr 2024 13:56:58 -0500,
> Jonathan Keane wrote:

>> In devel:
>> > max(numeric_version(c("1.0.1.100000000", "1.0.3.100000000",
>> "1.0.2.100000000")))
>> [1] ‘1.0.1.100000000’
>> > max(numeric_version(c("1.0.1.10000000", "1.0.3.10000000",
>> "1.0.2.10000000")))
>> [1] ‘1.0.3.10000000’

> Thank you Jon for spotting this!

> This is an unintended consequence of
> https://bugs.r-project.org/show_bug.cgi?id=18697.

> The old behaviour of max() was to call
> which.max(xtfrm(x)), which first produced a permutation that sorted the
> entire .encode_numeric_version(x). The new behaviour is to call
> which.max directly on .encode_numeric_version(x), which is faster (only
> O(length(x)) instead of a sort).

> What do the encoded version strings look like?

> x <- numeric_version(c(
>  "1.0.1.100000000", "1.0.3.100000000", "1.0.2.100000000"
> ))
> # Ignore the attributes
> (e <- as.vector(.encode_numeric_version(x)))
> # [1] "000000001000000000000000001575360400"
> # [2] "000000001000000000000000003575360400"
> # [3] "000000001000000000000000002575360400"

> # order(), xtfrm(), sort() all agree that e[2] is the maximum:
> order(e)
> # [1] 1 3 2
> xtfrm(e)
> # [1] 1 3 2
> sort(e)
> # [1] "000000001000000000000000001575360400"
> # [2] "000000001000000000000000002575360400"
> # [3] "000000001000000000000000003575360400"

> # but not which.max:
> which.max(e)
> # [1] 1

> This happens because which.max() converts its argument to double, which
> loses precision:

> (n <- as.numeric(e))
> # [1] 1e+27 1e+27 1e+27
> identical(n[1], n[2])
> # [1] TRUE
> identical(n[3], n[2])
> # [1] TRUE

> Will be curious to know if there is a clever way to keep both the O(N)
> complexity and the full arbitrary precision.

> -- 
> Best regards,
> Ivan

> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] max on numeric_version with long components

2024-04-27 Thread Ivan Krylov via R-devel
On Sat, 27 Apr 2024 13:56:58 -0500,
Jonathan Keane wrote:

> In devel:
> > max(numeric_version(c("1.0.1.100000000", "1.0.3.100000000",
> "1.0.2.100000000")))
> [1] ‘1.0.1.100000000’
> > max(numeric_version(c("1.0.1.10000000", "1.0.3.10000000",
> "1.0.2.10000000")))
> [1] ‘1.0.3.10000000’

Thank you Jon for spotting this!

This is an unintended consequence of
https://bugs.r-project.org/show_bug.cgi?id=18697.

The old behaviour of max() was to call
which.max(xtfrm(x)), which first produced a permutation that sorted the
entire .encode_numeric_version(x). The new behaviour is to call
which.max directly on .encode_numeric_version(x), which is faster (only
O(length(x)) instead of a sort).

What do the encoded version strings look like?

x <- numeric_version(c(
 "1.0.1.100000000", "1.0.3.100000000", "1.0.2.100000000"
))
# Ignore the attributes
(e <- as.vector(.encode_numeric_version(x)))
# [1] "000000001000000000000000001575360400"
# [2] "000000001000000000000000003575360400"
# [3] "000000001000000000000000002575360400"

# order(), xtfrm(), sort() all agree that e[2] is the maximum:
order(e)
# [1] 1 3 2
xtfrm(e)
# [1] 1 3 2
sort(e)
# [1] "000000001000000000000000001575360400"
# [2] "000000001000000000000000002575360400"
# [3] "000000001000000000000000003575360400"

# but not which.max:
which.max(e)
# [1] 1

This happens because which.max() converts its argument to double, which
loses precision:

(n <- as.numeric(e))
# [1] 1e+27 1e+27 1e+27
identical(n[1], n[2])
# [1] TRUE
identical(n[3], n[2])
# [1] TRUE

Will be curious to know if there is a clever way to keep both the O(N)
complexity and the full arbitrary precision.
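
One possible direction, offered only as a sketch and not as the fix adopted in R: when the encoded strings are zero-padded digit strings of equal width, comparing them as strings keeps the version order without any conversion to double. The helper name which_max_chr is hypothetical, and the strings below are typed in by hand (36 digits each) rather than produced by .encode_numeric_version().

```r
# Hand-written, equal-width, zero-padded digit strings (an assumption of
# this sketch; they are not generated by .encode_numeric_version() here).
e <- c("000000001000000000000000001575360400",
       "000000001000000000000000003575360400",
       "000000001000000000000000002575360400")

# A double keeps only about 15-17 significant decimal digits, so the
# three 36-digit values collapse to a single number.
n <- as.numeric(e)
length(unique(n))

# Hypothetical helper: one pass to find the largest string, then one
# lookup for its position. No sorting and no coercion to double.
which_max_chr <- function(s) {
  stopifnot(is.character(s), length(unique(nchar(s))) == 1L)
  match(max(s), s)
}

which_max_chr(e)
```

The equal-width check matters: string comparison only matches numeric order when every string has the same number of digits.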

-- 
Best regards,
Ivan



[Rd] max on numeric_version with long components

2024-04-27 Thread Jonathan Keane
I've noticed something in R devel which seems a little off and not the
behavior I see in 4.4.0 or earlier versions. With numeric_versions that
have long (>8 digit) final components max and min return the first element
and not the max or min:

In devel:
> max(numeric_version(c("1.0.1.100000000", "1.0.3.100000000",
"1.0.2.100000000")))
[1] ‘1.0.1.100000000’
> max(numeric_version(c("1.0.1.10000000", "1.0.3.10000000",
"1.0.2.10000000")))
[1] ‘1.0.3.10000000’

In 4.4.0:
> max(numeric_version(c("1.0.1.100000000", "1.0.3.100000000",
"1.0.2.100000000")))
[1] ‘1.0.3.100000000’
> max(numeric_version(c("1.0.1.10000000", "1.0.3.10000000",
"1.0.2.10000000")))
[1] ‘1.0.3.10000000’

Is this expected? I've looked in NEWS to see but didn't see anything
referencing this. Happy to submit an issue to bug tracker.
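
A workaround sketch for affected builds, assuming that order() on a numeric_version still goes through xtfrm() as the replies in this thread describe. The variable names are illustrative, and the versions use 9-digit final components to match the ">8 digit" case.

```r
v <- numeric_version(c("1.0.1.100000000", "1.0.3.100000000",
                       "1.0.2.100000000"))

# Sort-based selection: order() on a classed object uses xtfrm(), so it
# does not depend on which.max() or which.min().
idx <- order(v)
smallest <- v[idx[1L]]
largest  <- v[idx[length(idx)]]

print(largest)
print(smallest)
```

This costs a full sort instead of a single pass, which is the trade-off the change to max() was meant to avoid.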

-Jon



Re: [Rd] read.csv

2024-04-27 Thread Kevin Coombes
I was horrified when I saw John Weinstein's article about Excel turning
gene names into dates. Mainly because I had been complaining about that
phenomenon for years, and it never remotely occurred to me that you could
get a publication out of it.

I eventually rectified the situation by publishing "Blasted Cell Line
Names", describing how to match different researchers' recording of the
names of cell lines, by applying techniques for DNA or protein sequence
alignment.

Best,
   Kevin

On Tue, Apr 16, 2024, 4:51 PM Reed A. Cartwright 
wrote:

> Gene names being misinterpreted by spreadsheet software (read.csv is
> no different) is a classic issue in bioinformatics. It seems like
> every practitioner ends up encountering this issue in due time. E.g.
>
> https://pubmed.ncbi.nlm.nih.gov/15214961/
>
> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
>
> https://www.nature.com/articles/d41586-021-02211-4
>
>
> https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
>
>
> On Tue, Apr 16, 2024 at 3:46 AM jing hua zhao 
> wrote:
> >
> > Dear R-developers,
> >
> > I came across a somewhat unexpected behaviour of read.csv() which is
> > trivial but worth noting -- my data involves a protein named "1433E", but
> > to save space I dropped the quotes, so it becomes,
> >
> > Gene,SNP,prot,log10p
> > YWHAE,13:62129097_C_T,1433E,7.35
> > YWHAE,4:72617557_T_TA,1433E,7.73
> >
> > Both read.csv() and readr::read_csv() treat the prot(ein) name as the
> > numeric 1433 (possibly confused by scientific notation), which only came
> > to my attention when I tried to combine data,
> >
> > all_data <- data.frame()
> > for (protein in proteins[1:7])
> > {
> >   cat(protein, ":\n")
> >   f <- paste0(protein, ".csv")
> >   if (file.exists(f))
> >   {
> >     p <- read.csv(f)
> >     print(p)
> >     if (nrow(p) > 0) all_data <- dplyr::bind_rows(all_data, p)
> >   }
> > }
> >
> > proteins[1:7]
> > [1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"
> >
> > dplyr::bind_rows() failed due to incompatible column types, whereas
> > rbind() went ahead without warnings.
> >
> > Best wishes,
> >
> >
> > Jing Hua
> >
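
A minimal sketch of one way to sidestep the type guessing described in the quoted message: declare the column's class up front with the colClasses argument of read.csv(). The inline csv_text stands in for the files in the quoted loop and reuses the sample rows given there.

```r
# Sample rows from the quoted message, supplied inline instead of a file.
csv_text <- "Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73"

# A named colClasses entry fixes the type of 'prot' only; the other
# columns are still guessed as usual.
p <- read.csv(text = csv_text, colClasses = c(prot = "character"))

str(p)
p$prot
```

With the column read as character in every file, the pieces have a consistent type when they are combined later.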



Re: [Rd] Should c(..., recursive = TRUE) and unlist(x, recursive = TRUE) recurse into expression vectors?

2024-04-27 Thread Mikael Jagan




On 2024-04-27 10:53 am, Mikael Jagan wrote:

> Reading the body of function 'AnswerType' in bind.c, called from 'do_c'
> and 'do_unlist', I notice that EXPRSXP and VECSXP are handled identically
> in the  recurse = TRUE  case.
>
> A corollary is that  c(recursive = TRUE)  and  unlist(recursive = TRUE)
> treat expression vectors like  expression(a, b)  as lists of symbols and
> calls.  And since they treat symbols and calls as lists of length 1, we
> see:
>
> > x <- expression(a, b); y <- expression(c, d)
> > c(x, y)
> expression(a, b, c, d)
> > c(x, y, recursive = TRUE)
> [[1]]
> a
>
> [[2]]
> b
>
> [[3]]
> c
>
> [[4]]
> d
>
> My expectation based on the documentation in help("c") and help("unlist")
> is that those functions would recurse into lists and pairlists, but _not_
> into expression vectors.
>
> recursive: logical.  If 'recursive = TRUE', the function recursively
>   descends through lists (and pairlists) combining all their
>   elements into a vector.
>
> recursive: logical.  Should unlisting be applied to list components of
>   'x'?
>
> My feeling is that either:
>
> (1) the behaviour should change, so that both calls to 'c' above give
> the result of type "expression".
> (2) the documentation should change to say that expression vectors are
> handled as lists in the recursive case.
>
> Option (2) won't break anything but is a bit awkward because it means
> that a type "higher" in the documented hierarchy (... < list < expression)
> is coerced to a lower type.



Er - this last comment about Option (2) being awkward can be ignored.  The
expression vector is not itself coerced to a list.  Rather, its non-vector
components are treated as lists of length 1.  And that's well-documented.

If anything, Option (1) is awkward as it would treat two types of generic
vectors, list and expression, asymmetrically ...

I can submit a patch implementing Option (2) in a few days to allow for
comments if any.
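
A small sketch of the "lists of length 1" point, using only base R. The object names are illustrative. Combining a symbol or a call with an atomic value through c() gives a list in which the language object occupies a single element.

```r
s <- quote(a)        # a symbol: not a vector type
f <- quote(g(x, y))  # a call: also not a vector type

r1 <- c(s, 1)
r2 <- c(f, 1)

typeof(r1)   # each result is a generic vector (list)
length(r1)   # the symbol counts as one element
length(r2)   # the call counts as one element, whatever its arguments
```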

Mikael


> I'll add here that, confusingly, help("expression") says: "an object of
> mode 'expression' is a list".  I understand the author's intent (lists and
> expression vectors differ only in the 'type' field of the SEXP header) but
> I wonder if substituting "list" with "generic vector" there would cause
> less confusion ... ?
>
> Mikael




[Rd] Should c(..., recursive = TRUE) and unlist(x, recursive = TRUE) recurse into expression vectors?

2024-04-27 Thread Mikael Jagan

Reading the body of function 'AnswerType' in bind.c, called from 'do_c'
and 'do_unlist', I notice that EXPRSXP and VECSXP are handled identically
in the  recurse = TRUE  case.

A corollary is that  c(recursive = TRUE)  and  unlist(recursive = TRUE)
treat expression vectors like  expression(a, b)  as lists of symbols and
calls.  And since they treat symbols and calls as lists of length 1, we
see:

> x <- expression(a, b); y <- expression(c, d)
> c(x, y)
expression(a, b, c, d)
> c(x, y, recursive = TRUE)
[[1]]
a

[[2]]
b

[[3]]
c

[[4]]
d

My expectation based on the documentation in help("c") and help("unlist")
is that those functions would recurse into lists and pairlists, but _not_
into expression vectors.

recursive: logical.  If 'recursive = TRUE', the function recursively
  descends through lists (and pairlists) combining all their
  elements into a vector.

recursive: logical.  Should unlisting be applied to list components of
  'x'?

My feeling is that either:

(1) the behaviour should change, so that both calls to 'c' above give
the result of type "expression".
(2) the documentation should change to say that expression vectors are
handled as lists in the recursive case.

Option (2) won't break anything but is a bit awkward because it means
that a type "higher" in the documented hierarchy (... < list < expression)
is coerced to a lower type.
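
To make the two options concrete, here is a short sketch that inspects the result types of the calls shown above. It assumes the behaviour reported in this message; the result of the recursive call may differ in R versions where that behaviour has changed.

```r
x <- expression(a, b)
y <- expression(c, d)

plain <- c(x, y)
rec   <- c(x, y, recursive = TRUE)

typeof(plain)            # stays an expression vector
typeof(rec)              # reported above to print as a list of symbols
length(rec)              # one element per symbol
vapply(rec, is.symbol, logical(1))
```

Under Option (1) the second typeof() call would return "expression"; under Option (2) the code is unchanged and only the help pages are updated.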

I'll add here that, confusingly, help("expression") says: "an object of
mode 'expression' is a list".  I understand the author's intent (lists and
expression vectors differ only in the 'type' field of the SEXP header) but
I wonder if substituting "list" with "generic vector" there would cause
less confusion ... ?

Mikael
