Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Richard O'Keefe
This seems unnecessarily complex.  Or rather,
it pushes the complexity into an arcane notation.
What we really want is something that says "here is a string,
here is a pattern, give me all the substrings that match."
What we're given is a function that tells us where those
substrings are.

# greg.matches(pattern, text)
# accepts a POSIX regular expression, pattern
# and a text to search in.  Both arguments must be character strings
# (length(...) = 1) not longer vectors of strings.
# It returns a character vector of all the (non-overlapping)
# substrings of text as determined by gregexpr.

greg.matches <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text) > 1) stop("text has too many elements")
    match.info <- gregexpr(pattern, text)
    starts <- match.info[[1]]
    # gregexpr reports "no match" as a single start of -1
    if (starts[1] == -1) return(character(0))
    stops <- attr(starts, "match.length") - 1 + starts
    sapply(seq(along=starts), function (i) {
        substr(text, starts[i], stops[i])
    })
}
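
For comparison, base R's regmatches() can be paired with gregexpr() to return the matching substrings directly. The following is only a minimal sketch; the name greg.matches2 is an assumption made for this illustration and does not come from the post.

```r
# Sketch: same contract as greg.matches, built on base R's regmatches().
# greg.matches2 is a hypothetical name used only for this illustration.
greg.matches2 <- function (pattern, text) {
    if (length(pattern) > 1) stop("pattern has too many elements")
    if (length(text) > 1) stop("text has too many elements")
    # regmatches() returns a list with one element per input string
    regmatches(text, gregexpr(pattern, text))[[1]]
}

greg.matches2("[A-Z][a-z]*[0-9]*", "CCl3F")
# [1] "C"   "Cl3" "F"
```

When nothing matches, this version returns character(0).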

Given greg.matches, we can do the rest with very simple
and easily comprehended regular expressions.

# parse.chemical(formula)
# takes a simple chemical formula "..." and
# returns a list with components
# $elements -- character -- the atom symbols
# $counts   -- number    -- the counts (missing counts taken as 1).
# BEWARE.  This does not handle formulas like "CH(OH)3".

parse.chemical <- function (formula) {
    parts <- greg.matches("[A-Z][a-z]*[0-9]*", formula)
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts <- ifelse(is.na(counts), 1, counts)
    list(elements=elements, counts=counts)
}

> parse.chemical("CCl3F")
$elements
[1] "C"  "Cl" "F"

$counts
[1] 1 3 1

> parse.chemical("Li4Al4H16")
$elements
[1] "Li" "Al" "H"

$counts
[1]  4  4 16

> parse.chemical("CCl2CO2AlPO4SiO4Cl")
$elements
 [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

$counts
 [1] 1 2 1 2 1 1 4 1 4 1
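
As the last example shows, an element can occur more than once in a formula. If per-element totals are wanted, the counts can be summed by symbol. This is a self-contained sketch under that assumption; element.totals is a hypothetical helper, not part of the post, and it shares the limitation about formulas like "CH(OH)3".

```r
# Hypothetical helper (not from the thread): total atom count per element.
element.totals <- function (formula) {
    parts <- regmatches(formula, gregexpr("[A-Z][a-z]*[0-9]*", formula))[[1]]
    elements <- gsub("[0-9]+", "", parts)
    counts <- as.numeric(gsub("[^0-9]+", "", parts))
    counts[is.na(counts)] <- 1          # missing count means one atom
    # split() groups the counts by symbol; vapply() sums each group
    vapply(split(counts, elements), sum, numeric(1))
}

totals <- element.totals("CCl2CO2AlPO4SiO4Cl")
totals[["O"]]
# [1] 10
totals[["Cl"]]
# [1] 3
```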


On Thu, 19 Oct 2023 at 03:59, Leonard Mada via R-help wrote:

> Dear List members,
>
> What is the best way to test for numeric digits?
>
> suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are there
> any better ways?
>
> I was working to extract chemical elements from a formula, something
> like this:
> split.symbol.character = function(x, rm.digits = TRUE) {
>  # Perl is partly broken in R 4.3, but this works:
>  regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>  # stringi::stri_split(x, regex = regex);
>  s = strsplit(x, regex, perl = TRUE);
>  if(rm.digits) {
>  s = lapply(s, function(s) {
>  isNotD = is.na(suppressWarnings(as.numeric(s)));
>  s = s[isNotD];
>  });
>  }
>  return(s);
> }
>
> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>
>
> Sincerely,
>
>
> Leonard
>
>
> Note:
> # works:
> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
>
> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Fwd: r-stats: Geometric Distribution

2023-10-18 Thread Jim Lemon
Please delete drjimle...@gmail.com from your mailing lists. He passed away
a month ago.
Regards,
Juel
Wife

On Tue, 17 Oct 2023, 22:58 Sahil Sharma wrote:

> -- Forwarded message -
> From: Sahil Sharma 
> Date: Tue, Oct 17, 2023 at 12:10 PM
> Subject: r-stats: Geometric Distribution
> To: 
>
>
> Hey, I want to raise one issue in the r-stats geometric distribution
> function.
>
> I have found that dgeom(x, p), which denotes the probability density
> function of the geometric distribution, is not subtracting 1 from x.
>
> The original formula for the Geometric Distribution PDF is ((1-p)^(x-1))*p.
> However, the current R function dgeom(x, p) is doing this: ((1-p)^x)*p;
> it is not subtracting 1 from x.
>
> I don't know whether this it is kept as it is intentionally, but I thought
> of just informing you, in case it's an error, so you can correct it.
>
> Thanks.
>


Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Jim Lemon
Please delete drjimle...@bitwrit.com from your mailing list. He passed away
a month ago.
Regards,
Juel (wife)

On Thu, 19 Oct 2023, 02:09 Ben Bolker wrote:

> There are some answers on Stack Overflow:
>
>
> https://stackoverflow.com/questions/14984989/how-to-avoid-warning-when-introducing-nas-by-coercion
>
>
>
> On 2023-10-18 10:59 a.m., Leonard Mada via R-help wrote:
> > Dear List members,
> >
> > What is the best way to test for numeric digits?
> >
> > suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
> > # [1] NA NA NA  2 NA NA  3
> > The above requires the use of the suppressWarnings function. Are there
> > any better ways?
> >
> > I was working to extract chemical elements from a formula, something
> > like this:
> > split.symbol.character = function(x, rm.digits = TRUE) {
> >  # Perl is partly broken in R 4.3, but this works:
> >  regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> >  # stringi::stri_split(x, regex = regex);
> >  s = strsplit(x, regex, perl = TRUE);
> >  if(rm.digits) {
> >  s = lapply(s, function(s) {
> >  isNotD = is.na(suppressWarnings(as.numeric(s)));
> >  s = s[isNotD];
> >  });
> >  }
> >  return(s);
> > }
> >
> > split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
> >
> >
> > Sincerely,
> >
> >
> > Leonard
> >
> >
> > Note:
> > # works:
> > regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> > strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
> >
> >
> > # broken in R 4.3.1
> > # only slightly "erroneous" with stringi::stri_split
> > regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> > strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
> >


Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Ivan Krylov
The matching approach is also competitive:

match.symbol2 <- function(x, rm.digits = TRUE) {
 if (rm.digits) stringi::stri_extract_all(x, regex = '[A-Z][a-z]*') else
 lapply(
  stringi::stri_match_all(x, regex = '([A-Z][a-z]*)([0-9]*)'), \(m) {
   m <- t(m[,2:3]); m[nzchar(m)]
  }
 )
}
mol5 <- rep(mol, 5)
system.time(split.symbol.character(mol5))
#   user  system elapsed 
#  1.518   0.000   1.518 
system.time(split_chem_elements(mol5))
#   user  system elapsed 
#  0.435   0.000   0.436 
system.time(match.symbol2(mol5))
#   user  system elapsed 
#  0.117   0.000   0.117 

-- 
Best regards,
Ivan



Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Rui Barradas

At 19:35 on 18/10/2023, Leonard Mada wrote:

Dear Rui,

On 10/18/2023 8:45 PM, Rui Barradas wrote:

split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(mol, regex, "#") |>
  strsplit("#|[[:digit:]]") |>
  lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}


You have a glitch (mol is hardcoded) in the code of the first function. 
The times are similar, after correcting for that glitch.


Note:
- grep("[[:digit:]]", ...) behaves almost twice as slow as grep("[0-9]", 
...)!

- corrected results below;

Sincerely,

Leonard
###

split_chem_elements <- function(x, rm.digits = TRUE) {
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   if(rm.digits) {
     stringr::str_replace_all(x, regex, "#") |>
   strsplit("#|[[:digit:]]") |>
   lapply(\(x) x[nchar(x) > 0L])
   } else {
     strsplit(x, regex, perl = TRUE)
   }
}

split.symbol.character = function(x, rm.digits = TRUE) {
   # Perl is partly broken in R 4.3, but this works:
   regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
   s <- strsplit(x, regex, perl = TRUE)
   if(rm.digits) {
     s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
   }
   s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol1 <- rep(mol, 1)

system.time(
   split_chem_elements(mol1)
)
#   user  system elapsed
#   0.58    0.00    0.58

system.time(
   split.symbol.character(mol1)
)
#   user  system elapsed
#   0.67    0.00    0.67


Hello,

You are right, sorry for the blunder :(.
In the code below I have replaced stringr::str_replace_all by the 
package stringi function stri_replace_all_regex and the improvement is 
significant.



split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    # stri_replace_all_regex(str, pattern, replacement): pattern comes before replacement
    stringi::stri_replace_all_regex(x, regex, "#") |>
      strsplit("#|[0-9]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

# system.time(
#   split_chem_elements(mol1)
# )
#  user  system elapsed
#  0.06    0.00    0.09
# system.time(
#   split.symbol.character(mol1)
# )
#  user  system elapsed
#  0.25    0.00    0.28



Hope this helps,

Rui Barradas






Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Leonard Mada via R-help

Dear Rui,

On 10/18/2023 8:45 PM, Rui Barradas wrote:

split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(mol, regex, "#") |>
  strsplit("#|[[:digit:]]") |>
  lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}


You have a glitch (mol is hardcoded) in the code of the first function. 
The times are similar, after correcting for that glitch.


Note:
- grep("[[:digit:]]", ...) behaves almost twice as slow as grep("[0-9]", 
...)!

- corrected results below;

Sincerely,

Leonard
###

split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(x, regex, "#") |>
  strsplit("#|[[:digit:]]") |>
  lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[0-9]", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
mol1 <- rep(mol, 1)

system.time(
  split_chem_elements(mol1)
)
#   user  system elapsed
#   0.58    0.00    0.58

system.time(
  split.symbol.character(mol1)
)
#   user  system elapsed
#   0.67    0.00    0.67
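
The note above that grep("[[:digit:]]", ...) runs slower than grep("[0-9]", ...) can be checked with a small timing sketch like the one below. The token vector and the repetition count are assumptions chosen for illustration, and timings will vary by machine and R version.

```r
# Sketch: compare the two digit classes on a synthetic token vector.
tokens <- rep(c("C", "Cl", "3", "F", "Li", "4", "Al", "16"), 1e4)

t.posix <- system.time(for (i in 1:20) grep("[[:digit:]]", tokens))
t.range <- system.time(for (i in 1:20) grep("[0-9]", tokens))

# Elapsed seconds for each variant (no fixed expected values).
print(c(posix.class = t.posix[["elapsed"]], range.class = t.range[["elapsed"]]))

# On plain ASCII input both patterns select the same tokens.
same.result <- identical(grep("[[:digit:]]", tokens), grep("[0-9]", tokens))
same.result
# [1] TRUE
```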



Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Rui Barradas

At 17:24 on 18/10/2023, Leonard Mada wrote:

Dear Rui,

Thank you for your reply.

I do have actually access to the chemical symbols: I have started to 
refactor and enhance the Rpdb package, see Rpdb::elements:

https://github.com/discoleo/Rpdb

However, the regex that you have constructed is quite heavy, as it needs 
to iterate through all chemical symbols (in decreasing nchar). Elements 
like C, and especially O, P or S, appear late in the regex expression - 
but are quite common in chemistry.


The alternative regex is (in this respect) simpler. It actually works 
(once you know about the workaround).


Q: My question focused if there is anything like is.numeric, but to 
parse each element of a vector.


Sincerely,


Leonard


On 10/18/2023 6:53 PM, Rui Barradas wrote:

At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there
any better ways?

I was working to extract chemical elements from a formula, something
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
      # Perl is partly broken in R 4.3, but this works:
      regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
      # stringi::stri_split(x, regex = regex);
      s = strsplit(x, regex, perl = TRUE);
      if(rm.digits) {
      s = lapply(s, function(s) {
          isNotD = is.na(suppressWarnings(as.numeric(s)));
          s = s[isNotD];
      });
      }
      return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


Hello,

If you want to extract chemical elements symbols, the following might 
work.

It uses the periodic table in GitHub package chemr and a package stringr
function.


devtools::install_github("paleolimbot/chemr")



split_chem_elements <- function(x) {
    data(pt, package = "chemr", envir = environment())
    el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
    pat <- paste(el, collapse = "|")
    stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"


It is also possible to rewrite the function without calls to non base
packages but that will take some more work.

Hope this helps,

Rui Barradas



Hello,

You and Avi are right, my function's performance is terrible. The 
following is much faster.


As for how to not have digits throw warnings, the lapply in the version 
of your function below solves it by setting grep argument invert = TRUE. 
This will get all strings where digits do not occur.
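
A minimal sketch of that idea on a single token vector (the vector itself is made up for illustration):

```r
# Tokens as they might come out of the split step.
tokens <- c("Li", "4", "Al", "4", "H", "16")

# invert = TRUE returns the indices of elements that do NOT match.
keep <- grep("[0-9]", tokens, invert = TRUE)
keep
# [1] 1 3 5
tokens[keep]
# [1] "Li" "Al" "H"
```

No coercion to numeric takes place, so there is no warning to suppress.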




split_chem_elements <- function(x, rm.digits = TRUE) {
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  if(rm.digits) {
    stringr::str_replace_all(mol, regex, "#") |>
      strsplit("#|[[:digit:]]") |>
      lapply(\(x) x[nchar(x) > 0L])
  } else {
    strsplit(x, regex, perl = TRUE)
  }
}

split.symbol.character = function(x, rm.digits = TRUE) {
  # Perl is partly broken in R 4.3, but this works:
  regex <- "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])"
  s <- strsplit(x, regex, perl = TRUE)
  if(rm.digits) {
    s <- lapply(s, \(x) x[grep("[[:digit:]]+", x, invert = TRUE)])
  }
  s
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"
split.symbol.character(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"

mol1 <- rep(mol, 1)

system.time(
  split_chem_elements(mol1)
)
#>   user  system elapsed
#>   0.01    0.00

Re: [R] Best way to test for numeric digits?

2023-10-18 Thread avi.e.gross
Rui,

The problem with searching for elements, as with many kinds of text, is that 
the optimal search order may depend on the probabilities of what is involved. 
There can be more elements added such as Unobtainium in the future with 
whatever abbreviations that may then change the algorithm you may have chosen 
but then again, who actually looks for elements with a negligible half-life?

If you had an application focused on Organic Chemistry, a relatively few of the 
elements would normally be present while for something like electronics 
components of some kind, a different overlapping palette with probabilities can 
be found.

Just how important is the efficiency for you? If this was in a language like 
python, I would consider using a dictionary or set and I think there are 
packages in R that support a version of this.  In your case, one solution can 
be to pre-create a dictionary of all the elements, or just a set, and take your 
word tokens and check if they are in the dictionary/set or not. Any that aren't 
can then be further examined as needed and if your data is set a specific way, 
they may all just end up to be numeric. The cost is the hashing and of course 
memory used. Your corpus of elements is small enough that this may not be as 
helpful as parsing text that can contain many thousands of words.

Even in plain R, you can probably also use something like:

elements = c("H", "He", "Li", ...)
if (text %in% elements) ...

Something like the above may not be faster, but it can be quite a bit more
readable than the regular expressions.
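
A runnable sketch of the %in% idea follows. The symbol table is deliberately a tiny subset chosen for illustration (an assumption, not a complete periodic table), and the token vector is made up.

```r
# Assumption: a small hand-written subset of element symbols.
elements <- c("H", "Li", "C", "O", "F", "Al", "Si", "P", "Cl")

# Tokens as they might come out of a splitting step.
tokens <- c("Li", "4", "Al", "4", "H", "16")

# %in% is vectorised: one TRUE/FALSE per token.
is.element <- tokens %in% elements
tokens[is.element]
# [1] "Li" "Al" "H"
tokens[!is.element]
# [1] "4"  "4"  "16"
```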

But plenty of the solutions others offered may well be great for your current 
need.

Some may even work with Handwavium.

-----Original Message-----
From: R-help On Behalf Of Leonard Mada via R-help
Sent: Wednesday, October 18, 2023 12:24 PM
To: Rui Barradas; R-help Mailing List
Subject: Re: [R] Best way to test for numeric digits?

Dear Rui,

Thank you for your reply.

I do have actually access to the chemical symbols: I have started to 
refactor and enhance the Rpdb package, see Rpdb::elements:
https://github.com/discoleo/Rpdb

However, the regex that you have constructed is quite heavy, as it needs 
to iterate through all chemical symbols (in decreasing nchar). Elements 
like C, and especially O, P or S, appear late in the regex expression - 
but are quite common in chemistry.

The alternative regex is (in this respect) simpler. It actually works 
(once you know about the workaround).

Q: My question focused if there is anything like is.numeric, but to 
parse each element of a vector.

Sincerely,


Leonard


On 10/18/2023 6:53 PM, Rui Barradas wrote:
> At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:
>> Dear List members,
>>
>> What is the best way to test for numeric digits?
>>
>> suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
>> # [1] NA NA NA  2 NA NA  3
>> The above requires the use of the suppressWarnings function. Are there
>> any better ways?
>>
>> I was working to extract chemical elements from a formula, something
>> like this:
>> split.symbol.character = function(x, rm.digits = TRUE) {
>>   # Perl is partly broken in R 4.3, but this works:
>>   regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>>   # stringi::stri_split(x, regex = regex);
>>   s = strsplit(x, regex, perl = TRUE);
>>   if(rm.digits) {
>>   s = lapply(s, function(s) {
>>   isNotD = is.na(suppressWarnings(as.numeric(s)));
>>   s = s[isNotD];
>>   });
>>   }
>>   return(s);
>> }
>>
>> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))
>>
>>
>> Sincerely,
>>
>>
>> Leonard
>>
>>
>> Note:
>> # works:
>> regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>
>>
>> # broken in R 4.3.1
>> # only slightly "erroneous" with stringi::stri_split
>> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
>> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)
>>
> Hello,
>
> If you want to extract chemical elements symbols, the following might work.
> It uses the periodic table in GitHub package chemr and a package stringr
> function.
>
>
> devtools::install_github("paleolimbot/chemr")
>
>
>
> split_chem_elements <- 

Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Leonard Mada via R-help

Dear Rui,

Thank you for your reply.

I do have actually access to the chemical symbols: I have started to 
refactor and enhance the Rpdb package, see Rpdb::elements:

https://github.com/discoleo/Rpdb

However, the regex that you have constructed is quite heavy, as it needs 
to iterate through all chemical symbols (in decreasing nchar). Elements 
like C, and especially O, P or S, appear late in the regex expression - 
but are quite common in chemistry.


The alternative regex is (in this respect) simpler. It actually works 
(once you know about the workaround).


Q: My question was whether there is anything like is.numeric, but which
checks each element of a vector.


Sincerely,


Leonard


On 10/18/2023 6:53 PM, Rui Barradas wrote:

At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there
any better ways?

I was working to extract chemical elements from a formula, something
like this:
split.symbol.character = function(x, rm.digits = TRUE) {
      # Perl is partly broken in R 4.3, but this works:
      regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
      # stringi::stri_split(x, regex = regex);
      s = strsplit(x, regex, perl = TRUE);
      if(rm.digits) {
      s = lapply(s, function(s) {
          isNotD = is.na(suppressWarnings(as.numeric(s)));
          s = s[isNotD];
      });
      }
      return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


Hello,

If you want to extract chemical elements symbols, the following might work.
It uses the periodic table in GitHub package chemr and a package stringr
function.


devtools::install_github("paleolimbot/chemr")



split_chem_elements <- function(x) {
data(pt, package = "chemr", envir = environment())
el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
pat <- paste(el, collapse = "|")
stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"


It is also possible to rewrite the function without calls to non base
packages but that will take some more work.

Hope this helps,

Rui Barradas






Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Rui Barradas

At 15:59 on 18/10/2023, Leonard Mada via R-help wrote:

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there 
any better ways?


I was working to extract chemical elements from a formula, something 
like this:

split.symbol.character = function(x, rm.digits = TRUE) {
     # Perl is partly broken in R 4.3, but this works:
     regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
     # stringi::stri_split(x, regex = regex);
     s = strsplit(x, regex, perl = TRUE);
     if(rm.digits) {
     s = lapply(s, function(s) {
         isNotD = is.na(suppressWarnings(as.numeric(s)));
         s = s[isNotD];
     });
     }
     return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)


Hello,

If you want to extract chemical element symbols, the following might work.
It uses the periodic table in GitHub package chemr and a package stringr 
function.



devtools::install_github("paleolimbot/chemr")



split_chem_elements <- function(x) {
  data(pt, package = "chemr", envir = environment())
  el <- pt$symbol[order(nchar(pt$symbol), decreasing = TRUE)]
  pat <- paste(el, collapse = "|")
  stringr::str_extract_all(x, pat)
}

mol <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
split_chem_elements(mol)
#> [[1]]
#> [1] "C"  "Cl" "F"
#>
#> [[2]]
#> [1] "Li" "Al" "H"
#>
#> [[3]]
#>  [1] "C"  "Cl" "C"  "O"  "Al" "P"  "O"  "Si" "O"  "Cl"


It is also possible to rewrite the function without calls to non base 
packages but that will take some more work.
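
One possible base R version is sketched below. It assumes the symbols are supplied as a plain character vector; the short symbols vector here is a hand-written subset standing in for chemr's pt$symbol, and the function name split_chem_elements_base is hypothetical.

```r
# Assumption: a small subset of symbols; a real table would list all elements.
symbols <- c("H", "Li", "C", "O", "F", "Al", "Si", "P", "Cl")

split_chem_elements_base <- function(x, symbols) {
  # Longer symbols first, so "Cl" is tried before "C"
  el <- symbols[order(nchar(symbols), decreasing = TRUE)]
  pat <- paste(el, collapse = "|")
  regmatches(x, gregexpr(pat, x, perl = TRUE))
}

res <- split_chem_elements_base(c("CCl3F", "Li4Al4H16"), symbols)
res[[1]]
# [1] "C"  "Cl" "F"
res[[2]]
# [1] "Li" "Al" "H"
```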


Hope this helps,

Rui Barradas




Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Ivan Krylov
On Wed, 18 Oct 2023 17:59:01 +0300
Leonard Mada via R-help wrote:

> What is the best way to test for numeric digits?
> 
> suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
> # [1] NA NA NA  2 NA NA  3
> The above requires the use of the suppressWarnings function. Are
> there any better ways?

This test also has the downside of accepting things like "1.2" and
"+1e-100". Since you need digits only, why not use a regular expression
to test for '^[0-9]+$'?
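
A short sketch of that test, using the vector from the question plus the two edge cases mentioned above:

```r
x <- c("Li", "Na", "K", "2", "Rb", "Ca", "3", "1.2", "+1e-100")

# TRUE only for strings made up entirely of the digits 0-9.
is.digits <- grepl("^[0-9]+$", x)
is.digits
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

x[!is.digits]
# [1] "Li"      "Na"      "K"       "Rb"      "Ca"      "1.2"     "+1e-100"
```

grepl() is vectorised and does not attempt a numeric conversion, so no warning is raised.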

> I was working to extract chemical elements from a formula, something 
> like this:

> split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))

Perhaps the following function could be made to work in your cases?

function(x) regmatches(x, gregexec('([A-Z][a-z]*)([0-9]*)', x))

retval[2,] is the element and retval[3,] is the coefficient. Do you
need brackets? Charges? Non-stoichiometric compounds? (SMILES?)
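
A usage sketch for the function above, assuming R 4.1.0 or later (where gregexec() is available). The name parse.formula is hypothetical and only gives the anonymous function a handle.

```r
# gregexec() is available in base R from version 4.1.0 onwards.
parse.formula <- function(x) regmatches(x, gregexec('([A-Z][a-z]*)([0-9]*)', x))

# One matrix per input string; one column per match.
retval <- parse.formula("Li4Al4H16")[[1]]
symbols <- unname(retval[2, ])
coefs <- unname(retval[3, ])

# An empty coefficient would mean the count was omitted, i.e. 1.
counts <- as.integer(ifelse(nzchar(coefs), coefs, "1"))
symbols
# [1] "Li" "Al" "H"
counts
# [1]  4  4 16
```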

> # broken in R 4.3.1
> # only slightly "erroneous" with stringi::stri_split
> regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
> strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = T)

strsplit() has special historical behaviour about empty matches:
https://bugs.r-project.org/show_bug.cgi?id=16745

It's unfortunate that it doesn't split on empty matches the way you
would intuitively expect it to, but changing the behaviour at this
point is hard. Even adding a flag may be complicated to implement. Do
you want such a flag?

-- 
Best regards,
Ivan


Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Jeff Newmiller via R-help
Use any occurrence of one or more digits as a separator?

s <- c( "CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl" )
strsplit( s, "\\d+" )
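
For reference, a short sketch of what this split returns on the three example formulas. Symbols that have no digit between them stay in one piece, so a further step would be needed to separate them:

```r
s <- c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl")
pieces <- strsplit(s, "\\d+")

pieces[[1]]   # "CCl" "F"
pieces[[2]]   # "Li" "Al" "H"
pieces[[3]]   # "CCl" "CO" "AlPO" "SiO" "Cl"
```

Only the second formula, where every symbol is followed by a count, comes out fully separated.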




Re: [R] Best way to test for numeric digits?

2023-10-18 Thread Ben Bolker

   There are some answers on Stack Overflow:

https://stackoverflow.com/questions/14984989/how-to-avoid-warning-when-introducing-nas-by-coercion





[R] Best way to test for numeric digits?

2023-10-18 Thread Leonard Mada via R-help

Dear List members,

What is the best way to test for numeric digits?

suppressWarnings(as.double(c("Li", "Na", "K",  "2", "Rb", "Ca", "3")))
# [1] NA NA NA  2 NA NA  3
The above requires the use of the suppressWarnings function. Are there 
any better ways?


I was working to extract chemical elements from a formula, something 
like this:

split.symbol.character = function(x, rm.digits = TRUE) {
    # Perl is partly broken in R 4.3, but this works:
    regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
    # stringi::stri_split(x, regex = regex);
    s = strsplit(x, regex, perl = TRUE);
    if(rm.digits) {
        # Drop the pieces that parse as numbers (the coefficients)
        s = lapply(s, function(s) {
            isNotD = is.na(suppressWarnings(as.numeric(s)));
            s[isNotD];
        });
    }
    return(s);
}

split.symbol.character(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"))


Sincerely,


Leonard


Note:
# works:
regex = "(?<=[A-Z])(?![a-z]|$)|(?<=.)(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = TRUE)


# broken in R 4.3.1
# only slightly "erroneous" with stringi::stri_split
regex = "(?<=[A-Z])(?![a-z]|$)|(?=[A-Z])|(?<=[a-z])(?=[^a-z])";
strsplit(c("CCl3F", "Li4Al4H16", "CCl2CO2AlPO4SiO4Cl"), regex, perl = TRUE)
