Re: [Rd] R-devel internal errors during check produce?
> Jan Gorecki writes:

> Thank you both,
> You are absolutely correct that example should be minimal, so here it is.

> l = list(a=new.env(), b=new.env())
> unique(l)

> Just for completeness, env_list during check that raises error

> env_list <- list(baseenv(),
>   as.environment("package:graphics"),
>   as.environment("package:stats"),
>   as.environment("package:utils"),
>   as.environment("package:methods")
> )
> unique(env_list)

Thanks ... but the above work fine for me.  E.g.,

R> l = list(a=new.env(), b=new.env())
R> unique(l)
[[1]]

[[2]]

Best
-k

> Best regards,
> Jan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [External] Possible ABI change in R 4.0.1
EXTPTR_PTR is not in the API, so it is not guaranteed to even exist in
the future. The API function for accessing the pointer address is
R_ExternalPtrAddr. See Section 5.13 in WRE.

Sometimes internals need to be changed. In this case a change was made
to deal with a segfault; the commit notice tells you the PR this
addressed.

As it says in Writing R Extensions about defining USE_RINTERNALS:

    Also be prepared to adjust your code should R internals change.

The same goes for any use of non-API macros and functions.

Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
Re: [Rd] R-devel internal errors during check produce?
Thank you both,
You are absolutely correct that the example should be minimal, so here it is.

l = list(a=new.env(), b=new.env())
unique(l)

Just for completeness, this is the env_list that raises the error during check:

env_list <- list(baseenv(),
  as.environment("package:graphics"),
  as.environment("package:stats"),
  as.environment("package:utils"),
  as.environment("package:methods")
)
unique(env_list)

Best regards,
Jan
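A minimal sketch of how the two-line reproduction above could be wrapped so that a script reports either the de-duplicated list or the error text, whichever the running build produces. The helper name try_unique is hypothetical and is not part of R or the tools package.

```r
# Hypothetical helper: run unique() and return either its result or the
# error message as a character string, so the call never aborts a script.
try_unique <- function(x) {
  tryCatch(unique(x), error = function(e) conditionMessage(e))
}

e <- new.env()
distinct <- list(a = new.env(), b = new.env())  # two different environments
shared   <- list(a = e, b = e)                  # the same environment twice

res_distinct <- try_unique(distinct)
res_shared   <- try_unique(shared)

# On a build without the problem both results are lists; on an affected
# build they would instead hold the error text reported in this thread.
str(res_distinct, max.level = 1)
str(res_shared, max.level = 1)
```

Comparing the two results also shows whether environments are being compared by identity, which is what the check code relies on when it de-duplicates env_list.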
Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"
Did you test with R 4.0.2 or R-devel? A bug related to this issue was
recently fixed:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17833

Best,
Kevin
[Rd] Possible ABI change in R 4.0.1
Hi all,

it seems that from R 4.0.1 EXTPTR_PTR can be either a macro or a
function, depending on whether USE_RINTERNALS is requested. Jeroen
helped me find that this was in 78592:
https://github.com/wch/r-source/commit/c634fec5214e73747b44d7c0e6f047fefe44667d

This is a problem, because binary packages that are built on R 4.0.1
or R 4.0.2 will potentially not load on R 4.0.0, if they use the
EXTPTR_PTR function. E.g. this is R 4.0.0 on Linux:

> library(Rcpp)
Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file,
DLLpath = DLLpath, ...):
 unable to load shared object '/usr/local/lib/R/library/Rcpp/libs/Rcpp.so':
  Error relocating /usr/local/lib/R/library/Rcpp/libs/Rcpp.so:
EXTPTR_PTR: symbol not found
In addition: Warning message:
package ‘Rcpp’ was built under R version 4.0.1

It is easiest to reproduce this on Windows, because the CRAN binaries
are now built on R 4.0.2, so if you install Rcpp on R 4.0.0 from CRAN,
and try to load it you'll get:

> library(Rcpp)
Error: package or namespace load failed for 'Rcpp' in inDL(x,
as.logical(local), as.logical(now), ...):
 unable to load shared object
'C:/Users/csard/R/win-library/4.0/Rcpp/libs/x64/Rcpp.dll':
  LoadLibrary failure:  The specified procedure could not be found.
In addition: Warning message:
package 'Rcpp' was built under R version 4.0.2

I suppose this change was not intended?

Best,
Gabor
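The "was built under R version ..." warning in both transcripts points at the mismatch between the R that built the binary and the R that loads it. A rough sketch of checking for that situation ahead of time follows; the helper names built_under and built_on_newer_r are hypothetical, and the sketch assumes the DESCRIPTION "Built" field starts with "R <version>;", as it does for installed packages.

```r
# Hypothetical helper: extract the R version recorded in an installed
# package's "Built" field (a string that starts with "R <version>;").
built_under <- function(pkg) {
  built <- utils::packageDescription(pkg, fields = "Built")
  package_version(sub("^R ([0-9.]+);.*$", "\\1", built))
}

# Hypothetical helper: TRUE if the installed binary was built under a
# newer R than the one currently running.
built_on_newer_r <- function(pkg) {
  built_under(pkg) > getRversion()
}

# "stats" ships with R itself, so it serves as a safe demonstration.
built_under("stats")
built_on_newer_r("stats")
```

This only detects the version mismatch; it says nothing about whether a particular shared object actually references a symbol such as EXTPTR_PTR.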
Re: [Rd] `basename` and `dirname` change the encoding to "UTF-8"
On 29/06/2020 10:39 a.m., Johannes Rauh wrote:
> Dear R Developers,
>
> I noticed that `basename` and `dirname` always return "UTF-8" on Windows
> (tested with R-4.0.0 and R-3.6.3):
>
>> p <- "Föö/Bär"
>> Encoding(p)
> [1] "latin1"
>> Encoding(dirname(p))
> [1] "UTF-8"
>> Encoding(basename(p))
> [1] "UTF-8"
>
> Is this on purpose? At least I did not find any relevant comment in the
> documentation of `dirname`/`basename`.
>
> Background: I'm currently struggling with a directory name containing a
> latin1-character. (I know that this is a bad idea, but I did not create
> the directory and I cannot rename it.) I now want to pass a
> latin1-directory name to a function, which internally uses
> `tools::makeLazyLoadDB`. At that point, internally, `dirname` is called,
> which changes the encoding, and things break. If I use `debug` to halt the
> processing and "fix" the encoding, things work as expected.
>
> So, if possible, I would prefer that `dirname` and `basename` preserve the
> encoding.

Actually, makeLazyLoadDB isn't exported from tools, so strictly speaking
you shouldn't be calling it.  Or perhaps you have a good reason to call
it, and should be asking for it to be exported, or you are calling a
published function which calls it: in either case it should probably be
fixed to accept UTF-8.

But it doesn't call dirname or basename, so maybe the function that
calls it is the one that needs fixing.

In any case, while asking dirname() and basename() to preserve the
encoding sounds reasonable, it seems like it would just be covering up a
deeper problem.

Duncan Murdoch
Re: [Rd] R-devel internal errors during check produce?
> Kurt Hornik
>     on Mon, 29 Jun 2020 16:13:03 +0200 writes:

> Jan Gorecki writes:
>> So the unique.default is from the R tools package during
>> checks.  I don't see those issues on CRAN checks.

> I cannot reproduce this locally (and have no clues about
> docker).  Perhaps you can try to debug this on your end?
> And see what env_list is when the error occurs?

> Best -k

Indeed, if it is a bug in R (as opposed to being an assumption
that 'data.table' makes about undocumented R internals), it
should be reproducible with a very small dummy package instead
of data.table.
... or actually reproducible with relatively
simple R code calling unique() not involving any non-base package.

Martin
[Rd] `basename` and `dirname` change the encoding to "UTF-8"
Dear R Developers,

I noticed that `basename` and `dirname` always return "UTF-8" on Windows
(tested with R-4.0.0 and R-3.6.3):

> p <- "Föö/Bär"
> Encoding(p)
[1] "latin1"
> Encoding(dirname(p))
[1] "UTF-8"
> Encoding(basename(p))
[1] "UTF-8"

Is this on purpose? At least I did not find any relevant comment in the
documentation of `dirname`/`basename`.

Background: I'm currently struggling with a directory name containing a
latin1-character. (I know that this is a bad idea, but I did not create
the directory and I cannot rename it.) I now want to pass a
latin1-directory name to a function, which internally uses
`tools::makeLazyLoadDB`. At that point, internally, `dirname` is called,
which changes the encoding, and things break. If I use `debug` to halt the
processing and "fix" the encoding, things work as expected.

So, if possible, I would prefer that `dirname` and `basename` preserve the
encoding.

Best regards
Johannes
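For readers who want to try this without a Latin-1 source file, here is a small sketch that builds the same path from escapes and converts explicitly where a specific encoding is needed. The helper as_latin1 is hypothetical; what Encoding(dirname(p)) reports depends on the platform and R version, so no output is shown for it.

```r
# Build the path from Unicode escapes so the example does not depend on
# the encoding of this file; "\u00f6" is o-umlaut, "\u00e4" is a-umlaut.
p_utf8 <- "F\u00f6\u00f6/B\u00e4r"

# Hypothetical helper: re-encode a character vector to Latin-1; iconv()
# marks the non-ASCII results as "latin1".
as_latin1 <- function(x) iconv(enc2utf8(x), from = "UTF-8", to = "latin1")

p <- as_latin1(p_utf8)
Encoding(p)        # declared encoding of the input

d <- dirname(p)
Encoding(d)        # may differ from the input's declared encoding

# If a downstream function insists on Latin-1, convert explicitly rather
# than relying on dirname() to preserve the mark.
d_latin1 <- as_latin1(d)
```

This is a workaround sketch only; it does not address the question of whether `dirname` and `basename` should preserve the declared encoding.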
Re: [Rd] "R CMD Sweave --driver=..." woes
> Vincent Goulet via R-devel writes:

Thanks: fixed now in the trunk with c78751.

Best
-k

> In trying to change the driver used by Sweave on the command line using

>    R CMD Sweave --driver=foo

> I consistently get the "directory 'foo' does not exist" error. (For any value
> of 'foo', even the default 'RweaveLatex'.)

> Looking up the source code for function .Sweave that is called by 'R CMD
> Sweave', I notice that the argument 'driver', if used, is added to the vector
> of arguments of 'buildVignette' without being named. It ends up being passed
> to argument 'dir', hence the error.

> I believe the simple patch below should fix the issue, but I wasn't able to
> test it.

> Hope this helps.

> v.

> Vincent Goulet
> Professeur titulaire
> École d'actuariat, Université Laval

> Index: src/library/utils/R/Sweave.R
> ===
> --- src/library/utils/R/Sweave.R (revision 78746)
> +++ src/library/utils/R/Sweave.R (working copy)
> @@ -516,7 +516,7 @@
>          do_exit(1L)
>      }
>      args <- list(file=file, tangle=FALSE, latex=toPDF, engine=engine, clean=clean)
> -    if(nzchar(driver)) args <- c(args, driver)
> +    if(nzchar(driver)) args <- c(args, driver=driver)
>      args <- c(args, encoding = encoding)
>      if(nzchar(options)) {
>          opts <- eval(str2expression(paste0("list(", options, ")")))
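The effect of the one-word patch can be illustrated with a toy function. This is only a sketch: build_stub is a hypothetical stand-in whose second formal is 'dir', chosen to mimic the argument order described above; it is not the real buildVignette.

```r
# Hypothetical stand-in: reports which formal each argument ended up in.
build_stub <- function(file, dir = ".", ..., encoding = "") {
  list(dir = dir, dots = list(...))
}

args <- list(file = "doc.Rnw")
driver <- "RweaveLatex"

# Before the patch: the driver is appended without a name, so do.call()
# matches it positionally to the first free formal, which is 'dir'.
unnamed <- do.call(build_stub, c(args, driver))

# After the patch: the element is named, so it can no longer land in
# 'dir' and is passed on through '...'.
named <- do.call(build_stub, c(args, driver = driver))

unnamed$dir
named$dir
named$dots
```

The design point is simply that c() on a list keeps names only when they are supplied, and do.call() then applies ordinary argument matching to the resulting list.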
Re: [Rd] R-devel internal errors during check produce?
> Jan Gorecki writes:

> So the unique.default is from the R tools package during checks.
> I don't see those issues on CRAN checks.

I cannot reproduce this locally (and have no clues about docker).

Perhaps you can try to debug this on your end?  And see what env_list is
when the error occurs?

Best
-k

> Exact environment where I am reproducing this issue is a fresh ubuntu,
> no R packages pre-installed
> docker pull registry.gitlab.com/jangorecki/dockerfiles/r-devel
> https://gitlab.com/jangorecki/dockerfiles/-/raw/master/r-devel/Dockerfile

> On Sat, Jun 27, 2020 at 12:37 AM Jan Gorecki wrote:
>>
>> Hi R developers,
>>
>> On R-devel (2020-06-24 r78746) I am getting those two new exceptions
>> during R check. I found a change which may possibly be related
>> https://github.com/wch/r-source/commit/69de92b9fb1b7f2a7c8d1394b8d56050881a5465
>> I think this may be a regression. I grep'ed package manuals and R code
>> for unique.default but don't see any. Usage section of the unique
>> method looks fine as well. Errors look a little bit like internal
>> errors.
>>
>> * checking Rd \usage sections ... NOTE
>> Error in unique.default(env_list) :
>>   LENGTH or similar applied to environment object
>> Calls: ... .get_S3_generics_as_seen_from_package ->
>> unique -> unique.default
>> Execution halted
>> The \usage entries for S3 methods should use the \method markup and not
>> their full name.
>> * checking S3 generic/method consistency ... WARNING
>> Error in unique.default(env_list) :
>>   LENGTH or similar applied to environment object
>> Calls: ... .get_S3_generics_as_seen_from_package ->
>> unique -> unique.default
>>
>> I don't know if it is related but I build R-devel with extra args:
>> --with-recommended-packages --enable-strict-barrier --disable-long-double
>> I check with:
>> --as-cran --no-manual
>> To reproduce download current data.table from CRAN (1.12.8) and run R check
>>
>> Best regards,
>> Jan Gorecki
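One way to act on the suggestion to "see what env_list is" is to summarise the list without calling unique() on it at all. The following is a sketch only; describe_envs is a hypothetical helper, and the example list uses environments that exist in any R session rather than the ones from the failing check.

```r
# Hypothetical helper: report the type and name of every element of a
# list that is expected to hold environments.
describe_envs <- function(env_list) {
  data.frame(
    type = vapply(env_list, typeof, character(1)),
    name = vapply(env_list, function(e) {
      if (is.environment(e)) environmentName(e) else NA_character_
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

env_list <- list(baseenv(), globalenv(), new.env())
info <- describe_envs(env_list)
print(info)
```

An element whose type is not "environment", or an unexpected name, would be a hint about what differs between the failing setup and one where the check passes.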
[Rd] A warning in gzcon but not in gzfile
Hi all,

I used `gzfile` and `gzcon` to read a compressed file but I found that
`gzcon` gave me a different result than `gzfile`. It seems like `gzcon`
does not handle the data correctly. I have posted an example below. In
the example, a portion of a compressed file is downloaded from Google
Cloud as a raw vector, and the data is saved into a temp file. If I use
`gzfile` to read the file, it can show the first 1000 lines
successfully. However, if I wrap the raw vector as a connection, and use
`gzcon` to read from that connection, it shows the first 884 lines along
with a warning (see the output).

code:

# install.packages("BiocManager")
# BiocManager::install("GCSConnection", version = "devel")
library(GCSConnection)
## Download data from cloud
uri <- "gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
con <- gcs_connection(uri)
data <- readBin(con, raw(), 4*1024*1024)
close(con)
## write data to a file
file_path <- tempfile()
writeBin(data, file_path)
## Read the data using `gzfile`
con1 <- gzfile(file_path)
str(readLines(con1, 1000))
## Read the data using `gzcon`
## We create a raw connection from the raw vector
con2 <- gzcon(rawConnection(data))
str(readLines(con2, 1000))

output:

> str(readLines(con1, 1000))
 chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> str(readLines(con2, 1000))
 chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
Warning message:
In readLines(con2, 1000) : incomplete final line found on 'gzcon(data)'

I am not sure if this is caused by a bug in `gzcon` or by misuse of the
function. The same result can be observed on R 4.0 and R 4.1 devel on
Windows. Here is my session info; I hope it is helpful. Any suggestions
and help would be appreciated.

R Under development (unstable) (2020-06-27 r78747)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
system code page: 65001

Best,
Jiefei
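To separate the behaviour of `gzcon` from anything specific to the downloaded file, the same two read paths can be compared on a small gzip file written locally. This is a sketch under a stated assumption: it uses an ordinary single-stream gzip file created with `gzfile`, so it does not model the block-gzipped (.bgz), partially downloaded file from the report.

```r
# Write a small gzip-compressed text file locally.
lines_in <- sprintf("record %d", 1:1000)
path <- tempfile(fileext = ".gz")
con <- gzfile(path, "w")
writeLines(lines_in, con)
close(con)

# Load the compressed bytes into a raw vector, as in the report.
bytes <- readBin(path, raw(), n = file.size(path))

# Path 1: let gzfile() decompress straight from the file.
con1 <- gzfile(path)
via_gzfile <- readLines(con1)
close(con1)

# Path 2: wrap the raw vector in a connection and decompress with gzcon().
con2 <- gzcon(rawConnection(bytes))
via_gzcon <- readLines(con2)
close(con2)

identical(via_gzfile, via_gzcon)
length(via_gzcon)
```

If the two paths agree here but not on the downloaded data, the difference more likely lies in how the two functions treat that particular input than in `gzcon` being unable to read from a raw connection at all.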
Re: [Rd] Error in substring: invalid multibyte string
From the user's (or package author's) point of view, all strings should
always be valid in their declared encoding. If they are not, the result
of string operations is undefined: it may be an error or warning, but it
may also be a silently produced correct or incorrect result. There are R
functions that check if a string is valid. In this example, the string
was invalid in its declared encoding.

From the viewpoint of the R implementation (or of external software),
some operations such as substring can be carried out in a well-defined
way even on strings with invalid characters, or characters invalid in
specific ways, usually only in some encodings (e.g. UTF-8), and the
implementation is then more complicated. Some operations can't be well
defined on such strings.

It may seem it would make sense to ban all invalid strings (not allow
their creation) so as not to mask errors like the one you have
encountered, but it is sometimes better for debugging to be able to
include invalid strings in error and diagnostic messages.

Moreover, some systems support invalid strings in some operations also
as they may appear in file names. On Windows, file names may include
unpaired UTF-16 surrogates, which can't be represented in UTF-8. Some
systems allow representing invalid strings in a custom way that is a
valid string but preserves the information, only in some encodings (e.g.
in UTF-8).

So differences in how invalid strings are treated by different R
functions are to be expected. The same applies to differences with
respect to external software. Some may be optimized for UTF-8 and
support invalid strings in more cases (R does not support substring on
invalid strings); of course others may have bugs or intentionally may
not check strings for validity when that is perceived as too slow in a
given operation.

Best
Tomas

On 6/28/20 12:38 AM, Toby Hocking wrote:
> Thanks for the quick response Ivan.
> readLines with encoding='latin1' works for me (on Ubuntu).
>
> However I was more concerned with the inconsistency in results between
> substr and regexpr. I was expecting that if one of them errors because of
> an unknown encoding then the other should as well. Even better, if regexpr
> works, why shouldn't substr work as well?
>
> Incidentally the analogous stringi function stri_sub works fine in this
> case:
>
>> stringi::stri_sub("Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
> [1] "Jens Oehlschl\xe4gel-Akiyoshi"
>
> But the stringi analog to nchar gives a similar warning:
>
>> stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi")
> [1] NA
> Warning message:
> In stringi::stri_length("Jens Oehlschl\xe4gel-Akiyoshi") :
>   invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
> On Sat, Jun 27, 2020 at 2:12 AM Ivan Krylov wrote:
>
>> On Fri, 26 Jun 2020 15:57:06 -0700
>> Toby Hocking wrote:
>>
>>> invalid multibyte string at 'gel-A<6b>iyoshi'
>>> https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html
>>
>> The server says that the text is UTF-8:
>>
>> curl -sI \
>>  https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html | \
>>  grep Content-Type
>> # Content-Type: text/html; charset=UTF-8
>>
>> But it's not, at least not all of it. If you ask readLines to mark the
>> text as Latin-1, you get Jens Oehlschlägel-Akiyoshi without the
>> mojibake and invalid multi-byte characters:
>>
>> x <- readLines(
>>  'https://stat.ethz.ch/pipermail/r-devel/1999-November/author.html',
>>  encoding = 'latin1'
>> )[28]
>> substr(x, 1, 100)
>> # [1] "Jens Oehlschlägel-Akiyoshi"
>>
>> The behaviour we observe when encoding = 'latin1' is not specified
>> results from returned lines having "unknown" encoding. The substr()
>> implementation tries to interpret such strings according to multi-byte
>> C locale rules (using mbrtowc(3)). On my system (yours too, probably,
>> if it's GNU/Linux or macOS), the multi-byte C locale encoding is UTF-8,
>> and this Latin-1 string does not result in valid code points when
>> decoded as UTF-8.
>>
>> --
>> Best regards,
>> Ivan
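The validity check mentioned in the reply, and the Latin-1 reading suggested earlier in the thread, can be sketched on the string from the report itself. This is a minimal illustration, assuming the stray byte really is Latin-1 (as the thread concludes); it uses `validUTF8()` and `iconv()` from base R and is not a general recipe for guessing encodings.

```r
# The name as it arrives: the byte 0xE4 is Latin-1 for a-umlaut, but the
# string carries no encoding mark.
x <- "Jens Oehlschl\xe4gel-Akiyoshi"

# Check validity before handing the string to substr() and friends.
validUTF8(x)

# Re-encode from the encoding the bytes are actually in; the result is
# valid UTF-8 and marked as such.
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)
Encoding(y)

# String operations are now well defined.
substr(y, 1, 100)
nchar(y)
```

Checking first and converting explicitly keeps the decision about the source encoding in the caller's hands, which fits the point above that R does not promise any particular behaviour for invalid strings.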