Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Hi Tomas,

Sorry for the false alarm! I did some further testing, and you were
right. There was no regression. I suspected it was a regression because
the user who reported the issue said his code worked in R 3.6 but not
4.0. I should have tested it more carefully by myself. After I tested it
again with the German locale and Chinese locale, respectively, I found
that the code worked for both versions of R in the German locale, and
failed in the Chinese locale.

Your explanation makes perfect sense to me. I have also read your blog
post when it came out last month, and I'm really looking forward to the
end of this character encoding pain! Thank you very much for the hard
work!

Regards,
Yihui
--
https://yihui.org

On Mon, Jun 22, 2020 at 3:37 AM Tomas Kalibera wrote:
>
> Hi Yihui,
>
> list.files() returns file names converted to native encoding by Windows,
> so one needs to use only characters representable in current native
> encoding for file names. If one wants to be safe, it makes sense to be
> much stricter than that (only ASCII, and only a subset of it, there is a
> number of recommendations that can be found online). Using more than
> that is asking for trouble.
>
> Unicode "\u00e4" is a Latin-1 character, so representable in CP1252. On
> my Windows running in CP1252 as C locale and system code page, your
> example works fine, file.exists() returns TRUE, and this is the expected
> behavior (tested in R-devel and R4.0).
>
> Your example was run in CP1252 as C locale but CP936 as the system code
> page (see the sessionInfo() output). On Windows, unfortunately, there
> are two different "current locales" at a time. With your settings
> (CP1252 as C locale and CP936 as system code page), I get the same
> results as you, file.exists() returns FALSE. enc2native(z) works fine
> and returns a valid Latin-1 string, but that is because here "native" is
> CP1252.
> Windows API functions and consequently some C library functions
> that return strings from the OS, however, convert to the encoding from
> the system code page, which is CP936 and it cannot represent "ä". So,
> currently the behavior you are reporting is expected for R 4.0 and
> earlier. I don't think this is a regression, it couldn't have worked
> before, either - and I've tested in 3.6.3 and 3.4.3 on my system.
>
> These problems will go away when UTF-8 is both the current native
> encoding for the C locale and the system code page. This is possible in
> recent Windows 10, but requires UCRT and hence a new toolchain to build
> R, and requires all packages and libraries to be rebuilt from source.
> More details on my blog, also there is experimental build of R
> (installer) and experimental toolchain available:
> https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html
>
> Best
> Tomas
>
>
> On 6/22/20 6:11 AM, Yihui Xie wrote:
> > Hi Tomas,
> >
> > I received a report about R 4.0.0 in the knitr package
> > (https://github.com/yihui/knitr/issues/1840), and I think it is
> > related to the issue here. I created a minimal reproducible example
> > below:
> >
> > owd = setwd(tempdir())
> > z = 'K\u00e4sch.txt'
> > file.create(z)
> > list.files()
> > file.exists(list.files())
> > setwd(owd)
> >
> > Output:
> >
> >> owd = setwd(tempdir())
> >> z = 'K\u00e4sch.txt'
> >> file.create(z)
> > [1] TRUE
> >> list.files()
> > [1] "K?sch.txt"
> >> file.exists(list.files())
> > [1] FALSE
> >> setwd(owd)
> >
> > I wonder if it is expected that file.exists() returns FALSE here.
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Hi Yihui,

list.files() returns file names converted to native encoding by Windows,
so one needs to use only characters representable in current native
encoding for file names. If one wants to be safe, it makes sense to be
much stricter than that (only ASCII, and only a subset of it, there is a
number of recommendations that can be found online). Using more than
that is asking for trouble.

Unicode "\u00e4" is a Latin-1 character, so representable in CP1252. On
my Windows running in CP1252 as C locale and system code page, your
example works fine, file.exists() returns TRUE, and this is the expected
behavior (tested in R-devel and R4.0).

Your example was run in CP1252 as C locale but CP936 as the system code
page (see the sessionInfo() output). On Windows, unfortunately, there
are two different "current locales" at a time. With your settings
(CP1252 as C locale and CP936 as system code page), I get the same
results as you, file.exists() returns FALSE. enc2native(z) works fine
and returns a valid Latin-1 string, but that is because here "native" is
CP1252. Windows API functions and consequently some C library functions
that return strings from the OS, however, convert to the encoding from
the system code page, which is CP936 and it cannot represent "ä". So,
currently the behavior you are reporting is expected for R 4.0 and
earlier. I don't think this is a regression, it couldn't have worked
before, either - and I've tested in 3.6.3 and 3.4.3 on my system.

These problems will go away when UTF-8 is both the current native
encoding for the C locale and the system code page. This is possible in
recent Windows 10, but requires UCRT and hence a new toolchain to build
R, and requires all packages and libraries to be rebuilt from source.
More details on my blog, also there is experimental build of R
(installer) and experimental toolchain available:
https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html

Best
Tomas

On 6/22/20 6:11 AM, Yihui Xie wrote:
> Hi Tomas,
>
> I received a report about R 4.0.0 in the knitr package
> (https://github.com/yihui/knitr/issues/1840), and I think it is
> related to the issue here. I created a minimal reproducible example
> below:
>
> owd = setwd(tempdir())
> z = 'K\u00e4sch.txt'
> file.create(z)
> list.files()
> file.exists(list.files())
> setwd(owd)
>
> Output:
>
>> owd = setwd(tempdir())
>> z = 'K\u00e4sch.txt'
>> file.create(z)
> [1] TRUE
>> list.files()
> [1] "K?sch.txt"
>> file.exists(list.files())
> [1] FALSE
>> setwd(owd)
>
> I wonder if it is expected that file.exists() returns FALSE here.
>
>> sessionInfo()
> R version 4.0.1 (2020-06-06)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 7 x64 (build 7601) Service Pack 1
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> system code page: 936
>
> FWIW, I also tested Chinese characters in the variable `z` above, and
> file.exists() returns TRUE only after I Sys.setlocale(, "Chinese").
>
> Regards,
> Yihui
>
> On Thu, Jun 11, 2020 at 3:11 AM Tomas Kalibera wrote:
>>
>> Dear Juan,
>>
>> I don't see what is the problem from your report. Please try to create a
>> minimal but complete reproducible example that does not use the renv
>> package. Perhaps you could use the R debugger (e.g. via
>> options(error=recover)) to find out what is the argument that
>> file.exists() has been called with. And then you could try just to call
>> file.exists() directly with that argument to trigger the problem.
>>
>> It may be that the argument has been corrupted/is invalid in the current
>> native encoding. If that is the case, the next step would be to find out
>> who corrupted it (renv, R, something else).
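For readers who want to check Tomas's point about representable characters on their own machine, here is a minimal sketch in base R. The helper name `is_representable` is made up for this illustration and is not part of R. It assumes that iconv() returns NA for input it cannot convert, which is the documented behaviour when `sub` is NA, although some platform iconv implementations substitute characters instead. The code page fields of l10n_info() are only present on Windows.

```r
# Sketch only: `is_representable` is a hypothetical helper, not part of base R.
# It reports whether each name survives conversion into a target encoding.
is_representable <- function(x, encoding) {
  # iconv() returns NA for inputs it cannot convert when `sub` is left as NA
  # (on most platforms; some iconv implementations substitute instead).
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = encoding))
}

z <- "K\u00e4sch.txt"

# "ä" is a Latin-1 character, so a Latin-1 target should accept it.
is_representable(z, "latin1")

# On Windows, "CP936" would be the target matching the system code page in
# the report. Whether that encoding name is accepted elsewhere depends on
# the platform's iconv, so the call is left commented out.
# is_representable(z, "CP936")

# Both "current locales" can be inspected; the code page fields are only
# present on Windows, so this may print an empty list elsewhere.
info <- l10n_info()
info[intersect(names(info), c("codepage", "system.codepage"))]
```

If the two code pages differ, as in the sessionInfo() output quoted in this thread, a name that passes the check for the C locale encoding can still fail for the system code page, which is the situation Tomas describes.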
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Hi Tomas,

I received a report about R 4.0.0 in the knitr package
(https://github.com/yihui/knitr/issues/1840), and I think it is
related to the issue here. I created a minimal reproducible example
below:

owd = setwd(tempdir())
z = 'K\u00e4sch.txt'
file.create(z)
list.files()
file.exists(list.files())
setwd(owd)

Output:

> owd = setwd(tempdir())
> z = 'K\u00e4sch.txt'
> file.create(z)
[1] TRUE
> list.files()
[1] "K?sch.txt"
> file.exists(list.files())
[1] FALSE
> setwd(owd)

I wonder if it is expected that file.exists() returns FALSE here.

> sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
system code page: 936

FWIW, I also tested Chinese characters in the variable `z` above, and
file.exists() returns TRUE only after I Sys.setlocale(, "Chinese").

Regards,
Yihui

On Thu, Jun 11, 2020 at 3:11 AM Tomas Kalibera wrote:
>
>
> Dear Juan,
>
> I don't see what is the problem from your report. Please try to create a
> minimal but complete reproducible example that does not use the renv
> package. Perhaps you could use the R debugger (e.g. via
> options(error=recover)) to find out what is the argument that
> file.exists() has been called with. And then you could try just to call
> file.exists() directly with that argument to trigger the problem.
>
> It may be that the argument has been corrupted/is invalid in the current
> native encoding. If that is the case, the next step would be to find out
> who corrupted it (renv, R, something else). The error is displayed when
> a path name cannot be converted from the current native encoding to
> UTF16-LE.
>
> The experimental support for UTF-8 as native encoding on Windows 10 is
> only available in a custom build of R, like the one I linked from my
> blog post.
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Dear Juan,

I don't see what is the problem from your report. Please try to create a
minimal but complete reproducible example that does not use the renv
package. Perhaps you could use the R debugger (e.g. via
options(error=recover)) to find out what is the argument that
file.exists() has been called with. And then you could try just to call
file.exists() directly with that argument to trigger the problem.

It may be that the argument has been corrupted/is invalid in the current
native encoding. If that is the case, the next step would be to find out
who corrupted it (renv, R, something else). The error is displayed when
a path name cannot be converted from the current native encoding to
UTF16-LE.

The experimental support for UTF-8 as native encoding on Windows 10 is
only available in a custom build of R, like the one I linked from my
blog post.

Thanks
Tomas

On 6/10/20 1:06 PM, Juan Telleria Ruiz de Aguirre wrote:
> Dear R Developers,
>
> I am having an issue with the renv package and R 4.0.1, which I
> suspect is related to base R and not the renv package itself, as with
> R 3.6.3 such an "error" does not appear.
>
> The error is raised by a file.exists() path, and path
> "C:\Users\J-tel\Documents", which in R 3.6.3 is read correctly, but in
> R 4.0.1 fails (Probably because of the "-" symbol), and I suspect it
> might be related with the new UTF-8 usage on Windows 10?
> (https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html)
>
> I have also checked file.exists() function and its internals, and seem
> not to have happened changes in the meanwhile within them:
>
> https://github.com/wch/r-source/blob/0e3b3182f87a60af4b0293a5410dde680b910f49/src/library/base/R/files.R
> https://github.com/search?q=SEXP%20attribute_hidden%20do_fileexists+repo:wch/r-source&type=Code
>
> Error Details:
>
>> renv::init()
> Error in file.exists(children) :
>   file name conversion problem -- name too long?
>> traceback()
> 14: file.exists(children)
> 13: renv_dependencies_find_dir_children(path, root)
> 12: renv_dependencies_find_dir(path, root)
> 11: FUN(X[[i]], ...)
> 10: lapply(path, renv_dependencies_find_impl, root = root)
> 9: renv_dependencies_find(path, root)
> 8: (function (path = getwd(), root = NULL, ..., progress = TRUE,
>        errors = c("reported", "fatal", "ignored"), dev = FALSE)
>    {
>        path <- renv_path_normalize(path, winslash = "/", mustWork = TRUE)
>        root <- root %||% renv_dependencies_root(path)
>        if (exists(path, envir = `_renv_dependencies`))
>            return(get(path, envir = `_renv_dependencies`))
>        renv_dependencies_begin(root = root)
>        on.exit(renv_dependencies_end(), add = TRUE)
>        dots <- list(...)
>        if (identical(dots[["quiet"]], TRUE)) {
>            progress <- FALSE
>            errors <- "ignored"
>        }
>        files <- renv_dependencies_find(path, root)
>        deps <- renv_dependencies_discover(files, progress, errors)
>        renv_dependencies_report(errors)
>        deps
>    })(path, progress = FALSE, errors = errors, dev = TRUE)
> 7: eval(call, envir = parent.frame(2))
> 6: eval(call, envir = parent.frame(2))
> 5: delegate(renv_dependencies_impl)
> 4: dependencies(path, progress = FALSE, errors = errors, dev = TRUE)
> 3: withCallingHandlers(dependencies(path, progress = FALSE, errors = errors,
>        dev = TRUE), renv.dependencies.error =
>        renv_dependencies_error_handler(message, errors))
> 2: renv_dependencies_scope(project, action = "init")
> 1: renv::init()
>
>> renv::diagnostics()
> Diagnostics Report -- renv [0.10.0]
> ===
>
> # Session Info ===
> R version 4.0.1 (2020-06-06)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18362)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
> [5] LC_TIME=Spanish_Spain.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] renv_0.10.0
>
> loaded via a namespace (and not attached):
>  [1] compiler_4.0.1   rsconnect_0.8.16 htmltools_0.4.0  tools_4.0.1
>  [5] yaml_2.2.1       Rcpp_1.0.4.6     rmarkdown_2.2    knitr_1.28
>  [9] xfun_0.14        digest_0.6.25    packrat_0.5.0    rlang_0.4.6
> [13] evaluate_0.14
>
> # Project
> Project path: "~/Test2"
>
> # Status =
>
> # Lockfile ===
> This project has not yet been snapshotted: 'renv.lock' does not exist.
>
> # Library
> The project library "~/Test2/renv/library/R-4.0/x86_64-w64-mingw32" does not exist.
>
> # Dependencies ===
>
> # User Profile ===
> [no user profile detected]
>
> # Settings ===
> List of 6
>  $ external.libraries       : chr(0)
>  $ ignored.packages         : chr(0)
>  $ package.dependency.fields: chr [1:3] "Imports" "Depends" "LinkingTo"
>  $ snapshot.type: chr "
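The debugging route suggested at the top of this message (find the argument that file.exists() was called with, then call file.exists() on it directly) could look roughly like the sketch below. The helper `describe_paths` is a hypothetical name introduced here, not something from R or renv. In a real session the input would be the `children` vector from frame 14 of the traceback, reached for instance via options(error = recover); the paths used here are only stand-ins.

```r
# Sketch only: `describe_paths` is a hypothetical helper for inspecting the
# argument that file.exists() was called with.
describe_paths <- function(paths) {
  data.frame(
    path     = paths,
    encoding = Encoding(paths),               # declared encoding mark
    valid    = validEnc(paths),               # FALSE if invalid in that encoding
    n_bytes  = nchar(paths, type = "bytes"),  # stored length in bytes
    exists   = vapply(
      paths,
      # Call file.exists() one path at a time so that a conversion error
      # on one element is recorded as NA instead of aborting the check.
      function(p) tryCatch(file.exists(p), error = function(e) NA),
      logical(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Stand-in input: one path that exists and one that does not.
describe_paths(c(tempdir(), file.path(tempdir(), "no-such-file.txt")))
```

A row with `valid` equal to FALSE or `exists` equal to NA would point at the element that cannot be converted, which is the piece worth putting into a minimal report.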
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Thank you Kevin, just checked that the error is solved in the latest
development version of "renv", and now it works as expected with R 4.0.1:

https://github.com/rstudio/renv/commit/976ae7af6dc348af30eaf2893d886f132a76aba0

Sorry for posting in r-devel, I was not sure if it was an R or "renv"
error, given the different behaviour between R 4.0.1 and R 3.6.3 when
converting from UTF16-LE to UTF-8 encoding.

Next time I will provide a better reproducible example and trace back
the error with options(error=recover) to make sure.

Thanks,
Juan

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
Hi Juan,

For bug reports to R, you should attempt to create a
minimally-reproducible example, using only R's builtin facilities and
not any other addon packages. Given your report, it's not clear whether
the issue lies within renv or truly is caused by a change in R 4.0.0.

Also note that you have not supplied a minimally reproducible example.
If at all possible, you should be able to supply some code that
reproduces the issue -- ideally, one should be able to just copy + paste
the code into an R session to see the issue arise. Presumably, if the
issue is indeed in base R, then you should be able to supply a
reproducible example of the form:

    path <- "path/that/causes/issue"
    file.exists(path)

Alternatively, if you can distill this into a minimally-reproducible
example that does require renv, then you should report that to the
maintainer of renv (me), not this mailing list.

Best,
Kevin

On Wed, Jun 10, 2020 at 4:55 AM Dirk Eddelbuettel wrote:
>
>
> On 10 June 2020 at 13:06, Juan Telleria Ruiz de Aguirre wrote:
> | I am having an issue with the renv package and R 4.0.1, which I
> | suspect is related to base R and not the renv package itself, as with
> | R 3.6.3 such an "error" does not appear.
>
> So a bug in `renv` as it does not account for changes in R 4.0.0?
>
> Stuff happens. I just fixed a 'change in R 4.0.0' for one small aspect of
> Rcpp(Armadillo) (namely the change in package.skeleton() and NAMESPACE).
>
> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
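As an illustration of the kind of copy-and-paste report Kevin asks for, here is a small sketch that uses only base R and cleans up after itself. The function name `check_roundtrip` is invented for this example. It creates a file in a fresh temporary directory, lists the directory, and asks file.exists() about the listed name, so any mismatch between what was created and what the OS reports back becomes visible.

```r
# Sketch of a self-contained report script using only base R.
# `check_roundtrip` is a hypothetical name chosen for this illustration.
check_roundtrip <- function(name) {
  dir <- tempfile("reprex-")
  dir.create(dir)
  # Remove the scratch directory again when the function exits.
  on.exit(unlink(dir, recursive = TRUE), add = TRUE)

  created <- file.create(file.path(dir, name))
  listed  <- list.files(dir)

  list(
    created = created,                              # did file.create() succeed?
    listed  = listed,                               # name as returned by the OS
    exists  = file.exists(file.path(dir, listed))   # does the listed name resolve?
  )
}

# An ASCII-only name is expected to round-trip in any locale.
str(check_roundtrip("plain.txt"))

# Environment details that let others reproduce the report.
print(sessionInfo())
```

Swapping "plain.txt" for the problematic name and posting the output together with sessionInfo() gives readers of the list everything they need without installing any package.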
Re: [Rd] Possible Bug: file.exists() Function. Due to UTF-8 Encoding differences on Windows between R 4.0.1 and R 3.6.3?
On 10 June 2020 at 13:06, Juan Telleria Ruiz de Aguirre wrote:
| I am having an issue with the renv package and R 4.0.1, which I
| suspect is related to base R and not the renv package itself, as with
| R 3.6.3 such an "error" does not appear.

So a bug in `renv` as it does not account for changes in R 4.0.0?

Stuff happens. I just fixed a 'change in R 4.0.0' for one small aspect of
Rcpp(Armadillo) (namely the change in package.skeleton() and NAMESPACE).

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org