Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
I do not blame anybody, and I have huge respect for all the authors of R. Actually, I like R very much, and I would like to thank everyone who contributes to it. I use R regularly in my work (I moved from Java, C# and Matlab), I have created the package rPraat for phonetic analyses, and I think R is a very well designed language that will survive for decades. I am trying to bring new users (my students at a non-technical university) to programming for their everyday problems (statistics, phonetic analyses, text processing), and they enjoy R. I am really positive about this (it is hard to express emotions in e-mails without using emoticons in every sentence). And that is why I would like it to be even closer to perfect. I only suggest adding one line of code (metaphorically) to the source() function in R for Windows to make it even better and to warn all users who do not read the whole documentation for each function thoroughly and carefully.

Tomas

On Thu, Apr 11, 2019 at 9:54 AM Tomas Kalibera wrote:
>
> On 4/11/19 9:10 AM, Tomáš Bořil wrote:
> > Or, if this cannot be done easily, please, disable the "utf-8" value
> > in the source(...) function in R on Windows:
> > source(..., encoding = "utf-8")
> > -> error: "utf-8" does not work right on Windows.
> > -> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
> > and some characters in string literals may be automatically changed.
> >
> > Because, in its current state, the UTF-8 encoding of R source files on
> > Windows is a fake Unicode, as it can handle only 256 different ANSI
> > characters in reality.
>
> This is not a fair statement. source(,encoding="UTF-8") works as
> documented. It translates from (full) UTF-8 to the current native
> encoding, which is documented. I believe the authors who made these
> design decisions over a decade ago, under different circumstances, and
> carefully implemented the code, tested it, and documented it for you to
> use for free, deserve to be addressed with some respect.
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/11/19 9:10 AM, Tomáš Bořil wrote:
> Or, if this cannot be done easily, please, disable the "utf-8" value
> in the source(...) function in R on Windows:
> source(..., encoding = "utf-8")
> -> error: "utf-8" does not work right on Windows.
> -> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
> and some characters in string literals may be automatically changed.
>
> Because, in its current state, the UTF-8 encoding of R source files on
> Windows is a fake Unicode, as it can handle only 256 different ANSI
> characters in reality.

This is not a fair statement. source(,encoding="UTF-8") works as documented. It translates from (full) UTF-8 to the current native encoding, which is documented. I believe the authors who made these design decisions over a decade ago, under different circumstances, and carefully implemented the code, tested it, and documented it for you to use for free, deserve to be addressed with some respect. It is not their responsibility to read the documentation for you, and if you had read and understood it, you would not have used source(,encoding="UTF-8") with characters not representable in the current native encoding on Windows. The authors should not be blamed for the fact that the design does not seem perfect for _today's_ systems (and how could they have guessed at that time that Windows would still not support UTF-8 as a native encoding today).

Tomas

On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil wrote:
> For me, this would be a perfect solution.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Or, if this cannot be done easily, please, disable the "utf-8" value in the source(...) function in R on Windows:

source(..., encoding = "utf-8")
-> error: "utf-8" does not work right on Windows.
-> (or, at least) warning: "utf-8" is handled by "best fit" on Windows and some characters in string literals may be automatically changed.

Because, in its current state, the UTF-8 encoding of R source files on Windows is a fake Unicode, as it can handle only 256 different ANSI characters in reality.

Thanks,
Tomas

On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil wrote:
> For me, this would be a perfect solution.
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
For me, this would be a perfect solution.

I.e., do not use the "best fit" and leave it to the user's competence:
a) in some functions, utf-8 works
b) in others -> an error is thrown (e.g., incomplete string, NA, etc.)
=> the user has to change the code with his/her intentional "best fit string literal substitute" or use another function that can handle utf-8.

Making R code work right only on some platforms / trying to keep backward compatibility meaning "the code does not do what you want and the behaviour differs depending on every locale, but at least it does not throw an error" is generally not a good idea - it is dangerous. Users / coders should know that there is something wrong with their strings and that some characters are "eaten alive".

Tomas

On Thu, 11 Apr 2019 at 8:26, Tomas Kalibera wrote:
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 6:32 PM, Jeroen Ooms wrote:
> The original problem that I can reproduce in RGui is that if you enter
> "ř" in RGui, R opportunistically converts this to latin1, because it
> can. However if you enter text which can definitely not be represented
> in latin1, R encodes the string correctly in UTF-8 form.

Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs to convert the input to the native encoding before passing it to R, which is based on locales. However, that string is passed by R to the parser, which Rgui takes advantage of: it converts non-representable characters to their \u escapes, which are understood by the parser. Using this trick, Unicode characters can get to the parser from Rgui (but of course then still at risk of conversion later when the program runs). Rgui only escapes characters that cannot be represented; unfortunately, the standard C99 API for that, as implemented on Windows, does the best fit. This could be fixed in Rgui by calling a special Windows API function, but with the mentioned risk that it would break existing uses that depend on the existing behavior.

This is the only place I know of where removing best fit would lead to correct representation of UTF-8 characters. Other places will give NA or some other escapes, or code will fail to parse (e.g. "incomplete string"; one can get that easily with source()).

Tomas
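The Rgui trick described above - escaping only those characters that do not survive conversion to the native encoding, instead of letting a "best fit" substitution pick a lookalike - can be sketched compactly. The following is a Python illustration only, not R's actual implementation, and the function name is invented:

```python
def escape_unrepresentable(text, native_encoding="latin-1"):
    # Pass through characters representable in the native encoding;
    # turn everything else into a parser-style \uXXXX escape instead
    # of letting a "best fit" substitution pick a lookalike.
    out = []
    for ch in text:
        try:
            ch.encode(native_encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("\\u{:04x}".format(ord(ch)))
    return "".join(out)

print(escape_unrepresentable("\u0159"))  # ř is not in latin1, so it is escaped
```

Representable characters (including accented latin1 letters like é) pass through unchanged; only the non-representable ones become escapes that the parser can later turn back into the intended characters.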
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 6:13 PM, Tomáš Bořil wrote:
> An optional parameter to source() function which would translate all
> UTF-8 characters in string literals to their "\U" codes sounds as a
> great idea (and I hope it would fix 99.9% of problems I have - because
> that is the way I overcome these problems nowadays) - and the same
> behaviour in command line...

I was not suggesting to convert to \U in source(). Some users do it in their programs by hand or with an external utility. source() in principle could be made to work similarly to eval(parse(file, encoding=)) with respect to encodings, via other means; we will consider that, but there are many remaining places where the conversion happens - a trivial one is that currently you cannot print the result of the parse() from your example properly. Maybe you don't trigger such problems in your scripts in obvious ways, but as I said before, if you want to work reliably with characters not representable in the current native encoding, in the current or a near version of R, use Linux or macOS.

Tomas
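The workaround mentioned above - rewriting string literals as \U escapes by hand or with an external utility, so the script itself is pure ASCII and survives any native-encoding conversion - can be sketched as follows. This is a Python illustration with an invented function name; R accepts both the zero-padded \UXXXXXXXX form emitted here and shorter forms such as \U00159:

```python
def to_u_escapes(text):
    # Rewrite every non-ASCII character as an R-style \UXXXXXXXX escape
    # so the script itself is pure ASCII and survives any conversion.
    return "".join(
        ch if ord(ch) < 128 else "\\U{:08x}".format(ord(ch))
        for ch in text
    )

escaped = to_u_escapes("\u0159")  # ř becomes \U00000159
# The escape is lossless: decoding it recovers the original character.
roundtrip = escaped.encode("ascii").decode("unicode_escape")
```

Unlike the best-fit conversion, this round-trips exactly: the escaped form decodes back to the original character.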
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 10/04/2019 12:32 p.m., Jeroen Ooms wrote:
> The original problem that I can reproduce in RGui is that if you enter
> "ř" in RGui, R opportunistically converts this to latin1, because it
> can. However if you enter text which can definitely not be represented
> in latin1, R encodes the string correctly in UTF-8 form.

I think the pathways for text in RGui and text being sourced are different. I agree fixing RGui in that way would make sense, but Yihui was talking about source().

Duncan Murdoch
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch wrote:
>
> On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> > Since it is "technically easy" to disable the best fit conversion and
> > the best fit is rarely good, how about providing an option for
> > code/package authors to disable it? I'm asking because this is one of
> > the most painful issues in packages that may need to source() code
> > containing UTF-8 characters that are not representable in the Windows
> > native encoding. Examples include knitr/rmarkdown and shiny. Basically
> > users won't be able to knit documents or run Shiny apps correctly when
> > the code contains characters that cannot be represented in the native
> > encoding.
>
> Wouldn't things be worse with it disabled than currently? I'd expect
> the line containing the "ř" to end up as NA instead of converting to "r".

I don't think it would be worse, because in this case R would not implicitly convert strings to (best fit) latin1 on Windows, but instead keep the (correct) string in its UTF-8 encoding. The NA only appears if the user explicitly forces a conversion to latin1, which is not the problem here I think.

The original problem that I can reproduce in RGui is that if you enter "ř" in RGui, R opportunistically converts this to latin1, because it can. However if you enter text which can definitely not be represented in latin1, R encodes the string correctly in UTF-8 form.
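Duncan's concern - NA instead of a substituted "r" - can be illustrated outside R. In Python (used here only as a portable illustration; Python's codecs never do a Windows-style best fit), a strict conversion to latin1 fails outright, which is the analogue of iconv() returning NA on macOS, and even a lossy conversion yields "?" rather than "r":

```python
x = "\u0159"  # ř, not representable in latin1

# Strict conversion fails outright - the analogue of iconv() returning NA:
try:
    x.encode("latin-1")
    strict_result = x
except UnicodeEncodeError:
    strict_result = None  # plays the role of NA

# A lossy conversion substitutes a replacement character, which is
# still "?" and not the Windows best-fit "r":
lossy_result = x.encode("latin-1", errors="replace").decode("latin-1")
```

So without best fit the choices are an explicit failure or an obviously wrong replacement character; only the Windows best-fit tables produce the silent "r".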
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Yes, again in a script sourced by source(encoding = ...). But also by typing it directly in the R console.

Most of the time, I use RStudio as a front-end. For this experiment, I also verified it in Rgui. In both front-ends, it behaves completely the same way.

An optional parameter to the source() function which would translate all UTF-8 characters in string literals to their "\U" codes sounds like a great idea (and I hope it would fix 99.9% of the problems I have - because that is the way I overcome these problems nowadays) - and the same behaviour in the command line...

Tomas

> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?
>> The only problem is that I
>> cannot simply use enc2utf8("œ") - it is converted to "o" before
>> executing the function. Instead of that, I have to explicitly type
>> "\U00159" throughout my code.

On Wed, Apr 10, 2019 at 5:29 PM Tomas Kalibera wrote:
>
> On 4/10/19 3:02 PM, Tomáš Bořil wrote:
> > The thing is, I would rather prefer R (in those rare occasions where
> > an old function does not support anything but ANSI encoding) throwing
> > an error: "Unicode encoding not supported, please change the string
> > in your code" instead of silently converting some characters to
> > different ones without any warning.
> In principle it probably could be optional as Yihui Xie asks on R-devel,
> we will discuss that internally. If the Windows "best fit" is a big
> problem on its own, this is something that could be done quickly, if
> optional. We could turn into error only conversions that we have control
> of (inside R code), indeed, but that should be most.
> > I understand that there are some functions which are not
> > Unicode-compatible yet but according to the Stackoverflow discussion I
> > cited before, in many cases (90% or more?) everything works right with
> > Encoding("\U00159") == "UTF-8" (in my scripts, I have not found any
> > problem with explicit UTF-8 coding yet).
> Well there has been a lot of effort invested to make that possible, so
> that many internal string functions do not convert unnecessarily into
> UTF-8, mostly by Duncan Murdoch, but much more needs to be done and
> there is the problem with packages. Of course if you find a concrete R
> function that unnecessarily converts (source() is debatable, I know
> about it, so some other), you are welcome to report it, and I or someone
> can fix it. A common problem is I/O (connections) and there the fix
> won't be easy, it would have to be re-designed. The problem is that when
> we have something typed "char *" inside R, it needs to be always in
> native encoding, any mix would lead to total chaos.
>
> The full solution would however only be fully switching to UTF-8
> internally on Windows (and then char * would always mean UTF-8), we have
> discussed this many times inside R Core (and many times before I
> joined), I am sure it will be discussed again at some point and we are
> aware of course of the problem. Please trust us it is hard to do - we
> know the code as we (collectively) have written it. People contributing
> to SO are users and package developers, not developers of the core. You
> can get more correct information from people on R-devel (package
> developers and sometimes core developers).
>
> > The only problem is that I
> > cannot simply use enc2utf8("œ") - it is converted to "o" before
> > executing the function. Instead of that, I have to explicitly type
> > "\U00159" throughout my code.
>
> What do you mean it is "converted before"? Under what context? Again a
> script sourced by source(encoding=) ?
>
> And, are you using Rgui as front-end?
>
> > In my lectures, I have Czech, Russian and English students and it is
> > also impossible to create a script that works for everyone. In fact, I
> > know that Czech "ř" can be translated to my native (Czech) encoding. I
> > have just chosen the example as it is reproducible in the English
> > locale.
> >
> > Originally, I had a problem with the IPA character (phonetic symbol)
> > "œ", i.e. "\U00153". In Czech locale, it is translated to "o". In
> > English, it is not converted - it remains "œ". But if I use "\U00153"
> > in Czech locale, nothing is converted and everything works right.
>
> Yes, the \u* sequence, I hear, is commonly used to represent UTF-8
> string literals in something that is not UTF-8 itself. Note if you have
> a package, you can have R source files with UTF-8 encoded literal
> strings if you declare Encoding: UTF-8 in the DESCRIPTION file (see
> Writing R Extensions for details), even though sometimes people run into
> trouble/bugs as well.
>
> You probably know none of these problems exist on Linux nor macOS, where
> UTF-8 is the native encoding.
>
> Tomas
>
> > Tomas
> >
> > On Wed, Apr 10, 2019 at 2:37 PM Tomas Kalibera wrote:
> >> On 4/10/19 2:06 PM, Tomáš Bořil wrote:
> >>
> >> Thank you for the explanation but I just do not
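Tomas Kalibera mentions above that a package may keep UTF-8 string literals in its R source files if the DESCRIPTION file declares the encoding. A minimal sketch of such a declaration follows; the package name, version and title are invented placeholders, and Writing R Extensions is the authoritative reference for the field:

```
Package: examplepkg
Version: 0.1.0
Title: Example Package with UTF-8 Source Files
Encoding: UTF-8
```

With this field present, the DESCRIPTION file and the package's R source files are read as UTF-8 rather than in the installing user's native encoding.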
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 10/04/2019 10:29 a.m., Yihui Xie wrote:
> Since it is "technically easy" to disable the best fit conversion and
> the best fit is rarely good, how about providing an option for
> code/package authors to disable it? I'm asking because this is one of
> the most painful issues in packages that may need to source() code
> containing UTF-8 characters that are not representable in the Windows
> native encoding. Examples include knitr/rmarkdown and shiny. Basically
> users won't be able to knit documents or run Shiny apps correctly when
> the code contains characters that cannot be represented in the native
> encoding.

Wouldn't things be worse with it disabled than currently? I'd expect the
line containing the "ř" to end up as NA instead of converting to "r".

Of course, it would be best to be able to declare source files as UTF-8
and avoid any conversion at all, but as Tomas said, that's a lot harder.

Duncan Murdoch

> Regards,
> Yihui
> --
> https://yihui.name

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
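Duncan's NA-versus-"r" point can be illustrated outside R. The sketch below is Python rather than R, only so it runs anywhere; the real best-fit tables live in the Windows WideCharToMultiByte API, and the decompose-and-strip transliteration here is just a crude stand-in for them. It contrasts a strict conversion, which fails the way iconv() returns NA on macOS, with a best-fit-style conversion that silently degrades "ř" to "r":

```python
import unicodedata

def strict_to_latin1(s):
    # Strict conversion: refuse unmappable characters,
    # analogous to iconv() returning NA.
    try:
        return s.encode("latin-1").decode("latin-1")
    except UnicodeEncodeError:
        return None  # stands in for R's NA

def best_fit_to_latin1(s):
    # Crude "best fit" stand-in: decompose, drop combining marks
    # (so the caron on "ř" disappears), then encode.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.encode("latin-1", "replace").decode("latin-1")

print(strict_to_latin1("\u0159"))    # None -- cannot be represented
print(best_fit_to_latin1("\u0159"))  # r    -- caron silently dropped
print(strict_to_latin1("caf\u00e9")) # café -- é is in latin-1, both work
```

Disabling best fit would move the behavior from the second function to the first: an error or NA instead of a silently altered string literal, which is exactly the trade-off discussed in this thread.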
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
Since it is "technically easy" to disable the best fit conversion and
the best fit is rarely good, how about providing an option for
code/package authors to disable it? I'm asking because this is one of
the most painful issues in packages that may need to source() code
containing UTF-8 characters that are not representable in the Windows
native encoding. Examples include knitr/rmarkdown and shiny. Basically
users won't be able to knit documents or run Shiny apps correctly when
the code contains characters that cannot be represented in the native
encoding.

Regards,
Yihui
--
https://yihui.name

On Wed, Apr 10, 2019 at 6:36 AM Tomas Kalibera wrote:
>
> win_iconv just calls into Windows API to do the conversion, it is
> technically easy to disable the "best fit" conversion, but I think it
> won't be a good idea. In some cases, perhaps rare, the best fit is good,
> actually including the conversion from "ř" to "r" which makes perfect
> sense. But more importantly, changing the behavior could affect users
> who expect the substitution to happen because it has been happening for
> many years, and it won't help others much.
>
> Tomas
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 1:14 PM, Jeroen Ooms wrote:
> On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil wrote:
>> Minimalistic example:
>> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>>> "ř"
>> [1] "r"
>>
>> Although the script is in UTF-8, the characters are replaced by
>> "simplified" substitutes uncontrollably (depending on OS locale). The
>> same goes with simply entering the code statements in R Console.
>>
>> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)
> I think this is a "feature" of win_iconv that is bundled with base R
> on Windows (./src/extra/win_iconv). The character from your example is
> not part of the latin1 (iso-8859-1) set, however, win-iconv seems to
> do so anyway:
>
>> x <- "\U00159"
>> print(x)
> [1] "ř"
>> iconv(x, 'UTF-8', 'iso-8859-1')
> [1] "r"
>
> On MacOS, iconv tells us this character cannot be represented as latin1:
>
>> x <- "\U00159"
>> print(x)
> [1] "ř"
>> iconv(x, 'UTF-8', 'iso-8859-1')
> [1] NA
>
> I'm actually not sure why base-R needs win_iconv (but I'm not an
> encoding expert at all). Perhaps we could try to unbundle it and use
> the standard libiconv provided by the Rtools toolchain bundle to get
> more consistent results.

win_iconv just calls into Windows API to do the conversion, it is
technically easy to disable the "best fit" conversion, but I think it
won't be a good idea. In some cases, perhaps rare, the best fit is good,
actually including the conversion from "ř" to "r" which makes perfect
sense. But more importantly, changing the behavior could affect users
who expect the substitution to happen because it has been happening for
many years, and it won't help others much.

Tomas
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil wrote:
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)

I think this is a "feature" of win_iconv that is bundled with base R
on Windows (./src/extra/win_iconv). The character from your example is
not part of the latin1 (iso-8859-1) set, however, win-iconv seems to
do so anyway:

> x <- "\U00159"
> print(x)
[1] "ř"
> iconv(x, 'UTF-8', 'iso-8859-1')
[1] "r"

On MacOS, iconv tells us this character cannot be represented as latin1:

> x <- "\U00159"
> print(x)
[1] "ř"
> iconv(x, 'UTF-8', 'iso-8859-1')
[1] NA

I'm actually not sure why base-R needs win_iconv (but I'm not an
encoding expert at all). Perhaps we could try to unbundle it and use
the standard libiconv provided by the Rtools toolchain bundle to get
more consistent results.
Re: [Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones
On 4/10/19 10:22 AM, Tomáš Bořil wrote:
> Hello,
>
> There is a long-lasting problem with processing UTF-8 source code in R
> on Windows OS. As Windows do not have "UTF-8" locale and R passes
> source code through OS before executing it, some characters are
> "simplified" by the OS before processing, leading to undesirable
> changes.
>
> Minimalistic example:
> Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:
>> "ř"
> [1] "r"
>
> Let's assume the following script:
> # file [script.R]
> if ("ř" != "\U00159") {
>   stop("Problem: Unexpected character conversion.")
> } else {
>   cat("o.k.\n")
> }
>
> Problem:
> source("script.R", encoding = "UTF-8")
>
> OK (see
> https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
> eval(parse("script.R", encoding = "UTF-8"))

On my system with your example,

> source("t.r")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
> source("/Users/tomas/t.r", encoding="UTF-8")
Error in eval(ei, envir) : Problem: Unexpected character conversion.
> eval(parse("t.r", encoding="UTF-8"))
o.k.

Which is expected, unfortunately. As per the documentation of ?source,
the "encoding" argument tells source() that the input is in UTF-8, so
that source() can convert it to the native encoding. Again as
documented, parse() uses its encoding argument to mark the encoding of
the strings, but it does not re-encode, and the character strings in the
parsed result will, as documented, have the encoding mark (UTF-8 in this
case).

> Although the script is in UTF-8, the characters are replaced by
> "simplified" substitutes uncontrollably (depending on OS locale). The
> same goes with simply entering the code statements in R Console.
>
> The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)

Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but it could
break existing applications that may depend on it, and it won't really
help, because such characters cannot be represented in the native
encoding anyway. You can find more in ?Encoding, but yes, it is a known
problem frequently encountered by users, and unless Windows starts
supporting UTF-8 as native encoding, there is no easy fix (a version
from the Windows 10 Insider preview supports it, so maybe that is not
completely hopeless). In theory you can carefully read the documentation
and use only functions that can work with UTF-8 without converting to
native encoding, but pragmatically, if you want to work with UTF-8 files
in R, it is best to use a non-Windows platform.

Best
Tomas

> Best regards
> Tomas Boril
>
>> R.version
>                _
> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status         alpha
> major          3
> minor          6.0
> year           2019
> month          04
> day            07
> svn rev        76333
> language       R
> version.string R version 3.6.0 alpha (2019-04-07 r76333)
> nickname
>
>> Sys.getlocale()
> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
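The source()-versus-parse() distinction Tomas describes — re-encode to the native encoding before parsing, versus keep the text and merely mark it as UTF-8 — can be modeled in a few lines. This is a Python sketch of the data flow only, not of R's implementation: `NATIVE = "cp1252"` is an assumed Windows-1252 locale matching the Sys.getlocale() output above, and Python substitutes "?" where Windows' best fit would pick "r".

```python
# Assumed native encoding, as in the English_United States.1252 locale
# shown in the bug report.
NATIVE = "cp1252"

def source_like(utf8_bytes):
    # source(..., encoding = "UTF-8") analogue: translate the whole
    # input into the native encoding first; unmappable characters
    # degrade. (Python's "replace" yields "?"; Windows best fit
    # would yield "r" for "ř".)
    text = utf8_bytes.decode("utf-8")
    return text.encode(NATIVE, errors="replace").decode(NATIVE)

def parse_like(utf8_bytes):
    # parse(..., encoding = "UTF-8") analogue: no re-encoding, the
    # decoded text is kept and simply "marked" as UTF-8.
    return utf8_bytes.decode("utf-8")

script = '"\u0159"'.encode("utf-8")   # the literal "ř" from script.R
print(source_like(script))  # '"?"' -- lost in the native encoding
print(parse_like(script))   # '"ř"' -- preserved
```

This is why eval(parse("script.R", encoding = "UTF-8")) prints "o.k." in the transcript above while source() does not: only the first path round-trips the literal through the native encoding.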