Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
On Friday 20 January 2012, Simon Urbanek wrote:
> On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:
>> On Thursday 19 January 2012, peter dalgaard wrote:
>>> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
>>>> plain("Zürich")         ## works
>>>> plain("Z\u00BCrich")    ## fails
>>>> escaped("Zürich")       ## fails
>>>> escaped("Z\u00BCrich")  ## works
>>>
>>> Using the correct UTF-8 code helps quite a bit:
>>>
>>> U+00BC  ¼  c2 bc  VULGAR FRACTION ONE QUARTER
>>> U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS
>>
>> Thank you for pointing that out. How embarrassing - I systematically used
>> the wrong representations. Even worse, I didn't carefully read "Writing R
>> Extensions", which speaks of Unicode as \u escapes rather than UTF-8 as
>> \u escapes, so e.g. looking up the UTF-16 byte representations would have
>> done the trick.
>>
>> I didn't find a recommended method of replacing non-ASCII characters with
>> Unicode \u escape sequences and ended up using the Unix command line tool
>> iconv. However, the iconv version installed on my GNU/Linux machine
>> (openSUSE 11.4) seems to be outdated and doesn't support the very useful
>> --unicode-subst option yet. I installed libiconv from
>> http://www.gnu.org/software/libiconv/, and now I can easily replace all
>> non-ASCII characters in my UTF-8 encoded R files with:
>>
>> iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
>
> You can actually do that with R alone:
>
> ## you'll have to make sure that you're in C locale so R does the conversion for you
> Sys.setlocale(,"C")
>
> utf8conv <- function(conn) gsub("<U\\+([0-9A-F]{4})>","\\\\u\\1",capture.output(writeLines(readLines(conn,encoding="UTF-8"))))
>
> writeLines(utf8conv("test.txt"))
> M\u00F6gliche L\u00F6sung
> ne nebezpe\u010Dn\u00E9
>
> Cheers,
> Simon

Thanks for the above function (which I wouldn't have managed to construct, ever...). Maybe this is worth mentioning in the "Writing R Extensions" manual (next to where the \u Unicode escape sequences are mentioned).

Thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
On Thursday 19 January 2012, peter dalgaard wrote:
> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
>> plain("Zürich")         ## works
>> plain("Z\u00BCrich")    ## fails
>> escaped("Zürich")       ## fails
>> escaped("Z\u00BCrich")  ## works
>
> Using the correct UTF-8 code helps quite a bit:
>
> U+00BC  ¼  c2 bc  VULGAR FRACTION ONE QUARTER
> U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS

Thank you for pointing that out. How embarrassing - I systematically used the wrong representations. Even worse, I didn't carefully read "Writing R Extensions", which speaks of Unicode as \u escapes rather than UTF-8 as \u escapes, so e.g. looking up the UTF-16 byte representations would have done the trick.

I didn't find a recommended method of replacing non-ASCII characters with Unicode \u escape sequences and ended up using the Unix command line tool iconv. However, the iconv version installed on my GNU/Linux machine (openSUSE 11.4) seems to be outdated and doesn't support the very useful --unicode-subst option yet. I installed libiconv from http://www.gnu.org/software/libiconv/, and now I can easily replace all non-ASCII characters in my UTF-8 encoded R files with:

iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R

Thomas Zumbrunn
Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
> I installed libiconv from http://www.gnu.org/software/libiconv/, and now I
> can easily replace all non-ASCII characters in my UTF-8 encoded R files
> with:
>
> iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R

Maybe it would be possible to create an R package that exposes an R interface to libiconv, in a similar spirit to how the package XML interfaces to libxml2, RCurl to libcurl, etc.
Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:
> On Thursday 19 January 2012, peter dalgaard wrote:
>> On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
>>> plain("Zürich")         ## works
>>> plain("Z\u00BCrich")    ## fails
>>> escaped("Zürich")       ## fails
>>> escaped("Z\u00BCrich")  ## works
>>
>> Using the correct UTF-8 code helps quite a bit:
>>
>> U+00BC  ¼  c2 bc  VULGAR FRACTION ONE QUARTER
>> U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS
>
> Thank you for pointing that out. How embarrassing - I systematically used
> the wrong representations. Even worse, I didn't carefully read "Writing R
> Extensions", which speaks of Unicode as \u escapes rather than UTF-8 as
> \u escapes, so e.g. looking up the UTF-16 byte representations would have
> done the trick.
>
> I didn't find a recommended method of replacing non-ASCII characters with
> Unicode \u escape sequences and ended up using the Unix command line tool
> iconv. However, the iconv version installed on my GNU/Linux machine
> (openSUSE 11.4) seems to be outdated and doesn't support the very useful
> --unicode-subst option yet. I installed libiconv from
> http://www.gnu.org/software/libiconv/, and now I can easily replace all
> non-ASCII characters in my UTF-8 encoded R files with:
>
> iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R

You can actually do that with R alone:

## you'll have to make sure that you're in C locale so R does the conversion for you
Sys.setlocale(,"C")

utf8conv <- function(conn) gsub("<U\\+([0-9A-F]{4})>","\\\\u\\1",capture.output(writeLines(readLines(conn,encoding="UTF-8"))))

writeLines(utf8conv("test.txt"))
M\u00F6gliche L\u00F6sung
ne nebezpe\u010Dn\u00E9

Cheers,
Simon
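For readers who would rather not switch locales, the same escaping can be sketched by working on code points directly. This is an alternative illustration, not the function from the message above; the names `escape_non_ascii` and `escape_one` are hypothetical, and the sketch assumes the input contains no NA values.

```r
# Hypothetical helper: replace every non-ASCII character in a character
# vector with a \uxxxx (or \Uxxxxxxxx) escape built from its code point.
escape_non_ascii <- function(x) {
  escape_one <- function(s) {
    cps <- utf8ToInt(enc2utf8(s))             # code points of one string
    chars <- intToUtf8(cps, multiple = TRUE)  # one element per character
    wide <- cps > 0xFFFF                      # needs the 8-digit \U form
    narrow <- cps > 127L & !wide              # fits the 4-digit \u form
    chars[narrow] <- sprintf("\\u%04X", cps[narrow])
    chars[wide] <- sprintf("\\U%08X", cps[wide])
    paste(chars, collapse = "")
  }
  vapply(x, escape_one, character(1), USE.NAMES = FALSE)
}

# The input is written with escapes so the example does not depend on the
# encoding of this file.
escaped_lines <- escape_non_ascii(c("M\u00F6gliche L\u00F6sung",
                                    "ne nebezpe\u010Dn\u00E9"))
writeLines(escaped_lines)
```

To process a file, the result of `readLines(path, encoding = "UTF-8")` could be passed to the helper and written back with `writeLines()`. Unlike the `capture.output()` approach, this does not rely on how R prints non-representable characters in the C locale.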
Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
On Jan 19, 2012, at 7:27 PM, Jeroen Ooms wrote:
>> I installed libiconv from http://www.gnu.org/software/libiconv/, and now I
>> can easily replace all non-ASCII characters in my UTF-8 encoded R files
>> with:
>>
>> iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
>
> Maybe it would be possible to create an R package that exposes an R
> interface to libiconv, in a similar spirit to how the package XML
> interfaces to libxml2, RCurl to libcurl, etc.

Well, R does that - see ?iconv ...
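As a rough illustration of what ?iconv offers here, the sketch below converts a UTF-8 string to ASCII and uses the `sub` argument to mark what cannot be converted. The variable names are made up for the example, and the exact behaviour can depend on the platform's iconv implementation.

```r
# A UTF-8 string written with an escape so the example is encoding-safe.
x <- "Z\u00FCrich"

# sub = "byte" replaces each non-convertible byte with its hex code in
# angle brackets, which exposes the UTF-8 bytes of the character.
as_bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
print(as_bytes)
```

Note that this shows the UTF-8 bytes (c3 bc), not the code point needed for a \u escape. Recent versions of R document further `sub` values such as "Unicode" and "c99" that produce code-point based output; whether they are available depends on the installed R version, so check ?iconv before relying on them.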
Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments
On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
> plain("Zürich")         ## works
> plain("Z\u00BCrich")    ## fails
> escaped("Zürich")       ## fails
> escaped("Z\u00BCrich")  ## works

Using the correct UTF-8 code helps quite a bit:

U+00BC  ¼  c2 bc  VULGAR FRACTION ONE QUARTER
U+00FC  ü  c3 bc  LATIN SMALL LETTER U WITH DIAERESIS

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
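The two table rows can be reproduced in base R, which makes it easy to look up the right value for a \u escape. This is a small sketch that is not part of the original message, and the helper name `describe_char` is hypothetical.

```r
# Hypothetical helper: report the Unicode code point and the UTF-8 bytes
# of a single character.
describe_char <- function(ch) {
  ch <- enc2utf8(ch)
  list(
    code_point = sprintf("U+%04X", utf8ToInt(ch)),
    utf8_bytes = paste(as.character(charToRaw(ch)), collapse = " ")
  )
}

u_umlaut <- describe_char("\u00FC")  # LATIN SMALL LETTER U WITH DIAERESIS
quarter  <- describe_char("\u00BC")  # VULGAR FRACTION ONE QUARTER
print(u_umlaut)
print(quarter)
```

Both characters end in the byte bc in UTF-8, which is presumably how 00BC ended up in the escape. The \u escape takes the code point (00FC for ü), not any part of the byte sequence.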