Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-20 Thread Thomas Zumbrunn
On Friday 20 January 2012, Simon Urbanek wrote:
 On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:
  On Thursday 19 January 2012, peter dalgaard wrote:
  On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
   plain("Zürich")  ## works
   plain("Z\u00BCrich")  ## fails
   escaped("Zürich")  ## fails
   escaped("Z\u00BCrich")  ## works
  
  Using the correct UTF-8 code helps quite a bit:
  
  U+00BC ¼   c2 bc   VULGAR FRACTION ONE QUARTER
  U+00FC ü   c3 bc   LATIN SMALL LETTER U WITH DIAERESIS
  
  Thank you for pointing that out. How embarrassing - I systematically used
  the wrong representations. Even worse, I didn't carefully read "Writing
  R Extensions", which speaks of "Unicode as \u escapes" rather than
  "UTF-8 as \u escapes", so e.g. looking up the UTF-16 byte
  representations would have done the trick.
  
  I didn't find a recommended method of replacing non-ASCII characters with
  Unicode \u escape sequences and ended up using the Unix command line
  tool iconv. However, the iconv version installed on my GNU/Linux
  machine (openSUSE 11.4) seems to be outdated and doesn't support the
  very useful --unicode-subst option yet. I installed libiconv from
  http://www.gnu.org/software/libiconv/, and now I can easily replace all
  non-ASCII characters in my UTF-8 encoded R files with:
  
   iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
 
 You can actually do that with R alone:
 
 ## you'll have to make sure that you're in C locale so R does the conversion
 ## for you
 Sys.setlocale(, "C")
 
 utf8conv <- function(conn)
   gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1",
        capture.output(writeLines(readLines(conn, encoding = "UTF-8"))))
 
 > writeLines(utf8conv("test.txt"))
 M\u00F6gliche L\u00F6sung
 ne nebezpe\u010Dn\u00E9
 
 Cheers,
 Simon

Thanks for the above function (which I wouldn't have managed to construct, 
ever...). Maybe this is worth mentioning in the 
Writing R Extensions manual (next to where the \u Unicode escape 
sequences are mentioned).

Thomas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-19 Thread Thomas Zumbrunn
On Thursday 19 January 2012, peter dalgaard wrote:
 On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
  plain("Zürich")  ## works
  plain("Z\u00BCrich")  ## fails
  escaped("Zürich")  ## fails
  escaped("Z\u00BCrich")  ## works
 
 Using the correct UTF-8 code helps quite a bit:
 
 U+00BC   ¼   c2 bc   VULGAR FRACTION ONE QUARTER
 U+00FC   ü   c3 bc   LATIN SMALL LETTER U WITH DIAERESIS

Thank you for pointing that out. How embarrassing - I systematically used the
wrong representations. Even worse, I didn't carefully read "Writing R
Extensions", which speaks of "Unicode as \u escapes" rather than "UTF-8 as
\u escapes", so e.g. looking up the UTF-16 byte representations would have
done the trick.

I didn't find a recommended method of replacing non-ASCII characters with
Unicode \u escape sequences and ended up using the Unix command line tool
iconv. However, the iconv version installed on my GNU/Linux machine
(openSUSE 11.4) seems to be outdated and doesn't support the very useful
--unicode-subst option yet. I installed libiconv from
http://www.gnu.org/software/libiconv/, and now I can easily replace all
non-ASCII characters in my UTF-8 encoded R files with:

  iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R

Thomas Zumbrunn



Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-19 Thread Jeroen Ooms

 I installed libiconv from http://www.gnu.org/software/libiconv/, and
 now I can easily replace all non-ASCII characters in my UTF-8 encoded R
 files with: iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X"
 my-utf-8-encoded-file.R


Maybe it would be possible to create an R package that exposes an R
interface to libiconv, in a similar spirit to how the package XML
interfaces to libxml2 and RCurl to libcurl, etc.



Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-19 Thread Simon Urbanek

On Jan 19, 2012, at 6:39 PM, Thomas Zumbrunn wrote:

 On Thursday 19 January 2012, peter dalgaard wrote:
 On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:
  plain("Zürich")  ## works
  plain("Z\u00BCrich")  ## fails
  escaped("Zürich")  ## fails
  escaped("Z\u00BCrich")  ## works
 
 Using the correct UTF-8 code helps quite a bit:
 
 U+00BC   ¼   c2 bc   VULGAR FRACTION ONE QUARTER
 U+00FC   ü   c3 bc   LATIN SMALL LETTER U WITH DIAERESIS
 
 Thank you for pointing that out. How embarrassing - I systematically used the
 wrong representations. Even worse, I didn't carefully read "Writing R
 Extensions", which speaks of "Unicode as \u escapes" rather than "UTF-8 as
 \u escapes", so e.g. looking up the UTF-16 byte representations would have
 done the trick.
 
 I didn't find a recommended method of replacing non-ASCII characters with
 Unicode \u escape sequences and ended up using the Unix command line tool
 iconv. However, the iconv version installed on my GNU/Linux machine
 (openSUSE 11.4) seems to be outdated and doesn't support the very useful
 --unicode-subst option yet. I installed libiconv from
 http://www.gnu.org/software/libiconv/, and now I can easily replace all
 non-ASCII characters in my UTF-8 encoded R files with:
 
  iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X" my-utf-8-encoded-file.R
 


You can actually do that with R alone:

## you'll have to make sure that you're in C locale so R does the conversion
## for you
Sys.setlocale(, "C")

utf8conv <- function(conn)
  gsub("<U\\+([0-9A-F]{4})>", "\\\\u\\1",
       capture.output(writeLines(readLines(conn, encoding = "UTF-8"))))

> writeLines(utf8conv("test.txt"))
M\u00F6gliche L\u00F6sung
ne nebezpe\u010Dn\u00E9
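
The function above relies on the C locale to make R print <U+XXXX> markers. As a rough sketch of the same idea without changing the locale, the code points can be formatted directly. This is not from the thread: the helper name escape_non_ascii is hypothetical, and the code assumes valid, non-NA character input.

```r
## Hypothetical helper (not part of base R): replace every non-ASCII
## character with a \uXXXX escape, or \UXXXXXXXX above U+FFFF.
escape_non_ascii <- function(x) {
  one <- function(s) {
    cp  <- utf8ToInt(s)                     # integer code points of s
    ch  <- intToUtf8(cp, multiple = TRUE)   # one string per code point
    esc <- ifelse(cp > 0xFFFF,
                  sprintf("\\U%08X", cp),   # outside the BMP
                  sprintf("\\u%04X", cp))   # inside the BMP
    paste(ifelse(cp < 128L, ch, esc), collapse = "")
  }
  vapply(enc2utf8(x), one, character(1), USE.NAMES = FALSE)
}

writeLines(escape_non_ascii("M\u00F6gliche L\u00F6sung"))
```

Applied to a file, this would be something like escape_non_ascii(readLines("test.txt", encoding = "UTF-8")), with test.txt standing in for the file used above.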

Cheers,
Simon



Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-19 Thread Simon Urbanek

On Jan 19, 2012, at 7:27 PM, Jeroen Ooms wrote:

 
 I installed libiconv from http://www.gnu.org/software/libiconv/, and
 now I can easily replace all non-ASCII characters in my UTF-8 encoded R
 files with: iconv -f UTF-8 -t ASCII --unicode-subst="\u%04X"
 my-utf-8-encoded-file.R
 
 
 Maybe it would be possible to create an R package that exposes an R
 interface to libiconv, in a similar spirit to how the package XML
 interfaces to libxml2 and RCurl to libcurl, etc.
 

Well, R does that - see ?iconv ...
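
For illustration, a minimal sketch of doing the substitution with R's own iconv(). It assumes an R version in which iconv() accepts sub = "Unicode" (documented in ?iconv for more recent releases than the ones current at the time of this thread). The variable names are made up for the example.

```r
## Assumption: iconv() in this R version supports sub = "Unicode", which
## marks each non-convertible character as <U+xxxx>.
x <- enc2utf8("Z\u00FCrich")
marked <- iconv(x, from = "UTF-8", to = "ASCII", sub = "Unicode")

## Turn the <U+xxxx> markers into \uxxxx escapes.
esc <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", marked)
writeLines(esc)
```

Unlike the command line tool, this needs no separate libiconv installation, but the available sub options depend on the R version.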



Re: [Rd] use of UTF-8 \uxxxx escape sequences in function arguments

2012-01-18 Thread peter dalgaard

On Jan 18, 2012, at 23:54 , Thomas Zumbrunn wrote:

   plain("Zürich")  ## works
   plain("Z\u00BCrich")  ## fails
   escaped("Zürich")  ## fails
   escaped("Z\u00BCrich")  ## works

Using the correct UTF-8 code helps quite a bit:

U+00BC  ¼   c2 bc   VULGAR FRACTION ONE QUARTER
U+00FC  ü   c3 bc   LATIN SMALL LETTER U WITH DIAERESIS
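
The distinction in the table can be checked from within R. The following is only a sketch with made-up variable names. It shows that the \u escape takes the Unicode code point (U+00FC), not the UTF-8 bytes (c3 bc), and that the second byte bc is what made U+00BC look plausible.

```r
u_uml   <- enc2utf8("\u00FC")  # LATIN SMALL LETTER U WITH DIAERESIS
quarter <- enc2utf8("\u00BC")  # VULGAR FRACTION ONE QUARTER

utf8ToInt(u_uml)                    # code point as an integer: 252 (0xFC)
sprintf("U+%04X", utf8ToInt(u_uml)) # "U+00FC"
charToRaw(u_uml)                    # UTF-8 bytes: c3 bc
charToRaw(quarter)                  # UTF-8 bytes: c2 bc
```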

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com
