Re: [R] UTF-8 to the console

2022-10-13 Thread Tomas Kalibera



On 6/23/22 13:36, Ivan Krylov wrote:

> On Thu, 23 Jun 2022 12:26:23 +0200
> Helmut Schütz wrote:
>
> > txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
> > Encoding(txt) <- "UTF-8"
>
> There shouldn't be a need to change the encoding. If you're creating a
> Unicode literal, R should already choose UTF-8 for the resulting
> string. Either way, R automatically converts the strings from their
> source encoding on output.
>
> Moreover, `Encoding<-` doesn't perform any conversion, it only changes
> the declared encoding on the string, affecting the way it may be
> encoded or decoded in the future. If Encoding(txt) wasn't already UTF-8,
> you would likely be damaging the data:
>
> string <- 'Ы' # is already UTF-8
> # No conversion happens, the same bytes re-interpreted differently
> Encoding(string) <- 'latin1'
> string
> # [1] "Ð«"
>
> > R 4.2.0 on Windows 7
>
> On Windows 7, Rterm will stay limited to the OEM encoding, since UCRT
> only supports UTF-8 locales on Windows ≥ 10, version 1903. If your OEM
> encoding doesn't have the ≥, ≈ characters, printing them to the console
> is going to be hard. Not impossible -- e.g. an R extension written in C
> could obtain a handle to the current console and use Unicode-aware
> Windows API to print these characters -- but just getting it to work
> would be hard, and it would likely be unportable.
>
> > and Windows 11.
>
> I think it should be possible. What does system('chcp') say in your
> Rterm session?
>
> For console UTF-8 output to work, two things should happen:
>
> 1. The console must be using UTF-8, i.e. chcp must say it's using code
> page 65001.
>
> 2. Rterm must understand that and also use UTF-8 on output.
>
> What do sessionInfo() and l10n_info() say in your Rterm session on
> Windows 11? In Rterm source code, I see a check for GetACP() == 65001,
> which should have switched the console encoding to UTF-8 automatically.
>
> Perhaps you need to run chcp 65001 before starting Rterm? Maybe you
> need to set a checkbox [*] to make the ANSI codepage UTF-8 by default?
> I'm not sure any of this is going to work, but it's something to try
> before someone more knowledgeable with R on Windows can help you.


You are right, R already tries to set the console code page itself to 
65001, so chcp is no longer needed to change it.


On Windows 7 UTF-8 would not be used by R because it can't be the 
"system encoding" (ACP), and I suppose Helmut's output was from that 
version of Windows:


x = y, x = y; a ˜ b, a ˜ b

This suggests transliteration ("best-fit"), done by Windows, of the 
characters not representable in the session encoding. On Windows 7, 
characters not representable in the user locale encoding will not be 
usable in Rterm. There is no way around that, but one can, e.g., use Rgui.
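
One rough way to check in advance whether a string can be represented in a
given encoding is to try the conversion with iconv() and look for NA. The
sketch below is only an illustration: it assumes latin1 as a stand-in for
the session encoding, the helper name representable() is made up, and the
exact behavior of iconv() for unconvertible input can vary between
platforms.

```r
# Hypothetical helper: TRUE where x can be converted to `encoding`.
# On most platforms iconv() returns NA for a string that contains a
# character the target encoding does not have.
representable <- function(x, encoding = "latin1") {
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = encoding))
}

txt <- c("x \u2265 y", "a \u2248 b", "caf\u00e9")
representable(txt)
```

Passing encoding = "" instead would test against the encoding of the
current session rather than latin1.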


On newer versions of Windows, such as Windows 11, there was a bug in 
Rterm, as I reported in the other email; it is fixed now. The output on 
my Windows 10 was:


x  y, x ≥ y; a  b, a ≈ b

So the pasted characters were missing, but the characters expressed via 
\u escapes were printed correctly. This was a problem between the 
Windows console implementation used in cmd.exe/PowerShell and 
Rterm/getline. What is characteristic of these problems is that the 
behavior differs from when Rterm runs from the Windows Terminal 
application (or possibly Msys2 mintty/winpty/bash).


Best
Tomas





__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] UTF-8 to the console

2022-10-13 Thread Tomas Kalibera

Dear Helmut,

thanks for the report. This is actually a bug in Rterm (or Windows, hard 
to tell, but something that can be fixed in Rterm). More below.


On 6/23/22 12:26, Helmut Schütz wrote:

> Dear all,
>
> I want to send UTF-8 characters to the console. Font in the 
> GUI-Preference 'Lucida Console', supporting the desired symbols:
> greater than or equal: UTF-8 2265, HTML-entity  HTML-Unicode 
>  TeX \ge
> approximately equal: UTF-8 2248, HTML-entity  HTML-Unicode 
>  TeX \approx
>
> txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
> Encoding(txt) <- "UTF-8"
> print(txt)
> [1] "x = y, x = y; a \230 b, a \230 b"
> cat(txt, "\n")
> x = y, x = y; a ˜ b, a ˜ b
>
> Desired "x ≥ y, x ≥ y; a ≈ b, a ≈ b"
>
> I'm sending the email in UTF-8. Don’t know how @r-project.org is 
> configured (ASCII?) If you see garbage, I'm sorry but you should get 
> the idea.
>
> R 4.2.0 on Windows 7 (UCRT10.0.10240.16390) and Windows 11.


The underlying problem, which I can reproduce on my Windows 10 (and which 
is almost surely what you are seeing on Windows 11), is that the 
characters ≥ and ≈ cannot be pasted to Rterm when running in cmd.exe or 
PowerShell. Pasting these characters pastes nothing.


I've fixed this now in R-devel 83094 (and R-patched 83095). I would be 
grateful if you (or anyone else) could test, e.g. in R-patched. Most 
likely this example will work as it did for me, but please also try other 
examples you can think of. Processing the input keys in Rterm/getline is 
very tricky and brittle. What the code sees depends on what the console 
implementation decides to do, and it differs for different console 
implementations, and sadly this is not documented as far as I could find.


Now, the problem you reported does not happen in the Msys2/mintty (so 
Rtools42) terminal, because that terminal uses a different console 
implementation. Also, the problem doesn't happen with the Windows 
Terminal application, which has yet another implementation. If you 
ever need a work-around for such problems, I would recommend trying the 
Windows Terminal application.


The problem doesn't happen in Rgui, either, but that uses a completely 
different code path on the R end; indeed, it does not run Rterm.


There is a key combination "Alt+I" you can press in Rterm, which will 
switch to debug mode and will display the keyboard codes R receives (it 
matches the sources in getline.c). When one sees different behavior with 
different console implementations, as in your reported problem, it 
usually comes with different keyboard codes sent to R.


Your report has been very useful, thanks, and sorry for the long delay. 
I would have spotted it earlier on the R bugzilla (or the R-devel list).


Best
Tomas



> Helmut




Re: [R] UTF-8 to the console

2022-06-23 Thread Olivier Crouzet
Hi, 

from what I can tell, Unicode-related issues (in several programming
languages) are often specific to the MS Windows operating system rather
than to R (though that does not imply it is irrelevant here).

You may wish to have a look at:
https://blog.r-project.org/2020/05/02/utf-8-support-on-windows/

This may provide directions to solving your issue.

Yours.
Olivier.



On Thu, 23 Jun
2022 12:26:23 +0200 Helmut Schütz  wrote:

> Dear all,
> 
> I want to send UTF-8 characters to the console. Font in the 
> GUI-Preference 'Lucida Console', supporting the desired symbols:
> greater than or equal: UTF-8 2265, HTML-entity  HTML-Unicode
>  TeX \ge
> approximately equal: UTF-8 2248, HTML-entity  HTML-Unicode 
>  TeX \approx
> 
> txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
> Encoding(txt) <- "UTF-8"
> print(txt)
> [1] "x = y, x = y; a \230 b, a \230 b"
> cat(txt, "\n")
> x = y, x = y; a ˜ b, a ˜ b
> 
> Desired "x ≥ y, x ≥ y; a ≈ b, a ≈ b"
> 
> I'm sending the email in UTF-8. Don’t know how @r-project.org is 
> configured (ASCII?) If you see garbage, I'm sorry but you should get
> the idea.
> 
> R 4.2.0 on Windows 7 (UCRT10.0.10240.16390) and Windows 11.
> 
> Helmut
> -- 
> Ing. Helmut Schütz
> BEBAC – Consultancy Services for
> Bioequivalence and Bioavailability Studies
> Neubaugasse 36/11
> 1070 Vienna, Austria
> E helmut.schu...@bebac.at 
> 


-- 
  Olivier Crouzet, PhD
  http://olivier.ghostinthemachine.space
  /Maître de Conférences/
  @LLING - Laboratoire de Linguistique de Nantes
UMR6310 CNRS / Université de Nantes



Re: [R] UTF-8 to the console

2022-06-23 Thread Ivan Krylov
On Thu, 23 Jun 2022 12:26:23 +0200
Helmut Schütz  wrote:

> txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
> Encoding(txt) <- "UTF-8"

There shouldn't be a need to change the encoding. If you're creating a
Unicode literal, R should already choose UTF-8 for the resulting
string. Either way, R automatically converts the strings from their
source encoding on output.

Moreover, `Encoding<-` doesn't perform any conversion, it only changes
the declared encoding on the string, affecting the way it may be
encoded or decoded in the future. If Encoding(txt) wasn't already UTF-8,
you would likely be damaging the data:

string <- 'Ы' # is already UTF-8
# No conversion happens, the same bytes re-interpreted differently
Encoding(string) <- 'latin1'
string
# [1] "Ð«"
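
To make the difference between relabelling and converting concrete, here
is a small sketch (an illustration, with variable names chosen only for
this example). It inspects the raw bytes with charToRaw() and uses iconv()
for an actual conversion:

```r
s <- "\u042b"                      # Cyrillic letter from a Unicode escape
Encoding(s)                        # declared encoding of the string
charToRaw(s)                       # the bytes behind it

relabelled <- s
Encoding(relabelled) <- "latin1"   # changes the label only
identical(charToRaw(relabelled), charToRaw(s))   # bytes are untouched

# An actual conversion goes through iconv() (or enc2utf8()/enc2native()).
e_utf8   <- enc2utf8("\u00e9")     # e-acute, two bytes in UTF-8
e_latin1 <- iconv(e_utf8, from = "UTF-8", to = "latin1")
charToRaw(e_utf8)
charToRaw(e_latin1)                # a single byte after conversion
```

The relabelled string has the same two bytes as the original, which is why
it prints as two Latin-1 characters, whereas iconv() really changes the
bytes.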

> R 4.2.0 on Windows 7

On Windows 7, Rterm will stay limited to the OEM encoding, since UCRT
only supports UTF-8 locales on Windows ≥ 10, version 1903. If your OEM
encoding doesn't have the ≥, ≈ characters, printing them to the console
is going to be hard. Not impossible -- e.g. an R extension written in C
could obtain a handle to the current console and use Unicode-aware
Windows API to print these characters -- but just getting it to work
would be hard, and it would likely be unportable.

> and Windows 11.

I think it should be possible. What does system('chcp') say in your
Rterm session?

For console UTF-8 output to work, two things should happen:

1. The console must be using UTF-8, i.e. chcp must say it's using code
page 65001.

2. Rterm must understand that and also use UTF-8 on output.

What do sessionInfo() and l10n_info() say in your Rterm session on
Windows 11? In Rterm source code, I see a check for GetACP() == 65001,
which should have switched the console encoding to UTF-8 automatically.

Perhaps you need to run chcp 65001 before starting Rterm? Maybe you
need to set a checkbox [*] to make the ANSI codepage UTF-8 by default?
I'm not sure any of this is going to work, but it's something to try
before someone more knowledgeable with R on Windows can help you.
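
The checks asked for above could be collected in one go with something
like the following sketch. The function name console_encoding_report() is
hypothetical, and the "codepage" and "system.codepage" fields of
l10n_info() are only present on Windows (and, as far as I know, only in
reasonably recent versions of R), so treat this as a starting point
rather than a definitive diagnostic:

```r
# Hypothetical helper gathering the encoding-related facts in one list.
console_encoding_report <- function() {
  info <- l10n_info()
  report <- list(
    utf8_session = isTRUE(info[["UTF-8"]]),   # is the session in UTF-8?
    lc_ctype     = Sys.getlocale("LC_CTYPE")
  )
  if (.Platform$OS.type == "windows") {
    # Windows-only fields; assigning NULL simply leaves them out.
    report$codepage        <- info[["codepage"]]
    report$system_codepage <- info[["system.codepage"]]
    # What the console itself reports; 65001 would mean UTF-8.
    report$chcp <- tryCatch(system("chcp", intern = TRUE),
                            error = function(e) NA_character_)
  }
  report
}

str(console_encoding_report())
```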

-- 
Best regards,
Ivan

[*] https://superuser.com/a/1451686



Re: [R] UTF-8 to the console

2022-06-23 Thread Ebert,Timothy Aaron
print("a\u2248b")
gives the approximately-equal sign.

-----Original Message-----
From: R-help  On Behalf Of Ebert,Timothy Aaron
Sent: Thursday, June 23, 2022 6:48 AM
To: Helmut Schütz ; r-help@r-project.org
Subject: Re: [R] UTF-8 to the console

Wikipedia indicates that there are multiple flavors of UTF-8, but here is one 
solution.
Wikipedia lists Unicode characters: 
https://en.wikipedia.org/wiki/List_of_Unicode_characters
  Way towards the bottom of the article is a table of math symbols.
print("x\u2265y")
print("a\u223cy")
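
When picking an escape from such a table, it can help to double-check
which code point a character actually has. A small sketch (the helper name
code_point() is made up for this example):

```r
# Hypothetical helper: format the code point of a single character.
code_point <- function(ch) sprintf("U+%04X", utf8ToInt(enc2utf8(ch)))

code_point("\u2265")   # greater-than or equal to
code_point("\u2248")   # almost equal to
code_point("\u223c")   # tilde operator, a different character from U+2248

# Going the other way: build the characters from their code points.
intToUtf8(c(0x2265, 0x2248), multiple = TRUE)
```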


Tim

-----Original Message-----
From: R-help  On Behalf Of Helmut Schütz
Sent: Thursday, June 23, 2022 6:26 AM
To: r-help@r-project.org
Subject: [R] UTF-8 to the console

Dear all,

I want to send UTF-8 characters to the console. Font in the GUI-Preference 
'Lucida Console', supporting the desired symbols:
greater than or equal: UTF-8 2265, HTML-entity  HTML-Unicode  TeX 
\ge approximately equal: UTF-8 2248, HTML-entity  HTML-Unicode  
TeX \approx

txt <- "x ≥ y, x \u2265 y; a ≈ b, a \u2248 b"
Encoding(txt) <- "UTF-8"
print(txt)
[1] "x = y, x = y; a \230 b, a \230 b"
cat(txt, "\n")
x = y, x = y; a ˜ b, a ˜ b

Desired "x ≥ y, x ≥ y; a ≈ b, a ≈ b"

I'm sending the email in UTF-8. Don’t know how @r-project.org is configured 
(ASCII?) If you see garbage, I'm sorry but you should get the idea.

R 4.2.0 on Windows 7 (UCRT10.0.10240.16390) and Windows 11.

Helmut
--
Ing. Helmut Schütz
BEBAC – Consultancy Services for
Bioequivalence and Bioavailability Studies Neubaugasse 36/11
1070 Vienna, Austria
E helmut.schu...@bebac.at <mailto:helmut.schu...@bebac.at>
