Re: sending/receiving UTF-8 characters from terminal to program
It seems that all my previously sent emails to say thanks got lost, so I'll give it another try: thanks to all who got involved and helped me! Problem is solved :)

Best regards,
r0ller
Re: sending/receiving UTF-8 characters from terminal to program
Hi All,

That did the trick. Thanks to everyone!

BR,
r0ller

PS: I replied already yesterday but that seems to have got lost somehow.
Re: sending/receiving UTF-8 characters from terminal to program
Guys, thanks to everyone! Problem solved :D I'm really grateful to all of you.

Best regards,
r0ller
Re: sending/receiving UTF-8 characters from terminal to program
On Fri, 20 Jan 2023, Thomas Dickey wrote:

> > > May be you need to specify -u8 option or utf8 resource?
> >
> > That would work. So would running uxterm instead of xterm, but, all of
> > these mess up command-line editing: Alt+key is converted into a char.
> > code instead of an ESC+key sequence.
>
> perhaps you're referring to eightBitInput (see manpage)

That seems to get set when running as uxterm/UTF-8 locale. However, `XTerm*metaSendsEscape: true' (which I have set for XTerm, but not for UXTerm) fixes things right back, so it's all OK :)

> >    XTerm*locale: true
>
> that's redundant, since the default "medium" will give the same effect :-)

Great! That's something I didn't know. Thanks, Tom.

Cheers,
-RVP
Re: sending/receiving UTF-8 characters from terminal to program
On Fri, Jan 20, 2023 at 11:26:49PM +, RVP wrote:
> On Fri, 20 Jan 2023, Valery Ushakov wrote:
> 
> > On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:
> > 
> > > Well, checking what printf results in, I get:
> > > 
> > > $ printf 'n?z'|hexdump -C
> > > 6e e9 7a  |n.z|
> > > 0003
> > > $ printf $'n\uE9z'|hexdump -C
> > > 6e c3 a9 7a  |n..z|
> > > 0004
> > > 
> > > It's definitely different from what you got for 'n?z'. What does
> > > that mean?
> > 
> > In the second example you specify \uE9 which is the unicode code point
> > for e with acute. It is then unconditionally converted by printf to
> > UTF-8 (which is two bytes: 0xc3 0xa9) on output.
> > 
> > Your terminal input is in 8859-1 it seems.
> 
> That's it. The terminal emulator is not generating UTF-8 from the
> keyboard input.
> 
> > May be you need to specify -u8 option or utf8 resource?
> 
> That would work. So would running uxterm instead of xterm, but, all of
> these mess up command-line editing: Alt+key is converted into a char.
> code instead of an ESC+key sequence.

perhaps you're referring to eightBitInput (see manpage)

> R0ller, do this:
> 
> 1. Add your locale settings in ~/.xinitrc (or ~/.xsession if using xdm):
> 
>    export LANG=hu_HU.UTF-8
>    export LC_CTYPE=hu_HU.UTF-8
>    export LC_MESSAGES=hu_HU.UTF-8

R0ller wasn't clear about whether this was done (outside the terminal). Actually, R0ller didn't mention whether the terminal was in the graphical environment or on the console (from the comments, I assumed the latter).

> 2. In ~/.Xresources, tell xterm to use the current locale when generating
>    chars.:
> 
>    XTerm*locale: true

that's redundant, since the default "medium" will give the same effect :-)

>    The `-lc' option does the same thing. If using uxterm, the class-name
>    becomes `UXTerm'.
> 
> On Fri, 20 Jan 2023, Robert Elz wrote:
> 
> > I believe bash will take your current locale into account
> > when doing that [...]
> 
> That's correct. But as r0ller had a UTF-8 locale set, I didn't mention that.
> However, it is better to be precise, so thank you!
> 
> -RVP

-- 
Thomas E. Dickey
https://invisible-island.net
Re: sending/receiving UTF-8 characters from terminal to program
On Fri, 20 Jan 2023, Valery Ushakov wrote:

> On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:
>
> > Well, checking what printf results in, I get:
> >
> > $ printf 'n?z'|hexdump -C
> > 6e e9 7a  |n.z|
> > 0003
> > $ printf $'n\uE9z'|hexdump -C
> > 6e c3 a9 7a  |n..z|
> > 0004
> >
> > It's definitely different from what you got for 'n?z'. What does
> > that mean?
>
> In the second example you specify \uE9 which is the unicode code point
> for e with acute. It is then unconditionally converted by printf to
> UTF-8 (which is two bytes: 0xc3 0xa9) on output.
>
> Your terminal input is in 8859-1 it seems.

That's it. The terminal emulator is not generating UTF-8 from the keyboard input.

> May be you need to specify -u8 option or utf8 resource?

That would work. So would running uxterm instead of xterm, but, all of these mess up command-line editing: Alt+key is converted into a char. code instead of an ESC+key sequence.

R0ller, do this:

1. Add your locale settings in ~/.xinitrc (or ~/.xsession if using xdm):

   export LANG=hu_HU.UTF-8
   export LC_CTYPE=hu_HU.UTF-8
   export LC_MESSAGES=hu_HU.UTF-8

2. In ~/.Xresources, tell xterm to use the current locale when generating
   chars.:

   XTerm*locale: true

   The `-lc' option does the same thing. If using uxterm, the class-name
   becomes `UXTerm'.

On Fri, 20 Jan 2023, Robert Elz wrote:

> I believe bash will take your current locale into account
> when doing that [...]

That's correct. But as r0ller had a UTF-8 locale set, I didn't mention that. However, it is better to be precise, so thank you!

-RVP
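RVP's diagnosis — that the terminal itself is sending Latin-1 bytes — can be checked directly by capturing what the keyboard generates. This is a sketch using od(1) rather than hexdump, in case the latter isn't installed:

```shell
# Interactive check: run `cat | od -An -tx1`, type the accented key,
# press Enter, then Ctrl-D. Compare what appears against these two
# reference sequences, printed non-interactively:
printf '\303\251' | od -An -tx1   # UTF-8 e-acute:   c3 a9
printf '\351'     | od -An -tx1   # Latin-1 e-acute: e9
```

If the interactive capture shows `e9`, the emulator is generating Latin-1 regardless of the program's locale, which is exactly the xterm situation described above.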
Re: sending/receiving UTF-8 characters from terminal to program
On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:

> Well, checking what printf results in, I get:
>
> $ printf 'n?z'|hexdump -C
> 6e e9 7a  |n.z|
> 0003
> $ printf $'n\uE9z'|hexdump -C
> 6e c3 a9 7a  |n..z|
> 0004
>
> It's definitely different from what you got for 'n?z'. What does
> that mean?

In the second example you specify \uE9 which is the unicode code point for e with acute. It is then unconditionally converted by printf to UTF-8 (which is two bytes: 0xc3 0xa9) on output.

Your terminal input is in 8859-1 it seems. 0xe9 in the first example is "LATIN SMALL LETTER E WITH ACUTE", that is unicode code point \u00E9, which is encoded in latin-1 as 0xE9. So your terminal inserted 0xe9 when you pressed that key.

May be you need to specify -u8 option or utf8 resource? (I'm mostly using netbsd headless, so I haven't been following the current status of utf8 support in X).

-uwe
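The two facts above — U+00E9 is the single byte 0xE9 in Latin-1 but the pair 0xC3 0xA9 in UTF-8 — follow from the UTF-8 encoding rule for code points in the U+0080..U+07FF range. A sketch of that arithmetic in plain shell:

```shell
# Two-byte UTF-8 encoding: 110xxxxx 10xxxxxx, filled from the code point's bits.
cp=$((0xE9))                      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
b1=$(( 0xC0 | (cp >> 6) ))        # lead byte: 0xC0 plus the top 5 bits
b2=$(( 0x80 | (cp & 0x3F) ))      # continuation byte: 0x80 plus the low 6 bits
printf 'U+%04X -> %02X %02X\n' "$cp" "$b1" "$b2"   # U+00E9 -> C3 A9
```

The Latin-1 encoding, by contrast, is just the low 8 bits of the code point, which is why the terminal emitted the single byte 0xE9.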
Re: sending/receiving UTF-8 characters from terminal to program
Well, checking what printf results in, I get:

$ printf 'néz'|hexdump -C
6e e9 7a  |n.z|
0003
$ printf $'n\uE9z'|hexdump -C
6e c3 a9 7a  |n..z|
0004

It's definitely different from what you got for 'néz'. What does that mean?

Thanks,
r0ller
Re: sending/receiving UTF-8 characters from terminal to program
Date:        Fri, 20 Jan 2023 08:55:45 + (UTC)
From:        RVP
Message-ID:  <4dd21c1f-f5c3-c3ba-96d8-cab73a0b...@sdf.org>

  | Both /bin/sh and bash output UTF-8 if given Unicode code-
  | points in the form `\u'. So,

I believe bash will take your current locale into account when doing that, whereas neither /bin/sh nor /usr/bin/printf do; they simply emit UTF-8 unconditionally.

This kind of difference is (partly) why POSIX is not including the \u (or \U) escape sequences in $'...' quoted strings in Issue 8. Another is how the end of the hex digits is detected: is it always exactly 4 hex digits (or 8 for \U), or any number up to 4 (or 8) if followed by a non-hex char, or as many hex chars as exist?

To be portable (as input), such a string needs to be exactly 4 (8) hex digits and be followed by something which is not a hex digit - the closing ' is often useful there; it can always be followed immediately by $' to resume quoting again (or just ' or " if those are adequate). But that's just the input: you also need to be using a locale with UTF-8 char encoding to get predictable output.

kre

  | $ printf 'néz' | hexdump -C
  | 6e c3 a9 7a  |n..z|
  | 0004
  | $ printf $'n\uE9z' | hexdump -C
  | 6e c3 a9 7a  |n..z|
  | 0004
  | $
  |
  | If that works, then check those UTF-8 bytes against whatever the
  | terminal emulator generated from your keystrokes for the `é'
  | in `néz'.
  |
  | -RVP
Re: sending/receiving UTF-8 characters from terminal to program
On Fri, 20 Jan 2023, r0ller wrote:

> Thanks for your efforts to reproduce it :) I just don't get why it works
> for you with the same locales and why it doesn't for me. Are there any
> other settings that affect encoding besides LC variables and LANG?

Since we seem to have the same flookup binary, check against the magyar.fst I used:

https://github.com/r0ller/alice/tree/master/hi_android/foma

Next check that the input you're feeding to flookup actually _is_ UTF-8. Both /bin/sh and bash output UTF-8 if given Unicode code-points in the form `\u'. So,

$ printf 'néz' | hexdump -C
6e c3 a9 7a  |n..z|
0004
$ printf $'n\uE9z' | hexdump -C
6e c3 a9 7a  |n..z|
0004
$

If that works, then check those UTF-8 bytes against whatever the terminal emulator generated from your keystrokes for the `é' in `néz'.

-RVP
Re: sending/receiving UTF-8 characters from terminal to program
Thanks for your efforts to reproduce it :) I just don't get why it works for you with the same locales and why it doesn't for me. Are there any other settings that affect encoding besides LC variables and LANG?

Best regards,
r0ller
Re: sending/receiving UTF-8 characters from terminal to program
Thanks! That's what I suspect as well, but I just can't figure out how to fix it.

Best regards,
r0ller
Re: sending/receiving UTF-8 characters from terminal to program
On Wed, 18 Jan 2023, r0ller wrote:

> Actually, it's just inconvenient to always type the strings what I want
> to analyse in a program, compile and execute it instead of giving it a
> go from the shell.

In a pinch, you can always use iconv(1) to do your conversions:

```
$ printf néz | hexdump -C    # locale is en_GB.UTF-8 ie. UTF-8
6e c3 a9 7a  |n..z|
0004
$ printf néz | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
6e e9 7a  |n.z|
0003
$
```

-RVP
Re: sending/receiving UTF-8 characters from terminal to program
On Thu, 19 Jan 2023, Rhialto wrote:

> I think there is some encoding confusion going on here. I still use
> boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
> word néz differently.
>
> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst
>
> The first case with 1 letter in the middle looking like an e + aigu, the
> second time as 2 characters, probably an utf-8 encoding.

Yeah, some kind of encoding mismatch is responsible. Everything worked for me because:

a) all the text I copy-pasted was UTF-8 (even the 2nd one, which, unsurprisingly, didn't work.)
b) flookup was OK with UTF-8 input (or my converted UTF-8 input matched the text encoding in magyar.fst)
c) text in magyar.fst was in UTF-8/Unicode (or, if another encoding, then flookup did the conversion before doing the text lookup.)

b) and c) are educated guesses.

-RVP
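The mismatch described above can be shown mechanically: the Latin-1 byte 0xE9 is an invalid sequence when read as UTF-8, so a strict UTF-8 reader rejects it outright, while the genuine UTF-8 pair converts cleanly. This sketch uses iconv as an assumed stand-in for any strict UTF-8 consumer (such as, presumably, flookup):

```shell
# Latin-1 'n<e-acute>z' (bytes 6e e9 7a) is not well-formed UTF-8: 0xE9 starts
# a three-byte sequence, but 'z' (0x7A) is not a continuation byte.
printf 'n\351z' | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1 \
    || echo 'not valid UTF-8'
# The genuine UTF-8 bytes (6e c3 a9 7a) convert without complaint:
printf 'n\303\251z' | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1 \
    && echo 'valid UTF-8'
```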
Re: sending/receiving UTF-8 characters from terminal to program
I think there is some encoding confusion going on here. I still use boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu, the second time as 2 characters, probably an utf-8 encoding.

I'm not sure if this helps you directly but it may be a hint for something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto
Re: sending/receiving UTF-8 characters from terminal to program
On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo néz|flookup magyar.fst
>
> I get a successful analysis:
>
> néz +swConsonant+néz[stem]+CON
> néz +swConsonant+néz[stem]+CON+Nom
> néz néz[stem]+Verb+IndefSg3

That should work--and it does. With a just-compiled flookup (from foma-0.9.18.tar.gz in the link you provided, and the .fst file got by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan 7 15:04:01 UTC 2023 mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +?
$
```

-RVP
sending/receiving UTF-8 characters from terminal to program
Hi All,

My locale is as follows:

export LANG="hu_HU.UTF-8"
export LC_CTYPE="hu_HU.UTF-8"
export LC_MESSAGES="hu_HU.UTF-8"

The problem is that even though I can type the special characters of the locale everywhere in X (though it's funny in the terminal, where they only appear after the third key press), when echoing a string containing such chars (áéíóöőúű) and piping it to a program, they arrive somehow differently compared to when the same string is passed by calling the same program. I don't have any other example than foma (https://fomafst.github.io). It has a tool called flookup which requires special morphological dictionaries which most probably no one uses here (except me), but when passing a string from the command line like:

echo néz|flookup magyar.fst

it results in:

néz +?

However, when passing the string as:

echo néz|flookup magyar.fst

I get a successful analysis:

néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

When calling the API function behind flookup from a program, passing the string 'néz', I also get the analysis successfully. I don't have a clear explanation for this (only partially), and I also wonder why the terminal does not translate the locale's special UTF-8 bytes back to a character when they're printed by the program. Actually, it's just inconvenient to always type the strings I want to analyse in a program, compile and execute it, instead of giving it a go from the shell.

Could anyone explain to me what happens here and how I can handle it?

Thanks,
r0ller
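Exports like the above only help if they actually reach the shell inside the terminal; locale(1) shows what is in effect (a diagnostic sketch; the exact charmap name is platform-dependent):

```shell
# Show every effective locale category; the LC_* lines should read hu_HU.UTF-8:
locale
# Print only the character encoding the current locale selects (e.g. UTF-8):
locale charmap
```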