Re: sending/receiving UTF-8 characters from terminal to program

2023-01-23 Thread r0ller
It seems that all my previous emails saying thanks got lost, so I'll
give it another try: thanks to all who got involved and helped me!
Problem is solved :)


Best regards,

r0ller

On 1/21/23 12:26 AM, RVP wrote:

[...]


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-22 Thread r0ller

Hi All,

That did the trick. Thanks to everyone!

BR,
r0ller

PS: I already replied yesterday but that seems to have got lost somehow.

On 1/21/23 12:26 AM, RVP wrote:

[...]


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-21 Thread r0ller
Guys, thanks to everyone! Problem solved :D I'm really grateful to all
of you.


Best regards,
r0ller

On 1/21/23 12:26 AM, RVP wrote:

[...]


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-21 Thread RVP

On Fri, 20 Jan 2023, Thomas Dickey wrote:


Maybe you need to specify the -u8 option or the utf8 resource?



That would work. So would running uxterm instead of xterm, but all of
these mess up command-line editing: Alt+key is converted into a character
code instead of an ESC+key sequence.


perhaps you're referring to eightBitInput (see manpage)



That seems to get set when running as uxterm/UTF-8 locale. However,
`XTerm*metaSendsEscape: true' (which I have set for XTerm, but not for
UXTerm) fixes things right back, so it's all OK :)


XTerm*locale: true


that's redundant, since the default "medium" will give the same effect :-)



Great! That's something I didn't know. Thanks, Tom.
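
Pulling this sub-thread together, a minimal ~/.Xresources sketch of the
settings discussed above (an illustration, not a verbatim copy of anyone's
actual configuration; `XTerm*locale: true' is omitted since, as noted
above, the default already gives the same effect):

```
! Keep Alt+key sending an ESC+key sequence instead of a character
! with the 8th bit set (counteracts eightBitInput in UTF-8 mode)
XTerm*metaSendsEscape: true
UXTerm*metaSendsEscape: true
```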

Cheers,
-RVP


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread Thomas Dickey
On Fri, Jan 20, 2023 at 11:26:49PM +0000, RVP wrote:
> On Fri, 20 Jan 2023, Valery Ushakov wrote:
> 
> > On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:
> > 
> > > Well, checking what printf results in, I get:
> > > 
> > > $ printf 'néz' | hexdump -C
> > > 00000000  6e e9 7a                                          |n.z|
> > > 00000003
> > > $ printf $'n\uE9z' | hexdump -C
> > > 00000000  6e c3 a9 7a                                       |n..z|
> > > 00000004
> > > 
> > > It's definitely different from what you got for 'néz'. What does
> > > that mean?
> > 
> > In the second example you specify \uE9 which is the Unicode code point
> > for e with acute.  It is then unconditionally converted by printf to
> > UTF-8 (which is two bytes: 0xc3 0xa9) on output.
> > 
> > It seems your terminal input is in ISO 8859-1.
> > 
> 
> That's it. The terminal emulator is not generating UTF-8 from the keyboard
> input.
> 
> > Maybe you need to specify the -u8 option or the utf8 resource?
> > 
> 
> That would work. So would running uxterm instead of xterm, but all of
> these mess up command-line editing: Alt+key is converted into a character
> code instead of an ESC+key sequence.

perhaps you're referring to eightBitInput (see manpage)

> R0ller, do this:
> 
> 1. Add your locale settings in ~/.xinitrc (or ~/.xsession if using xdm):
> 
> export LANG=hu_HU.UTF-8
> export LC_CTYPE=hu_HU.UTF-8
> export LC_MESSAGES=hu_HU.UTF-8

R0ller wasn't clear about whether this was done (outside the terminal).

Actually, R0ller didn't mention whether the terminal was the graphical
environment or the console (from the comments, I assumed the latter).
 
> 2. In ~/.Xresources, tell xterm to use the current locale when generating
>    characters:
> 
> XTerm*locale: true

that's redundant, since the default "medium" will give the same effect :-)
 
>    The `-lc' option does the same thing. If using uxterm, the class-name
>    becomes `UXTerm'.
> 
> 
> 
> 
> On Fri, 20 Jan 2023, Robert Elz wrote:
> 
> > I believe bash will take your current locale into account
> > when doing that [...]
> > 
> 
> That's correct. But as r0ller had a UTF-8 locale set, I didn't mention that.
> However, it is better to be precise, so thank you!
> 
> -RVP

-- 
Thomas E. Dickey 
https://invisible-island.net




Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread RVP

On Fri, 20 Jan 2023, Valery Ushakov wrote:


On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:


Well, checking what printf results in, I get:

$ printf 'néz' | hexdump -C
00000000  6e e9 7a                                          |n.z|
00000003
$ printf $'n\uE9z' | hexdump -C
00000000  6e c3 a9 7a                                       |n..z|
00000004

It's definitely different from what you got for 'néz'. What does
that mean?


In the second example you specify \uE9 which is the Unicode code point
for e with acute.  It is then unconditionally converted by printf to
UTF-8 (which is two bytes: 0xc3 0xa9) on output.

It seems your terminal input is in ISO 8859-1.



That's it. The terminal emulator is not generating UTF-8 from the keyboard
input.


Maybe you need to specify the -u8 option or the utf8 resource?



That would work. So would running uxterm instead of xterm, but all of
these mess up command-line editing: Alt+key is converted into a character
code instead of an ESC+key sequence.

R0ller, do this:

1. Add your locale settings in ~/.xinitrc (or ~/.xsession if using xdm):

export LANG=hu_HU.UTF-8
export LC_CTYPE=hu_HU.UTF-8
export LC_MESSAGES=hu_HU.UTF-8


2. In ~/.Xresources, tell xterm to use the current locale when generating
   characters:

XTerm*locale: true

   The `-lc' option does the same thing. If using uxterm, the class-name
   becomes `UXTerm'.




On Fri, 20 Jan 2023, Robert Elz wrote:


I believe bash will take your current locale into account
when doing that [...]



That's correct. But as r0ller had a UTF-8 locale set, I didn't mention that.
However, it is better to be precise, so thank you!

-RVP


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread Valery Ushakov
On Fri, Jan 20, 2023 at 15:09:44 +0100, r0ller wrote:

> Well, checking what printf results in, I get:
> 
> $ printf 'néz' | hexdump -C
> 00000000  6e e9 7a                                          |n.z|
> 00000003
> $ printf $'n\uE9z' | hexdump -C
> 00000000  6e c3 a9 7a                                       |n..z|
> 00000004
> 
> It's definitely different from what you got for 'néz'. What does
> that mean?

In the second example you specify \uE9 which is the Unicode code point
for e with acute.  It is then unconditionally converted by printf to
UTF-8 (which is two bytes: 0xc3 0xa9) on output.

It seems your terminal input is in ISO 8859-1.  0xe9 in the first example
is "LATIN SMALL LETTER E WITH ACUTE", that is, Unicode code point U+00E9,
which is encoded in Latin-1 as 0xE9.  So your terminal inserted 0xe9
when you pressed that key.  Maybe you need to specify the -u8 option or
the utf8 resource?  (I'm mostly using NetBSD headless, so I haven't been
following the current status of UTF-8 support in X.)
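
This can be checked directly with iconv(1): U+00E9 is a single byte in
Latin-1 but two bytes in UTF-8. A small sketch (octal escapes are used so
the result does not depend on the encoding this mail is viewed in):

```shell
# U+00E9 (e with acute) in UTF-8 is the byte pair 0xc3 0xa9:
printf '\303\251' | od -An -tx1
# converted to ISO 8859-1 it is the single byte 0xe9:
printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1
```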

-uwe


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread r0ller

Well, checking what printf results in, I get:

$ printf 'néz' | hexdump -C
00000000  6e e9 7a                                          |n.z|
00000003
$ printf $'n\uE9z' | hexdump -C
00000000  6e c3 a9 7a                                       |n..z|
00000004

It's definitely different from what you got for 'néz'. What does that mean?

Thanks,
r0ller

On 1/20/23 9:55 AM, RVP wrote:

[...]


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread Robert Elz
Date:Fri, 20 Jan 2023 08:55:45 + (UTC)
From:RVP 
Message-ID:  <4dd21c1f-f5c3-c3ba-96d8-cab73a0b...@sdf.org>

  | Both /bin/sh and bash output UTF-8 if given Unicode code-
  | points in the form `\uXXXX'. So,

I believe bash will take your current locale into account
when doing that, whereas neither /bin/sh nor /usr/bin/printf
do, they simply emit UTF-8 unconditionally.   This kind of
difference is (partly) why POSIX is not including the \u (or \U)
escape sequences in $'...' quoted strings in Issue 8.

Another is how the end of the hex-digit sequence is detected: is it
always exactly 4 hex digits (or 8 for \U), or any number up to 4 (or
8) if followed by a non-hex char, or as many hex chars as exist?  To
be portable (as input), such a string needs to be exactly 4 (8) hex
digits, and be followed by something which is not a hex digit - the
closing ' is often useful there; it can always be followed immediately
by $' to resume quoting again (or just ' or " if those are adequate).
But that's just the input: you also need to be using a locale with a
UTF-8 char encoding to get predictable output.
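
The portable shape kre describes can be sketched in bash (an
illustration; it assumes a bash that implements \u in $'...' and that a
C.UTF-8 locale is available):

```shell
# Exactly four hex digits, followed by a non-hex character ('z'),
# so every parsing variant described above reads the same escape.
# In a UTF-8 locale bash expands \u00E9 to the two bytes c3 a9:
LC_ALL=C.UTF-8 bash -c "printf '%s' \$'n\u00E9z'" | od -An -tx1
```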

kre


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-20 Thread RVP

On Fri, 20 Jan 2023, r0ller wrote:


Thanks for your efforts to reproduce it :) I just don't get why it works for 
you with the same locales and why it doesn't for me.
Are there any other settings that affect encoding besides LC variables and LANG?



Since we seem to have the same flookup binary, check against the
magyar.fst I used:

https://github.com/r0ller/alice/tree/master/hi_android/foma

Next check that the input you're feeding to flookup actually _is_
UTF-8. Both /bin/sh and bash output UTF-8 if given Unicode code-
points in the form `\uXXXX'. So,

$ printf 'néz' | hexdump -C
00000000  6e c3 a9 7a                                       |n..z|
00000004
$ printf $'n\uE9z' | hexdump -C
00000000  6e c3 a9 7a                                       |n..z|
00000004
$

If that works, then check those UTF-8 bytes against whatever the
terminal emulator generated from your keystrokes for the `é'
in `néz'.

-RVP


Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller
Thanks for your efforts to reproduce it :) I just don't get why it works
for you with the same locales and why it doesn't for me. Are there any
other settings that affect encoding besides the LC variables and LANG?

Best regards,
r0ller

-------- Original message --------
From: RVP
Date: 19 January 2023 12:44:25
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: r0ller

[...]

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller
Thanks! That's what I suspect as well but I just can't figure out how to
fix it.

Best regards,
r0ller

-------- Original message --------
From: Rhialto
Date: 19 January 2023 16:33:08
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: RVP

[...]

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Wed, 18 Jan 2023, r0ller wrote:


Actually, it's just inconvenient to always type the strings I want to
analyse into a program, then compile and execute it, instead of giving
it a go from the shell.



In a pinch, you can always use iconv(1) to do your conversions:

```
$ printf néz | hexdump -C    # locale is en_GB.UTF-8, i.e. UTF-8
00000000  6e c3 a9 7a                                       |n..z|
00000004
$ printf néz | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
00000000  6e e9 7a                                          |n.z|
00000003
$
```

-RVP

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Thu, 19 Jan 2023, Rhialto wrote:


I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.


echo néz|flookup magyar.fst
echo néz|flookup magyar.fst


The first case with 1 letter in the middle looking like an e + acute,
the second time as 2 characters, probably a UTF-8 encoding.



Yeah, some kind of encoding mismatch is responsible. Everything worked
for me because

a) all the text I copy-pasted was UTF-8 (even the 2nd one, which,
   unsurprisingly, didn't work.)

b) flookup was OK with UTF-8 input (or my converted UTF-8 input to match
   the text encoding in magyar.fst)

c) text in magyar.fst was in UTF-8/Unicode (or, if another encoding, then
   flookup did the conversion before doing the text lookup.)

b) and c) are educated guesses.

-RVP

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread Rhialto
I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + acute,
the second time as 2 characters, probably a UTF-8 encoding.

I'm not sure if this helps you directly but it may be a hint for
something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFHfalu.nl@rhialto




Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Wed, 18 Jan 2023, r0ller wrote:


echo néz|flookup magyar.fst

it results in:

néz +?

However, when passing the string as:

echo néz|flookup magyar.fst

I get a successful analysis:

néz    +swConsonant+néz[stem]+CON
néz    +swConsonant+néz[stem]+CON+Nom
néz    néz[stem]+Verb+IndefSg3



That should work--and it does. With a just-compiled flookup (from
foma-0.9.18.tar.gz in the link you provided, and the .fst file obtained
by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan  7 
15:04:01 UTC 2023  
mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz+?

$
```

-RVP


sending/receiving UTF-8 characters from terminal to program

2023-01-18 Thread r0ller
Hi All,

My locale is as follows:

export LANG="hu_HU.UTF-8"
export LC_CTYPE="hu_HU.UTF-8"
export LC_MESSAGES="hu_HU.UTF-8"

The problem is that even though I can type the special characters of the
locale everywhere in X (though it's funny in the terminal, where they
only appear after the third key press), when echoing a string containing
such chars (áéíóöőúű) and piping it to a program, they arrive somehow
differently compared to when the same string is passed by calling the
same program. I don't have any other example than foma
(https://fomafst.github.io). It has a tool called flookup which requires
special morphological dictionaries which most probably no one uses here
(except me), but when passing a string from the command line like:

echo néz|flookup magyar.fst

it results in:

néz +?

However, when passing the string as:

echo néz|flookup magyar.fst

I get a successful analysis:

néz    +swConsonant+néz[stem]+CON
néz    +swConsonant+néz[stem]+CON+Nom
néz    néz[stem]+Verb+IndefSg3

When calling the API function behind flookup from a program, passing the
string 'néz', I also get the analysis successfully. I don't have a clear
explanation for this (only partially), and I also wonder why the terminal
does not translate the locale's special UTF-8 bytes back to a character
when they're printed by the program. Actually, it's just inconvenient to
always type the strings I want to analyse into a program, then compile
and execute it, instead of giving it a go from the shell.

Could anyone explain to me what happens here and how I can handle it?

Thanks,
r0ller
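
A quick way to see the mismatch described above is to inspect the exact
bytes the shell hands to a program (a sketch; the octal escapes stand in
for what a UTF-8 terminal sends for 'é'):

```shell
# A UTF-8 terminal sends two bytes (0xc3 0xa9) for 'é'; a Latin-1
# terminal sends just 0xe9. od shows which one flookup would receive:
printf 'n\303\251z' | od -An -tx1
```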