Re: sending/receiving UTF-8 characters from terminal to program
Thanks for your efforts to reproduce it :) I just don't get why it works for you with the same locales and why it doesn't for me. Are there any other settings that affect encoding besides the LC_* variables and LANG?

Best regards,
r0ller

-------- Original message --------
From: RVP
Date: 19 January 2023 12:44:25
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: r0ller

On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo néz|flookup magyar.fst
>
> I get a successful analysis:
>
> néz +swConsonant+néz[stem]+CON
> néz +swConsonant+néz[stem]+CON+Nom
> néz néz[stem]+Verb+IndefSg3
>

That should work--and it does. With a just compiled flookup (from foma-0.9.18.tar.gz in the link you provided, and the .fst file got by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan 7 15:04:01 UTC 2023 mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz+?
$
```

-RVP
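On the question of which settings matter: under POSIX, LC_ALL overrides every LC_* variable and LANG is only the fallback, so a stray LC_ALL can silently defeat LANG and LC_CTYPE. The terminal emulator's own encoding setting is separate from all of these and decides which bytes a keypress actually produces. A minimal sketch of the precedence follows; effective_ctype is a helper name invented for this example, and the final fallback to "C" is an assumption (the real default is implementation-defined).

```shell
#!/bin/sh
# Hypothetical helper: print the locale name that governs character
# handling (the LC_CTYPE category), using the POSIX precedence
# LC_ALL, then LC_CTYPE, then LANG. Empty variables count as unset.
effective_ctype() {
    # Assumption: fall back to "C" when nothing is set.
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Show what the current shell would hand to a child process.
effective_ctype

# The character set the C library derives from that locale, if the
# locale(1) utility is installed.
locale charmap 2>/dev/null || true
```

Running this in the same shell that starts flookup, on both machines, would show whether the two environments really agree.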
Re: sending/receiving UTF-8 characters from terminal to program
Thanks! That's what I suspect as well but I just can't figure out how to fix it.

Best regards,
r0ller

-------- Original message --------
From: Rhialto
Date: 19 January 2023 16:33:08
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: RVP

I think there is some encoding confusion going on here. I still use boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu, the second time as 2 characters, probably an utf-8 encoding.

I'm not sure if this helps you directly but it may be a hint for something.

-Olaf.
--
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto
Re: sending/receiving UTF-8 characters from terminal to program
On Wed, 18 Jan 2023, r0ller wrote:

> Actually, it's just inconvenient to always type the strings what I want to analyse in a program, compile and execute it instead of giving it a go from the shell.

In a pinch, you can always use iconv(1) to do your conversions:

```
$ printf néz | hexdump -C    # locale is en_GB.UTF-8 ie. UTF-8
6e c3 a9 7a |n..z|
0004
$ printf néz | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
6e e9 7a |n.z|
0003
$
```

-RVP
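To take the terminal out of the picture entirely, the same check can be written with octal escapes, so the bytes are fixed no matter how the terminal encodes a typed "é". This is only a sketch, assuming od(1) and iconv(1) are available; hexbytes is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print stdin as one run of lowercase hex digits.
hexbytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# 'n\303\251z' spells "néz" as explicit UTF-8 bytes (6e c3 a9 7a),
# independent of the terminal or the current locale.
utf8_hex=$(printf 'n\303\251z' | hexbytes)

# The same word after conversion to ISO-8859-1: "é" becomes the single byte e9.
latin1_hex=$(printf 'n\303\251z' | iconv -f UTF-8 -t ISO-8859-1 | hexbytes)

echo "UTF-8:   $utf8_hex"
echo "Latin-1: $latin1_hex"
```

Piping `printf 'n\303\251z\n'` into flookup instead of a typed word would then show whether the failure comes from what the terminal sends or from flookup itself. Which encoding flookup expects is not established in this thread.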
Re: sending/receiving UTF-8 characters from terminal to program
On Thu, 19 Jan 2023, Rhialto wrote:

> I think there is some encoding confusion going on here. I still use boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the word néz differently.
>
> echo néz|flookup magyar.fst
> echo néz|flookup magyar.fst
>
> The first case with 1 letter in the middle looking like an e + aigu, the second time as 2 characters, probably an utf-8 encoding.

Yeah, some kind of encoding mismatch is responsible. Everything worked for me because

a) all the text I copy-pasted was UTF-8 (even the 2nd one, which, unsurprisingly, didn't work.)
b) flookup was OK with UTF-8 input (or it converted my UTF-8 input to match the text encoding in magyar.fst)
c) text in magyar.fst was in UTF-8/Unicode (or, if another encoding, then flookup did the conversion before doing the text lookup.)

b) and c) are educated guesses.

-RVP
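One common way to end up with such a mismatch is double encoding: text that is already UTF-8 gets treated as Latin-1 and converted to UTF-8 a second time. Whether that is what happened on r0ller's machine is not known; the sketch below only shows what the effect looks like at the byte level, assuming od(1) and iconv(1) are available. hexbytes is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print stdin as one run of lowercase hex digits.
hexbytes() {
    od -An -tx1 | tr -d ' \t\n'
}

# "néz" as correct UTF-8: the "é" is the two bytes c3 a9.
good=$(printf 'n\303\251z' | hexbytes)

# Misreading those bytes as ISO-8859-1 and converting to UTF-8 again
# turns each of the two bytes into its own two-byte sequence.
double=$(printf 'n\303\251z' | iconv -f ISO-8859-1 -t UTF-8 | hexbytes)

# Going the other way (UTF-8 to ISO-8859-1) undoes one layer and
# recovers the original UTF-8 bytes.
repaired=$(printf 'n\303\203\302\251z' | iconv -f UTF-8 -t ISO-8859-1 | hexbytes)

echo "good:     $good"
echo "double:   $double"
echo "repaired: $repaired"
```

The double-encoded form is still valid UTF-8, so nothing complains about it; it simply no longer matches the word stored in the transducer.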
Re: sending/receiving UTF-8 characters from terminal to program
I think there is some encoding confusion going on here. I still use boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu, the second time as 2 characters, probably an utf-8 encoding.

I'm not sure if this helps you directly but it may be a hint for something.

-Olaf.
--
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto
Re: sending/receiving UTF-8 characters from terminal to program
On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo néz|flookup magyar.fst
>
> I get a successful analysis:
>
> néz +swConsonant+néz[stem]+CON
> néz +swConsonant+néz[stem]+CON+Nom
> néz néz[stem]+Verb+IndefSg3

That should work--and it does. With a just compiled flookup (from foma-0.9.18.tar.gz in the link you provided, and the .fst file got by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan 7 15:04:01 UTC 2023 mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz+?
$
```

-RVP