date:20230119

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller

Thanks for your efforts to reproduce it :) I just don't get why it works for
you with the same locales but not for me. Are there any other settings that
affect encoding besides the LC_* variables and LANG?

Best regards,
r0ller

Original message
From: RVP
Date: 19 January 2023 12:44:25
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: r0ller

On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo nÃ©z|flookup magyar.fst
>
> I get a successful analysis:
>
> nÃ©z    +swConsonant+nÃ©z[stem]+CON
> nÃ©z    +swConsonant+nÃ©z[stem]+CON+Nom
> nÃ©z    nÃ©z[stem]+Verb+IndefSg3

That should work--and it does. With a just compiled flookup (from
foma-0.9.18.tar.gz in the link you provided, and the .fst file got
by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan  7 15:04:01 UTC 2023  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

$ echo nÃ©z | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
nÃ©z+?

$
```

-RVP
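
On the question of what else affects encoding besides LANG and the LC_* variables: under POSIX locale handling, LC_ALL overrides LC_CTYPE, which in turn overrides LANG, so a stray LC_ALL can silently defeat the other two. Below is a minimal sketch of that precedence; the function name `effective_ctype` is made up for illustration. The terminal emulator's own encoding setting is separate from all of these and cannot be read from the environment, so it is not covered here.

```shell
# Print the locale name that governs character encoding (the LC_CTYPE
# category), following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
# effective_ctype is a hypothetical helper name, not a standard utility.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'    # nothing set: the default "C"/POSIX locale applies
    fi
}

effective_ctype
```

Where the `locale` utility is available, `locale charmap` should report the character set the C library actually derives from these variables.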

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller

Thanks! That's what I suspect as well, but I just can't figure out how to fix
it.

Best regards,
r0ller

Original message
From: Rhialto
Date: 19 January 2023 16:33:08
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: RVP

I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.

> > echo néz|flookup magyar.fst
> > echo nÃ©z|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu,
the second time as 2 characters, probably a UTF-8 encoding.

I'm not sure if this helps you directly but it may be a hint for
something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP


On Wed, 18 Jan 2023, r0ller wrote:


> Actually, it's just inconvenient to always type the strings that
> I want to analyse into a program, compile and execute it instead of
> giving it a go from the shell.



In a pinch, you can always use iconv(1) to do your conversions:

```
$ printf néz | hexdump -C    # locale is en_GB.UTF-8, i.e. UTF-8
00000000  6e c3 a9 7a  |n..z|
00000004
$ printf néz | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
00000000  6e e9 7a  |n.z|
00000003
$
```

-RVP
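
The same tools can show where the two-character form "Ã©" comes from: it is what appears when the UTF-8 bytes of "é" are misread as Latin-1 and converted to UTF-8 a second time. A small sketch, assuming an iconv that knows both encodings; `hexbytes` and `utf8_nez` are helper names invented here, and the word is written as explicit octal escapes so the result does not depend on what the terminal sends for "é".

```shell
# hexbytes: print stdin as space-separated hex bytes on one line.
# (hypothetical helper; od -v avoids "*" lines for repeated data)
hexbytes() { echo $(od -An -v -tx1); }

# "néz" spelled out as explicit UTF-8 bytes (é = 0xc3 0xa9).
utf8_nez() { printf 'n\303\251z'; }

utf8_nez | hexbytes                                   # UTF-8 as is
utf8_nez | iconv -f UTF-8 -t ISO-8859-1 | hexbytes    # converted to Latin-1
utf8_nez | iconv -f ISO-8859-1 -t UTF-8 | hexbytes    # UTF-8 misread as Latin-1
```

The three lines print `6e c3 a9 7a`, `6e e9 7a` and `6e c3 83 c2 a9 7a`. The last one is the byte sequence a UTF-8 terminal displays as "nÃ©z".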

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP


On Thu, 19 Jan 2023, Rhialto wrote:


> I think there is some encoding confusion going on here. I still use
> boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
> word néz differently.
>
> > > echo néz|flookup magyar.fst
> > > echo nÃ©z|flookup magyar.fst
>
> The first case with 1 letter in the middle looking like an e + aigu,
> the second time as 2 characters, probably a UTF-8 encoding.



Yeah, some kind of encoding mismatch is responsible. Everything worked
for me because

a) all the text I copy-pasted was UTF-8 (even the 2nd one, which,
   unsurprisingly, didn't work.)

b) flookup was OK with UTF-8 input (or it converted my UTF-8 input to match
   the text encoding in magyar.fst)

c) the text in magyar.fst was in UTF-8/Unicode (or, if in another encoding,
   then flookup did the conversion before doing the text lookup.)

b) and c) are educated guesses.

-RVP
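
One way to test guess a) on a given machine is to check whether the bytes arriving on stdin are valid UTF-8 at all. The sketch below relies on iconv exiting non-zero when it meets a byte sequence that is invalid in the source encoding; `is_utf8` is a made-up helper name, and the inputs are written as octal escapes rather than typed characters.

```shell
# is_utf8: succeed if stdin is valid UTF-8. iconv exits non-zero when it hits
# a byte sequence that is invalid in the source encoding.
# (is_utf8 is a hypothetical helper name.)
is_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

# "néz" as UTF-8 (c3 a9) passes the check.
if printf 'n\303\251z' | is_utf8; then echo "c3 a9: valid UTF-8"; fi

# "néz" as Latin-1 (single byte e9) does not.
if ! printf 'n\351z' | is_utf8; then echo "e9: not valid UTF-8"; fi
```

Replacing the `printf` with `echo néz` typed at the prompt checks what the terminal really sends. Note that the double-encoded form "nÃ©z" is itself valid UTF-8, so this only separates single-byte Latin-1 input from UTF-8 input; a hex dump is still needed to tell the other two apart.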

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread Rhialto

I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.

> > echo néz|flookup magyar.fst
> > echo nÃ©z|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu,
the second time as 2 characters, probably a UTF-8 encoding.

I'm not sure if this helps you directly but it may be a hint for
something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto



Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP


On Wed, 18 Jan 2023, r0ller wrote:


> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo nÃ©z|flookup magyar.fst
>
> I get a successful analysis:
>
> nÃ©z    +swConsonant+nÃ©z[stem]+CON
> nÃ©z    +swConsonant+nÃ©z[stem]+CON+Nom
> nÃ©z    nÃ©z[stem]+Verb+IndefSg3



That should work--and it does. With a just compiled flookup (from
foma-0.9.18.tar.gz in the link you provided, and the .fst file got
by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan  7 15:04:01 UTC 2023  mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

$ echo nÃ©z | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
nÃ©z+?

$
```

-RVP
