Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller
Thanks for your efforts to reproduce it :) I just don't get why it works for
you with the same locales and why it doesn't for me. Are there any other
settings that affect encoding besides the LC variables and LANG?

Best regards,
r0ller

Original message
From: RVP
Date: 19 January 2023 12:44:25
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: r0ller

On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo néz|flookup magyar.fst
>
> I get a successful analysis:
>
> néz    +swConsonant+néz[stem]+CON
> néz    +swConsonant+néz[stem]+CON+Nom
> néz    néz[stem]+Verb+IndefSg3

That should work--and it does. With a just compiled flookup (from
foma-0.9.18.tar.gz in the link you provided, and the .fst file found
by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan  7 
15:04:01 UTC 2023  
mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz+?

$
```

-RVP

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread r0ller
Thanks! That's what I suspect as well but I just can't figure out how to fix
it.

Best regards,
r0ller

Original message
From: Rhialto
Date: 19 January 2023 16:33:08
Subject: Re: sending/receiving UTF-8 characters from terminal to program
To: RVP

I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu,
the second time as 2 characters, probably an utf-8 encoding.

I'm not sure if this helps you directly but it may be a hint for
something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Wed, 18 Jan 2023, r0ller wrote:

> Actually, it's just inconvenient to always type the strings that
> I want to analyse in a program, compile and execute it instead of
> giving it a go from the shell.

In a pinch, you can always use iconv(1) to do your conversions:

```
$ printf néz | hexdump -C    # locale is en_GB.UTF-8, i.e. UTF-8
  6e c3 a9 7a   |n..z|
0004
$ printf néz | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C
  6e e9 7a  |n.z|
0003
$
```

-RVP
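
The same idea works in the other direction. Below is a minimal sketch, assuming the terminal emits ISO-8859-1 while the program downstream expects UTF-8; which direction is actually needed in this thread is not established. The function name `latin1_to_utf8` is made up for illustration, and the octal escapes stand in for bytes typed at the terminal.

```shell
# Hypothetical helper: re-encode stdin from ISO-8859-1 to UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# 'n\351z' is "néz" as Latin-1 bytes (octal 351 = 0xe9).
# After conversion the dump should show the UTF-8 bytes 6e c3 a9 7a.
printf 'n\351z' | latin1_to_utf8 | od -An -tx1
```

If that assumption holds, the helper can sit in the middle of the pipeline, for example `echo néz | latin1_to_utf8 | flookup magyar.fst`.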

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Thu, 19 Jan 2023, Rhialto wrote:

> I think there is some encoding confusion going on here. I still use
> boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
> word néz differently.
>
> > > echo néz|flookup magyar.fst
> > > echo néz|flookup magyar.fst
>
> The first case with 1 letter in the middle looking like an e + aigu,
> the second time as 2 characters, probably an utf-8 encoding.

Yeah, some kind of encoding mismatch is responsible. Everything worked
for me because

a) all the text I copy-pasted was UTF-8 (even the 2nd one, which,
   unsurprisingly, didn't work.)

b) flookup was OK with UTF-8 input (or it converted my UTF-8 input to match
   the text encoding in magyar.fst)

c) text in magyar.fst was in UTF-8/Unicode (or, if another encoding, then
   flookup did the conversion before doing the text lookup.)

b) and c) are educated guesses.

-RVP
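
One way to test guess a) on your own input is to check whether the bytes reaching the pipe are well-formed UTF-8. This is only a sketch: `is_utf8` is a hypothetical name, and it assumes the local iconv exits non-zero when it meets an invalid input sequence, which is the behaviour I would expect but have not checked on every system.

```shell
# Hypothetical helper: succeed only if stdin is well-formed UTF-8.
# Assumes iconv exits non-zero on an invalid input sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "néz" encoded as UTF-8 (c3 a9) and as Latin-1 (e9), written as octal escapes.
if printf 'n\303\251z' | is_utf8; then echo "UTF-8"; else echo "not UTF-8"; fi
if printf 'n\351z' | is_utf8; then echo "UTF-8"; else echo "not UTF-8"; fi
```

Pure ASCII input passes the check in either encoding, so it only tells you something once an accented character is involved.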

Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread Rhialto
I think there is some encoding confusion going on here. I still use
boring old Latin-1 (iso 8859-1), and I saw the two occurrences of the
word néz differently.

> > echo néz|flookup magyar.fst
> > echo néz|flookup magyar.fst

The first case with 1 letter in the middle looking like an e + aigu,
the second time as 2 characters, probably an utf-8 encoding.

I'm not sure if this helps you directly but it may be a hint for
something.

-Olaf.
-- 
___ "Buying carbon credits is a bit like a serial killer paying someone else to
\X/  have kids to make his activity cost neutral." -The BOFH    falu.nl@rhialto




Re: sending/receiving UTF-8 characters from terminal to program

2023-01-19 Thread RVP

On Wed, 18 Jan 2023, r0ller wrote:

> echo néz|flookup magyar.fst
>
> it results in:
>
> néz +?
>
> However, when passing the string as:
>
> echo néz|flookup magyar.fst
>
> I get a successful analysis:
>
> néz    +swConsonant+néz[stem]+CON
> néz    +swConsonant+néz[stem]+CON+Nom
> néz    néz[stem]+Verb+IndefSg3


That should work--and it does. With a just compiled flookup (from
foma-0.9.18.tar.gz in the link you provided, and the .fst file found
by googling):

```
$ uname -a
NetBSD x202e.localdomain 9.3_STABLE NetBSD 9.3_STABLE (GENERIC) #0: Sat Jan  7 
15:04:01 UTC 2023  
mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64
$ export LANG=hu_HU.UTF-8
$ export LC_CTYPE=hu_HU.UTF-8
$ export LC_MESSAGES=hu_HU.UTF-8
$ /tmp/F/bin/flookup -v
flookup 1.03 (foma library version 0.9.18alpha)
$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz +swConsonant+néz[stem]+CON
néz +swConsonant+néz[stem]+CON+Nom
néz néz[stem]+Verb+IndefSg3

$ echo néz | /tmp/F/bin/flookup alice-master/hi_android/foma/magyar.fst
néz+?

$
```

-RVP
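
On the question of which settings matter besides LANG and the LC_* variables: LC_ALL, if set, overrides both, so it is worth checking before trusting the exports above. A small sketch follows; `effective_ctype` is a hypothetical helper, and the final fallback to `C` is an assumption, since the default locale is implementation-defined.

```shell
# Hypothetical helper: print the value that decides LC_CTYPE.
# POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty counts as unset.
# The "C" fallback is an assumption (the real default is implementation-defined).
effective_ctype() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype
locale charmap    # codeset the C library reports for the current locale
```

Note that this only covers what programs see through the locale. The terminal emulator has its own encoding setting, which decides what bytes a typed or pasted é turns into, and that is independent of these variables.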