Re: Compare string with German umlauts

2020-05-19 Thread Martin Tschierschke via Digitalmars-d-learn
On Monday, 18 May 2020 at 14:28:33 UTC, Steven Schveighoffer 
wrote:


What you need is to normalize the data for comparison: 
https://dlang.org/phobos/std_uni.html#normalize


For more reference: 
https://en.wikipedia.org/wiki/Combining_character


-Steve


I checked it again but could not reproduce the original error. It 
seems that my comparison string contained some other error. 
Nevertheless, it is good to know how to deal with encoding errors!




Re: Compare string with German umlauts

2020-05-19 Thread Martin Tschierschke via Digitalmars-d-learn
On Monday, 18 May 2020 at 14:28:33 UTC, Steven Schveighoffer 
wrote:

On 5/18/20 9:44 AM, Martin Tschierschke wrote:

[...]


using == on strings is going to compare the exact bits for 
equality. In Unicode, things can be encoded differently to make 
the same grapheme. For example, ö is a single code point, the o 
with a diaeresis (U+00F6). But you could also encode it with 2 code 
points -- a standard o, and then a combining diaeresis 
character (U+006F, U+0308).


What you need is to normalize the data for comparison: 
https://dlang.org/phobos/std_uni.html#normalize

Thank you, I will check that.



Re: Compare string with German umlauts

2020-05-19 Thread Martin Tschierschke via Digitalmars-d-learn

On Monday, 18 May 2020 at 14:22:31 UTC, WebFreak001 wrote:
[...]
It solved the problem, but what is the right way to use 
umlauts (encode them) inside the program?


Your code should have already worked like that, assuming your 
input file is a UTF-8 file. Check with an editor like Notepad++ 
or Visual Studio Code what the actual encoding of your text 
file is. In D all strings you specify in source are UTF-8 bytes 
in the end and a byte-by-byte comparison like with your line == 
"..." will cause it to fail if line is not UTF-8.

Thank you, I will check your hints!


Re: Compare string with German umlauts

2020-05-18 Thread Steven Schveighoffer via Digitalmars-d-learn

On 5/18/20 9:44 AM, Martin Tschierschke wrote:

Hi,
I have to find a certain line in a file, with a text containing umlauts.

How do you do this?

The following was not working:

foreach(i,line; file){
  if(line=="My text with ö oe, ä ae or ü"){
    writeln("found it at line ",i);
  }
}

I ended up using line.canFind("with part of the text without umlaut").

It solved the problem, but what is the right way to use umlauts (encode 
them) inside the program?




using == on strings is going to compare the exact bits for equality. In 
Unicode, things can be encoded differently to make the same grapheme. 
For example, ö is a single code point, the o with a diaeresis (U+00F6). 
But you could also encode it with 2 code points -- a standard o, and then a 
combining diaeresis character (U+006F, U+0308).


What you need is to normalize the data for comparison: 
https://dlang.org/phobos/std_uni.html#normalize


For more reference: https://en.wikipedia.org/wiki/Combining_character
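
A minimal sketch of the normalization approach, assuming the mismatch really comes from precomposed versus decomposed code points. The helper name `sameText` is made up for illustration; `normalize` and `NFC` come from std.uni.

```d
import std.stdio : writeln;
import std.uni : NFC, normalize;

/// Hypothetical helper: compares two strings after NFC normalization.
bool sameText(const(char)[] a, const(char)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00F6";   // ö as a single code point
    string decomposed  = "o\u0308";  // o followed by a combining diaeresis

    writeln(precomposed == decomposed);         // false: the UTF-8 bytes differ
    writeln(sameText(precomposed, decomposed)); // true: both normalize to U+00F6
}
```

Inside a byLine loop the call would look like `sameText(line, "My text with ö oe, ä ae or ü")`. Normalizing both sides on every comparison costs some allocations, so for many lines it may be worth normalizing the search string once up front.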

-Steve


Re: Compare string with German umlauts

2020-05-18 Thread WebFreak001 via Digitalmars-d-learn

On Monday, 18 May 2020 at 13:44:15 UTC, Martin Tschierschke wrote:

Hi,
I have to find a certain line in a file, with a text containing 
umlauts.


How do you do this?

The following was not working:

foreach(i,line; file){
 if(line=="My text with ö oe, ä ae or ü"){
   writeln("found it at line ",i);
 }
}

I ended up using line.canFind("with part of the text without 
umlaut").


It solved the problem, but what is the right way to use umlauts 
(encode them) inside the program?


Your code should have already worked like that, assuming your 
input file is a UTF-8 file. Check with an editor like Notepad++ 
or Visual Studio Code what the actual encoding of your text file 
is. In D all strings you specify in source are UTF-8 bytes in the 
end and a byte-by-byte comparison like with your line == "..." 
will cause it to fail if line is not UTF-8.


My guess is that your file is most likely encoded in 
Windows-1252 or ISO-8859-1 (Latin-1). To quickly check if it is UTF-8, 
print out your strings with separators between each character, 
e.g. using `writefln("%(%s, %)", line.byUTF!dchar);` (byUTF is in 
std.utf), and see if the output is actually 'M', 'y', ' ', 't', 'e', 
'x', 't', ' ', 'w', 'i', 't', 'h', ' ', 'ö', ' ', 'o', 'e', ',', ' ', 
'ä', ' ', 'a', 'e', ' ', 'o', 'r', ' ', 'ü'.
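
As a rough sketch of that check: the program below dumps both the raw bytes and the decoded characters, and also asks std.utf.validate whether the data is well-formed UTF-8. The helper `isValidUtf8` and the sample byte array are assumptions for illustration only. Note that byUTF substitutes U+FFFD for invalid sequences by default, so the hex dump is often the more telling output.

```d
import std.stdio : writefln, writeln;
import std.string : representation;
import std.utf : UTFException, byUTF, validate;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string utf8Line = "with \u00F6";

    // Made-up sample: "with ö" as Windows-1252 bytes (0xF6 is 'ö' there).
    immutable(ubyte)[] raw = [0x77, 0x69, 0x74, 0x68, 0x20, 0xF6];
    string legacyLine = cast(string) raw;

    writefln("%(%02x %)", utf8Line.representation);   // 77 69 74 68 20 c3 b6
    writefln("%(%02x %)", legacyLine.representation); // 77 69 74 68 20 f6
    writefln("%(%s, %)", utf8Line.byUTF!dchar);

    writeln(isValidUtf8(utf8Line));   // true
    writeln(isValidUtf8(legacyLine)); // false: a lone 0xF6 is not valid UTF-8
}
```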


If you have identified that the character encoding is indeed your 
problem, interpret your line with the correct character encoding 
using


import std.encoding;
Windows1252String win1252Line = cast(Windows1252String)line;

and then convert that to utf8:

string utf8Line;
transcode(win1252Line, utf8Line);

and then compare that with your input string:

if (utf8Line == "My text with ö oe, ä ae or ü") { ... }
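
Put together, the conversion steps might look like the following self-contained sketch. The function name `fromWindows1252` and the sample bytes are assumptions for illustration; `Windows1252String` and `transcode` are from std.encoding.

```d
import std.encoding : Windows1252String, transcode;
import std.stdio : writeln;

/// Hypothetical helper: reinterprets raw bytes as Windows-1252
/// and converts them to a UTF-8 string.
string fromWindows1252(const(ubyte)[] raw)
{
    auto win1252 = cast(Windows1252String) raw.idup;
    string utf8;
    transcode(win1252, utf8);
    return utf8;
}

void main()
{
    // Made-up sample: "with ö" as Windows-1252 bytes (0xF6 is 'ö' there).
    ubyte[] raw = [0x77, 0x69, 0x74, 0x68, 0x20, 0xF6];

    // After conversion the comparison against a UTF-8 literal succeeds.
    writeln(fromWindows1252(raw) == "with \u00F6"); // true
}
```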



Alternatively you can also change your comparison string to be in 
Windows-1252 encoding if you know that all your files will 
have this encoding, but I would advise against that. Instead, 
figure out the encoding based on common German characters 
or with an external library/program, and always convert to UTF-8 for 
all text operations.
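
One very simple way to "figure out the encoding" is sketched below, under a strong assumption: input that is not valid UTF-8 is treated as Windows-1252. That is a heuristic, not real detection, and the helper name `toUtf8` is made up for illustration.

```d
import std.encoding : Windows1252String, transcode;
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns `raw` as UTF-8.
/// Assumption: anything that is not valid UTF-8 is Windows-1252.
string toUtf8(const(char)[] raw)
{
    try
    {
        validate(raw);   // throws UTFException on malformed UTF-8
        return raw.idup;
    }
    catch (UTFException)
    {
        string utf8;
        transcode(cast(Windows1252String) raw.idup, utf8);
        return utf8;
    }
}

void main()
{
    immutable(ubyte)[] legacy = [0xFC];  // 'ü' in Windows-1252

    writeln(toUtf8("\u00FC") == "\u00FC");             // already UTF-8
    writeln(toUtf8(cast(string) legacy) == "\u00FC");  // converted
}
```

With something like this, every line can be passed through `toUtf8` before any comparison, so the rest of the program only ever sees UTF-8.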


Another tip: if you perform case-insensitive comparison with 
UTF-8, use std.uni : sicmp or icmp (sicmp is faster but less 
thorough, as it only does simple case folding) and use it like 
`sicmp(strA, strB) == 0`, replacing == with < or > if you want to 
sort. Note that this is not bound to any locale and is rather the 
invariant locale. You will probably want to use OS APIs or 
third-party libraries to do locale-based text operations (like 
text in a UI).
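
A short sketch of the difference between the two functions and of sorting with them. The helper `sortedIgnoringCase` and the word list are made up for illustration.

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.uni : icmp, sicmp;

/// Hypothetical helper: returns a case-insensitively sorted copy.
string[] sortedIgnoringCase(const string[] words)
{
    auto copy = words.dup;
    sort!((a, b) => icmp(a, b) < 0)(copy);
    return copy;
}

void main()
{
    // Simple case folding maps one code point to one code point.
    writeln(sicmp("\u00DCber", "\u00FCber") == 0);    // true (Über / über)

    // Full case folding also maps ß to "ss"; sicmp does not.
    writeln(icmp("Stra\u00DFe", "STRASSE") == 0);     // true
    writeln(sicmp("Stra\u00DFe", "STRASSE") == 0);    // false

    // Ordering is by folded code point, not by German collation rules,
    // so "Ärger" ends up after "Zoo".
    writeln(sortedIgnoringCase(["zebra", "\u00C4rger", "apfel", "Zoo"]));
}
```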


Compare string with German umlauts

2020-05-18 Thread Martin Tschierschke via Digitalmars-d-learn

Hi,
I have to find a certain line in a file, with a text containing 
umlauts.


How do you do this?

The following was not working:

foreach(i,line; file){
 if(line=="My text with ö oe, ä ae or ü"){
   writeln("found it at line ",i);
 }
}

I ended up using line.canFind("with part of the text without 
umlaut").


It solved the problem, but what is the right way to use umlauts 
(encode them) inside the program?