I was writing my output to two different files, and only one of them was set to UTF-8. The other must have been using some other encoding, because once I set it to UTF-8 as well, everything cleared up.

On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:
Well, at least I think that it's Unicode confusion. When I store values into a string (in an array of structs) and then compare it against itself, it compares fine, and if I write it out at that point it writes out fine. And validate says it's good Unicode.

But later...
valid = true, len = 17, wrd = true , cnt =     2, txt = Gesammtabentheuer
valid = true, len = 27, wrd = true , cnt = 1, txt = νεφεληγερέτης
valid = true, len = 17, wrd = true , cnt =     1, txt = ζητοῦσιν
valid = true, len = 36, wrd = true , cnt = 1, txt = αἱμορροϊδοκαύστης
valid = true, len = 18, wrd = true , cnt = 2, txt = δυνηθώμεν
valid = true, len = 20, wrd = true , cnt = 1, txt = προσκρούσω
valid = true, len = 20, wrd = true , cnt = 1, txt = σκοτωμένην
valid = true, len = 18, wrd = true , cnt = 1, txt = ἀγαπητοί
valid = true, len = 28, wrd = true , cnt = 1, txt = הַֽמְזִמָּתָהּ
valid = true, len = 19, wrd = true , cnt = 1, txt = Τυρρηνικά
valid = true, len = 17, wrd = true , cnt =     2, txt = IODOHYDRARGYRATIS
valid = true, len = 21, wrd = true , cnt = 1, txt = χοινικίσιν
valid = true, len = 17, wrd = true , cnt =     1, txt = Spectrophotometer
valid = true, len = 26, wrd = true , cnt = 1, txt = αἰνιττόμενοι
valid = true, len = 70, wrd = true , cnt = 1, txt = ΓΗΣΠΛΕΘΡΑΔΙΣΧΙΛΙΑΕΡΓΑΣΙΜΟΥΑΠΟΤΗΣΟΜΟ
valid = true, len = 18, wrd = true , cnt = 1, txt = μικρότατα
valid = true, len = 23, wrd = true , cnt = 1, txt = ἀποπάτησίν
valid = true, len = 18, wrd = true , cnt = 1, txt = מוֹקְשֵׁי
valid = true, len = 17, wrd = true , cnt =     1, txt = διαμένων
     . . . (etc. for 39599 lines)
(And it looks worse than that, actually, because control characters aren't coming through.) I think the originals were usually Greek letters, judging from an earlier test (why there should be so many Greek words I don't know... but if they're there I want them to be handled properly), but the corrupted text is such a small part of the original file that I can't be certain.

valid = true means that the string passed validate right before being printed. wrd = true means that the only characters in it should be isAlpha, hyphen, apostrophe, or underscore. cnt = n means that it was detected n times in the dataset (of 8013 text files). And the string in each struct is only written once in the execution of the program.
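For anyone trying to reproduce the valid, len, and wrd columns, here is a minimal sketch of how such checks could be written. The helper names isValidUtf8 and isWordLike are hypothetical (they are not from the original program), and the sketch assumes that len is the UTF-8 byte length of the string and that isAlpha refers to std.uni.isAlpha.

```d
import std.stdio : writefln;
import std.uni : isAlpha;
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // std.utf.validate throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// Hypothetical helper mirroring the "wrd" flag: every code point must be
// alphabetic, a hyphen, an apostrophe, or an underscore.
// Call it only on strings that already passed isValidUtf8, because
// decoding malformed UTF-8 in the foreach would throw.
bool isWordLike(string s)
{
    if (s.length == 0)
        return false;
    foreach (dchar c; s) // decodes the UTF-8 string one code point at a time
    {
        if (!(isAlpha(c) || c == '-' || c == '\'' || c == '_'))
            return false;
    }
    return true;
}

void main()
{
    // "\u03B6\u03B7\u03C4" is three Greek letters written as escapes.
    foreach (txt; ["Spectrophotometer", "\u03B6\u03B7\u03C4", "two words"])
        writefln("valid = %s, len = %s, wrd = %s, txt = %s",
                 isValidUtf8(txt), txt.length, isWordLike(txt), txt);
}
```

Note that txt.length counts UTF-8 code units (bytes), not characters, which would explain why the Greek entries report a len larger than the number of letters shown.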

I was scanning the dataset looking to see what long words were valid...I didn't expect THIS at all. And as you can see from, e.g., "Spectrophotometer", ASCII values don't seem to be damaged at all.

FWIW, I was expecting to encounter an occasional Greek, French, or Chinese word... but nothing like this. I'd think it was the conversion from string to dchar[] and back that was the problem, but when I test immediately after writing to the string, everything looks right. So I'm guessing it's something about how Unicode is handled.
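One way to rule out the string to dchar[] and back conversion is a small round-trip check. This is only a sketch, assuming the conversion goes through std.utf.toUTF32 and std.utf.toUTF8; the original program may convert differently, and roundTrips is a hypothetical name.

```d
import std.stdio : writefln;
import std.utf : toUTF32, toUTF8, validate;

// Hypothetical helper: converts s to dchar[] and back, and reports
// whether the result is byte-for-byte identical to the input.
bool roundTrips(string s)
{
    validate(s);                   // throws UTFException if s is malformed
    dchar[] wide = toUTF32(s).dup; // one dchar per code point
    string back = toUTF8(wide);    // re-encode as UTF-8
    return back == s;
}

void main()
{
    // Sample words taken from the output listed earlier in this message.
    foreach (w; ["Spectrophotometer", "ζητοῦσιν", "μικρότατα"])
        writefln("%s: bytes = %s, code points = %s, round trip ok = %s",
                 w, w.length, toUTF32(w).length, roundTrips(w));
}
```

If the round trip holds for the affected words, the in-memory strings are probably fine and the damage is more likely happening where the text is written out or viewed.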

