I was writing my output to two different files. Only one of them was
set to UTF-8; the other must have been using some other encoding,
because once I set it to UTF-8 as well, everything cleared up.
On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:
Well, at least I think that it's Unicode confusion. When I store a
value into a string (in an array of structs) and then compare it
against itself, it compares fine, and if I write it out at that point
it writes out fine. And validate says it's good Unicode.
But later...
valid = true, len = 17, wrd = true , cnt = 2, txt = Gesammtabentheuer
valid = true, len = 27, wrd = true , cnt = 1, txt =
νεÏεληγεÏá½³ÏηÏ
valid = true, len = 17, wrd = true , cnt = 1, txt = ζηÏοῦÏιν
valid = true, len = 36, wrd = true , cnt = 1, txt =
αἱμοÏÏοÏδοκαύÏÏηÏ
valid = true, len = 18, wrd = true , cnt = 2, txt =
δÏ
νηθÏμεν
valid = true, len = 20, wrd = true , cnt = 1, txt =
ÏÏοÏκÏοÏÏÏ
valid = true, len = 20, wrd = true , cnt = 1, txt =
ÏκοÏÏμÎνην
valid = true, len = 18, wrd = true , cnt = 1, txt =
á¼Î³Î±ÏηÏοί
valid = true, len = 28, wrd = true , cnt = 1, txt =
×Ö½Ö·×Ö°×Ö´×ָּתָ×Ö¼
valid = true, len = 19, wrd = true , cnt = 1, txt =
ΤÏ
ÏÏηνικά
valid = true, len = 17, wrd = true , cnt = 2, txt = IODOHYDRARGYRATIS
valid = true, len = 21, wrd = true , cnt = 1, txt =
ÏοινικίÏιν
valid = true, len = 17, wrd = true , cnt = 1, txt = Spectrophotometer
valid = true, len = 26, wrd = true , cnt = 1, txt =
αἰνιÏÏόμενοι
valid = true, len = 70, wrd =
true , cnt = 1, txt = ÎÎΣÎ
ÎÎÎΡÎÎÎΣΧÎÎÎÎÎΡÎÎΣÎÎÎÎ¥ÎÎ ÎΤÎΣÎÎÎ
valid = true, len = 18, wrd = true , cnt = 1, txt =
μικÏÏÏαÏα
valid = true, len = 23, wrd = true , cnt = 1, txt =
á¼ÏοÏá½±ÏηÏίν
valid = true, len = 18, wrd = true , cnt = 1, txt =
××ֹקְש×Öµ×
valid = true, len = 17, wrd = true , cnt = 1, txt = διαμένÏν
. . . (etc. for 39599 lines)
(And it looks worse than that, actually, because control characters
aren't coming through).
I think the originals were usually greek letters due to an earlier
test (why there should be so many greek words I don't know...but if
they're there I want them to be handled properly), but the corrupted
text is such a small part of the original file that I can't be
certain. valid = true means that the string passed validation right
before being printed. wrd = true means that the only characters in it
should be isAlpha, hyphen, apostrophe, or underscore. cnt = n means
that it was detected n times in the dataset (of 8013 text files). And
the string in each struct is only written once during the execution of
the program.
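
A rough sketch of the two checks described above, assuming std.utf.validate
for the "valid" flag and std.uni.isAlpha for the "wrd" flag. The function
names isValidUtf8 and isWordLike are made up for this illustration and are
not the ones in the actual program.

```d
import std.algorithm.searching : all;
import std.stdio : writeln;
import std.uni : isAlpha;
import std.utf : UTFException, validate;

/// Hypothetical wrapper: true if `s` is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

/// Hypothetical check: every code point is alphabetic, or one of - ' _
bool isWordLike(string s)
{
    return s.length > 0
        && s.all!(c => isAlpha(c) || c == '-' || c == '\'' || c == '_');
}

void main()
{
    writeln(isWordLike("Spectrophotometer")); // true
    writeln(isWordLike("\u03B6\u03B7"));      // true: Greek letters are alphabetic
    writeln(isWordLike("two words"));         // false: contains a space

    // A lone lead byte is not well-formed UTF-8.
    immutable ubyte[] bad = [0xCE];
    writeln(isValidUtf8(cast(string) bad));   // false
}
```

Note that both checks look only at the string in memory, so they say nothing
about how the bytes are interpreted after they have been written out.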
I was scanning the dataset looking to see what long words were
valid...I didn't expect THIS at all. And as you can see from, e.g.,
"Spectrophotometer", ASCII values don't seem to be damaged at all.
FWIW, I was expecting to encounter an occasional Greek, French, or
Chinese word...but nothing like this. I'd think it was the conversion
from string to dchar[] and back that was the problem, but when I test
immediately after writing to the string, everything looks right. So
I'm guessing it's something about how Unicode is handled.
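
One way to rule the conversion in or out is a round-trip test. This is only a
sketch, assuming the conversion is done with std.conv.to; the function name
roundTrips is hypothetical.

```d
import std.conv : to;
import std.stdio : writeln;

/// Hypothetical test: decode UTF-8 to UTF-32 and encode it back,
/// then check that the result equals the input.
bool roundTrips(string s)
{
    dchar[] wide = s.to!(dchar[]); // string -> dchar[]
    string back = wide.to!string;  // dchar[] -> string
    return back == s;
}

void main()
{
    writeln(roundTrips("Spectrophotometer"));
    writeln(roundTrips("\u03B6\u03B7\u03C4\u03BF\u1FE6\u03C3\u03B9\u03BD"));
}
```

If this holds for the affected words, the conversion is probably not where
the damage happens, and the output side is the next place to look.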