Re: unicode confusion--An answer

Charles Hixson via Digitalmars-d-learn Mon, 04 Apr 2016 17:26:15 -0700

I was writing my output to two different files. Only one of them was set to UTF-8; the other must have been using some other encoding, because when I set the encoding to UTF-8 everything cleared up.
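For anyone who hits the same symptom, here is a minimal sketch of how this kind of garbling arises. It is in Python rather than D, and it assumes the "other encoding" was Latin-1; the post never identifies it, so that is a guess based on how the output below looks. The sample word is my reading of one garbled entry, not something confirmed by the original data.

```python
# Sketch only: Latin-1 is an ASSUMED stand-in for the unidentified
# "other encoding"; the sample word is a guess at one garbled entry.
word = "\u03b6\u03b7\u03c4\u03bf\u1fe6\u03c3\u03b9\u03bd"  # Greek: zeta eta tau omicron upsilon-perispomeni sigma iota nu

# The program writes correct UTF-8 bytes.
utf8_bytes = word.encode("utf-8")

# A viewer that assumes Latin-1 maps every byte to one character,
# so each multi-byte Greek letter becomes two or three odd characters.
mojibake = utf8_bytes.decode("latin-1")

# The bytes were never damaged, so reversing the wrong decode
# recovers the original text.
recovered = mojibake.encode("latin-1").decode("utf-8")

print(len(word), len(utf8_bytes))
print(repr(mojibake))
print(recovered == word)
```

The point of the round trip is that the data in the file is intact; only the interpretation of the bytes is wrong, which is why changing the encoding setting was enough to fix it.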


On 04/04/2016 04:04 PM, Charles Hixson via Digitalmars-d-learn wrote:

Well, at least I think that it's unicode confusion. When I store values into a string (in an array of structs) and then compare it against itself, it compares fine, and if I write it out at that point it writes out fine. And validate says it's good unicode.
But later...
valid = true, len = 17, wrd = true , cnt =     2, txt = Gesammtabentheuer
valid = true, len = 27, wrd = true , cnt = 1, txt =Î½ÎµÏÎµÎ»Î·Î³ÎµÏá½³ÏÎ·Ï
valid = true, len = 17, wrd = true , cnt =     1, txt = Î¶Î·ÏÎ¿á¿¦ÏÎ¹Î½
valid = true, len = 36, wrd = true , cnt = 1, txt =Î±á¼±Î¼Î¿ÏÏÎ¿ÏÎ´Î¿ÎºÎ±á½»ÏÏÎ·Ï
valid = true, len = 18, wrd = true , cnt = 2, txt =Î´ÏÎ½Î·Î¸ÏÎ¼ÎµÎ½
valid = true, len = 20, wrd = true , cnt = 1, txt =ÏÏÎ¿ÏÎºÏÎ¿ÏÏÏ
valid = true, len = 20, wrd = true , cnt = 1, txt =ÏÎºÎ¿ÏÏÎ¼ÎÎ½Î·Î½
valid = true, len = 18, wrd = true , cnt = 1, txt =á¼Î³Î±ÏÎ·ÏÎ¿á½·
valid = true, len = 28, wrd = true , cnt = 1, txt =×Ö½Ö·×Ö°×Ö´×Ö¼Ö¸×ªÖ¸×Ö¼
valid = true, len = 19, wrd = true , cnt = 1, txt =Î¤ÏÏÏÎ·Î½Î¹Îºá½±
valid = true, len = 17, wrd = true , cnt =     2, txt = IODOHYDRARGYRATIS
valid = true, len = 21, wrd = true , cnt = 1, txt =ÏÎ¿Î¹Î½Î¹Îºá½·ÏÎ¹Î½
valid = true, len = 17, wrd = true , cnt =     1, txt = Spectrophotometer
valid = true, len = 26, wrd = true , cnt = 1, txt =Î±á¼°Î½Î¹ÏÏá½¹Î¼ÎµÎ½Î¿Î¹
valid = true, len = 70, wrd =true , cnt = 1, txt = ÎÎÎ£ÎÎÎÎÎ¡ÎÎÎÎ£Î§ÎÎÎÎÎÎ¡ÎÎÎ£ÎÎÎÎ¥ÎÎ ÎÎ¤ÎÎ£ÎÎÎ
valid = true, len = 18, wrd = true , cnt = 1, txt =Î¼Î¹ÎºÏÏÏÎ±ÏÎ±
valid = true, len = 23, wrd = true , cnt = 1, txt =á¼ÏÎ¿Ïá½±ÏÎ·Ïá½·Î½
valid = true, len = 18, wrd = true , cnt = 1, txt =××Ö¹×§Ö°×©×Öµ×
valid = true, len = 17, wrd = true , cnt =     1, txt = Î´Î¹Î±Î¼á½³Î½ÏÎ½
     . . . (etc. for 39599 lines)
(And it looks worse than that, actually, because control characters aren't coming through.) I think the originals were usually Greek letters due to an earlier test (why there should be so many Greek words I don't know... but if they're there I want them to be handled properly), but the corrupted text is such a small part of the original file that I can't be certain. Valid = true means that the string passed validation right before being printed. wrd = true means that the only characters in it should be isAlpha, hyphen, apostrophe, or underscore. cnt = n means that it was detected n times in the dataset (of 8013 text files). And the string in each struct is only written once in the execution of the program.
I was scanning the dataset looking to see what long words were valid... I didn't expect THIS at all. And as you can see from, e.g., "Spectrophotometer", ASCII values don't seem to be damaged at all.
FWIW, I was expecting to encounter an occasional Greek, French, or Chinese word... but nothing like this. I'd think it was the conversion from string to dchar[] and back that was the problem, but when I test immediately after I know I've written to the string everything looks right. So I'm guessing it's something about how unicode is handled.
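Two details in the output above fit a byte-level explanation: ASCII words come through untouched, and len looks like a count of UTF-8 code units (which is what a D string's .length reports) rather than a count of characters. The Python sketch below illustrates both. The helper name byte_and_char_counts is hypothetical, and Latin-1 is again only an assumed example of a mismatched encoding.

```python
# Sketch only: illustrates why ASCII survives an encoding mix-up and why
# a byte-based length exceeds the character count for non-ASCII text.
# Latin-1 is an ASSUMED example of the mismatched encoding.

def byte_and_char_counts(text):
    """Hypothetical helper: return (characters, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))

ascii_word = "Spectrophotometer"

# ASCII characters have the same single-byte form in UTF-8 and Latin-1,
# so reading the file with the wrong one of these changes nothing.
same_bytes = ascii_word.encode("utf-8") == ascii_word.encode("latin-1")

# A Greek letter needs two bytes in UTF-8, so the byte count doubles.
greek_letter = "\u03b4"  # Greek small letter delta

print(same_bytes)
print(byte_and_char_counts(ascii_word))
print(byte_and_char_counts(greek_letter))
```

This matches the log: the 17-letter ASCII words report len = 17, while the garbled entries report lengths well above their visible letter count.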
