On Mon, 9 Dec 2013, Philip Taylor wrote:
> Keith -- could you possibly supply an example of
> "a properly encoded utf-8 string" from which it
> can be unambiguously determined whether the string
> "sang" is an English word (the past tense of "sing")

I'll probably regret pointing this out, and the characters involved have
been deprecated since Unicode 5, but:

U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067

or in UTF-8 bytes:

f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae 73 61 6e 67

That is U+E0001 LANGUAGE TAG, then the tag characters for "en", then the
plain letters "sang".

The Web form you mentioned sanitizes away the special characters. I
don't think that's specific to "tags"; it also seems to block everything
outside the Basic Multilingual Plane. That is bad form for something
claiming to be an authoritative analyser of Unicode strings.

--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/
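
For anyone who wants to check the code points and bytes quoted above, here is a minimal Python sketch. It rebuilds the string from the listed code points, prints its UTF-8 bytes, and splits off the language tag. The helper name split_language_tag is hypothetical, made up for this illustration. It only handles a single tag at the very start of the string, which is an assumption of the sketch and not a full implementation of the tag mechanism.

```python
# Code points from the message: LANGUAGE TAG, TAG LATIN SMALL LETTER E,
# TAG LATIN SMALL LETTER N, then the plain ASCII letters "sang".
code_points = [0xE0001, 0xE0065, 0xE006E, 0x0073, 0x0061, 0x006E, 0x0067]

tagged = "".join(chr(cp) for cp in code_points)

# UTF-8 encoding, shown as space-separated hex bytes.
utf8_hex = " ".join(f"{b:02x}" for b in tagged.encode("utf-8"))

# Code points above U+FFFF lie outside the Basic Multilingual Plane.
outside_bmp = [f"U+{cp:04X}" for cp in code_points if cp > 0xFFFF]


def split_language_tag(text):
    """Hypothetical helper: return (language, remaining_text).

    Assumes at most one tag, at the start of the string, introduced by
    U+E0001. Tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E.
    """
    if not text.startswith("\U000E0001"):
        return None, text
    i = 1
    lang = []
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        lang.append(chr(ord(text[i]) - 0xE0000))
        i += 1
    return "".join(lang), text[i:]


language, word = split_language_tag(tagged)

print(utf8_hex)
print(outside_bmp)
print(language, word)
```

Run as-is, the first line printed should match the byte sequence given in the message, and the last line should read "en sang".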