On Mon, 9 Dec 2013, Philip Taylor wrote:
> Keith -- could you possibly supply an example of
> "a properly encoded utf-8 string" from which it
> can be unambiguously determined whether the string
> "sang" is an English word (the past tense of "sing")

I'll probably regret pointing this out, and the characters involved (the
Plane 14 language tag characters, here spelling out "en") have been
deprecated since Unicode 5.1, but:

   U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067

or in UTF-8 bytes:

   f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae 73 61 6e 67
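
For anyone who wants to check the bytes, here is a rough Python sketch. It
rebuilds the string from the code points listed above, prints its UTF-8
encoding, and then reads the language tag back off the front. The helper
name split_language_tag is made up for illustration and is not part of any
library. It assumes the simple case of a single U+E0001 followed by tag
characters in the range U+E0020 to U+E007E.

```python
# Code points from the message: U+E0001 LANGUAGE TAG, then the tag
# characters for "e" and "n", then the plain ASCII letters of "sang".
code_points = [0xE0001, 0xE0065, 0xE006E, 0x0073, 0x0061, 0x006E, 0x0067]

text = "".join(chr(cp) for cp in code_points)
utf8_hex = " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def split_language_tag(s):
    """Hypothetical helper: return (language, remainder).

    If s starts with U+E0001, the tag characters that follow are mapped
    back to ASCII by subtracting 0xE0000. Otherwise language is None.
    """
    if not s.startswith("\U000E0001"):
        return None, s
    i = 1
    letters = []
    while i < len(s) and 0xE0020 <= ord(s[i]) <= 0xE007E:
        letters.append(chr(ord(s[i]) - 0xE0000))
        i += 1
    return "".join(letters), s[i:]


language, word = split_language_tag(text)
print(utf8_hex)
print(language, word)
```

The subtraction works because each tag character sits at exactly 0xE0000
above the ASCII character it mirrors.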

The Web form you mentioned sanitizes away the special characters.  I don't
think that's unique to "tags" - it seems to also block everything outside
the Basic Multilingual Plane.  Bad form for something claiming to be an
authoritative analyser of Unicode strings.
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/


--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
