Peter Kirk [mailto:[EMAIL PROTECTED] writes:
> Why is this a problem? Quotes and ">" with combining marks are
> presumably not legal HTML or XML;

You're wrong: it is legal in both HTML and XML. What is not specified correctly is the behavior of HTML and XML parsers when faced with an XML or HTML document that claims to be coded with a Unicode encoding scheme or any other Unicode-compatible CES (like GB18030, but not completely MacRoman, as it contains supplementary characters that are not part of the Unicode/ISO/IEC 10646 repertoire).

> and so the interpretation of a quotes
> or ">" followed by combining marks as a quote or ">" and a defective
> combining sequence is unambiguous, surely?

No, it is not: there is a problem of precedence between the XML/HTML/SGML parsing rules and the Unicode parsing rules. Using character entities can solve this problem, but I would really prefer that the W3C accept a modification of its parsing rules so that any text element or attribute value starting with a defective combining sequence MUST NOT be interpreted as such using the simple encoding scheme.

If an XML document is serialized into a text file with an encoding scheme, the generated file should (I would prefer "must") not encode these defective sequences with the encoding scheme, but with character references only. This would allow using exactly the SAME text parser used for Unicode as the input for the lexical and grammatical analysis of the XML/HTML/SGML parser. Within that model, the sequence ">" + combining character would be seen as a single combining sequence coding an abstract character, which breaks the syntax of the expected end of tags. The same holds for the quotes delimiting the start of attribute values, or for the square bracket delimiting the start of a CDATA section.

> There could of course be
> problems if there were any precomposed combinations of quotes or ">"
> with combining characters, but I don't think there are any, are there?

There are such precomposed sequences in Unicode. Look in NormalizationTest.txt for the places where ">" and single and double quotes are used as part of a combining sequence...
Notably, look at sequences made with the combining solidus overlay; add also the case of enclosing combining characters, and of mathematical operators that can be created with a combining sequence starting with ">" or "=" or single or double quotes, and modified by diacritics.

> Your proposed solution to the problem is messy in requiring the use of
> numeric entities, and unnecessary.

This is not that messy. Also, I did not say that numeric entities must be used: any parsed named entity could be used as well. This is not a problem of the Unicode standard, but a problem of the SGML, HTML 4.01, and XML standards. For SGML and HTML up to 4.01, you also have problems with the equals sign (because the quotes around an element's attribute values are not mandatory, unlike in XML).

We don't have this problem for element names or attribute names, because they must obey a stricter syntax and can't be arbitrary Unicode strings: these names cannot contain defective combining sequences simply because combining characters cannot be identifier starts.
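
To make the clash between the two sets of parsing rules concrete, here is a minimal sketch in Python, assuming only the standard "unicodedata" module. The helper names (escape_leading_marks, serialize) and the sample element "a" are invented for the illustration and come from no specification. It serializes an element whose text starts with U+0338 COMBINING LONG SOLIDUS OVERLAY, once naively and once with the leading mark written as a numeric character reference, and then applies NFC to both results.

```python
import unicodedata


def escape_leading_marks(text):
    """Hypothetical helper: write combining marks that start a text node or
    attribute value as numeric character references."""
    i = 0
    # General categories Mn, Mc and Me cover nonspacing, spacing and
    # enclosing combining marks.
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    return "".join("&#x%X;" % ord(c) for c in text[:i]) + text[i:]


def serialize(name, text, escape):
    """Toy serializer for a single element; a real one would also have to
    escape "<", "&" and so on."""
    body = escape_leading_marks(text) if escape else text
    return "<%s>%s</%s>" % (name, body, name)


# Defective combining sequence: the text starts with U+0338.
content = "\u0338x"

naive = serialize("a", content, escape=False)
safe = serialize("a", content, escape=True)

# NFC composes ">" + U+0338 into U+226F NOT GREATER-THAN, so the ">" that
# closed the start tag is no longer present in the normalized file.
print(ascii(unicodedata.normalize("NFC", naive)))

# With the character reference, ">" and the mark are never adjacent, so
# normalization leaves the markup untouched.
print(ascii(unicodedata.normalize("NFC", safe)))
```

The same composition happens with "=" followed by U+0338, which NFC turns into U+2260 NOT EQUAL TO; that is the unquoted attribute value case in SGML and HTML up to 4.01.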

