> [Change Made]
> I've now added this clarification to Summary of Attribute Values
>
> <GH>I returned to the definitions in Unicode now, and think now that
> "character" is too vague. Unicode has in its glossary 4 different meanings
> of character, and some of them certainly can result in multiple code points.
> So, I hope you have formulated something that very reliably tells that we
> count code points.
>
> Even this description is hard to evaluate the nomenclature from:
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#G2212

[Change Made]
I am pleased to say that 0.6 is now vastly clearer about the Unicode nomenclature. Version 0.6 has just been published at
http://www.xmpp.org/extensions/xep-0301.html

Though "character" is often ambiguous, other Unicode documents use "character" to describe a single code point:
- unicode.org FAQ -- http://www.unicode.org/faq/basic_q.html -- has questions that refer to characters and code points interchangeably.
- unicode.org glossary -- definition (3) of "Character" and definition (2) of "Code Point" are compatible with allowing a document to specifically refer to the equivalence.
- RFC5198 "Unicode Format for Network Interchange" -- http://tools.ietf.org/html/rfc5198 -- specifically defines a character as a code point, but continues to use the word "character".

So, these changes were made:

http://xmpp.org/extensions/xep-0301.html#attribute_values
Section 4.5.2 Attribute Values
"For the purpose of this specification, the word "character" represents a single Unicode code point. See [[[Unicode Character Counting(link)]]]."

http://xmpp.org/extensions/xep-0301.html#accurate_processing_of_action_elements
Section 4.7 Accurate Processing of Action Elements
-- It reads a little less arduously.
-- I now reference RFC5198, and clearly state what "character" means.
-- I now reference Normalization Form C (which is in widespread use for networking including XMPP anyway, and is a default on many OS platforms).
-- Section 4.7 is still big, but the guidelines have been vastly clarified to reduce misunderstandings.
-- The first two sentences of "Unicode Character Counting" now behave as a quick definition:
"For this specification, a "character" represents a single Unicode code point. This is the same definition used in section 1.1 of IETF RFC 5198 [11]."

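To illustrate why the wording matters, here is a minimal Python sketch of the counting rule. It is not taken from the spec: the helper names (count_code_points, count_utf16_units, to_nfc) are hypothetical, and it assumes Python 3, where a str is a sequence of Unicode code points. It shows that a code point count can differ both from a UTF-16 code unit count and from what a user perceives as one character, and that Normalization Form C can change the count.

```python
import unicodedata


def count_code_points(text: str) -> int:
    # In Python 3, len() on a str counts Unicode code points.
    return len(text)


def count_utf16_units(text: str) -> int:
    # Each UTF-16 code unit occupies 2 bytes; code points above U+FFFF
    # need a surrogate pair, i.e. 2 code units.
    return len(text.encode("utf-16-le")) // 2


def to_nfc(text: str) -> str:
    # Normalization Form C composes base + combining sequences where possible.
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
emoji = "\U0001F600"     # one code point outside the Basic Multilingual Plane

print(count_code_points(decomposed))          # 2
print(count_code_points(to_nfc(decomposed)))  # 1
print(count_code_points(emoji))               # 1
print(count_utf16_units(emoji))               # 2
```

A platform whose native strings are UTF-16 would, under this assumption, need to convert its string lengths and indexes to code point counts before using them as "character" counts in the sense defined above.
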
Also, all this wordy Unicode-related stuff has now been moved to the bottom of Protocol, keeping the spec easier to read and tidier, while keeping the important (arduous-to-read but unfortunately necessary) "devil-in-the-details" stuff for extended reading near the bottom of the Protocol section. (The "4. Protocol" section is only 1/4 the size of the rest of the document.)

At the same time, the terminology has been made more user-friendly and compatible with widespread usage, and the handy RFC5198 provides me a convenient reference.

Thanks,
Mark Rejhon
