Re: Proposed Draft UTR #31 - Syntax Characters

Peter Kirk Tue, 26 Aug 2003 10:38:13 +0000

On 26/08/2003 00:07, [EMAIL PROTECTED] wrote:

I'm afraid that's not very practical, because, you see, if I have a
hypothetical compiler for some hypothetical programming-language, and I
download some source-code from the internet and try to compile it, I expect
one of two things, either (1) it will compile cleanly, or (2) I will have to
UPGRADE my compiler (or version of Unicode), after which it will compile
cleanly.

I don't expect, however, to have to DOWNgrade my version of Unicode. And I
can't be expected to store EVERY numbered version of Unicode on my machine.

I prefer the idea that the list of allowed identifier characters increases
with each version of Unicode (or equivalently, that a list of excluded
characters decreases with each version of Unicode).

Agreed. I thought I had made this clear, though perhaps some of the clarification was off-list. My preference is for a list of syntax (operator) characters which can be added to but not subtracted from. This should avoid any need to downgrade.

I would also suggest that all punctuation characters and all undefined characters be reserved, i.e. they should not be used unquoted in strings, as they may be defined as syntax characters in later versions. Implementations would not be obliged to check for misuse of these reserved characters; it is up to the user to avoid them. (This kind of loose syntax may not be ideal, but it is common practice, e.g. with HTML, which most browsers do not fully validate. An implementation would be free to check against the list of reserved characters in the current UCD if preferred.) But a guarantee could be made that characters currently defined in Unicode as non-punctuation will never be defined as syntax characters.

My suggestion is actually rather similar to what is already written in UTR #31 section 4:

With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. By using this policy, it preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.

The difference is that I am extending the list of possible syntax characters to all punctuation characters. And perhaps a subset of these theoretically possible syntax characters can be defined as the allowed syntax characters in any one version of Unicode. But perhaps this isn't necessary, as each pattern language can define and check for its own subset as long as it only uses defined punctuation characters.
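To make the proposal concrete, here is a rough sketch in Python of the optional check an implementation might run. It rests on several assumptions of mine that are not in UTR #31: "punctuation" is read as the General_Category values beginning with P, "undefined" as Cn (unassigned), and the quoting mechanism is a pair of apostrophes. The names is_reserved and unquoted_reserved are hypothetical, and the result depends on the version of the UCD shipped with the Python in use.

```python
import unicodedata


def is_reserved(ch):
    """True if ch is punctuation (category P*) or unassigned (Cn).

    Assumption: this is only one possible reading of "all punctuation
    characters and all undefined characters".
    """
    cat = unicodedata.category(ch)
    return cat.startswith("P") or cat == "Cn"


def unquoted_reserved(pattern, syntax, quote="'"):
    """List (index, char) for reserved characters that appear outside
    quotes and are not in this pattern language's own syntax set."""
    problems = []
    in_quote = False
    for i, ch in enumerate(pattern):
        if ch == quote:
            # The quote character itself toggles quoting and is never reported.
            in_quote = not in_quote
        elif not in_quote and is_reserved(ch) and ch not in syntax:
            problems.append((i, ch))
    return problems


# A hypothetical pattern language whose only syntax character is "*".
print(unquoted_reserved("a*b;c", {"*"}))    # [(3, ';')]
print(unquoted_reserved("a*b';'c", {"*"}))  # []
```

Note that under this reading a symbol such as "+" (category Sm) is not reserved, so a real list would need to decide separately how to treat symbols.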

The reason why a change is needed is mainly to avoid the ethnocentric definition of only Latin punctuation characters as valid syntax characters. I have also seen the serious problems which have resulted from premature freezing of inappropriate properties, e.g. the combining classes of Hebrew points.

I am making these points in an official submission to the review process.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
