Richard has given some cogent arguments below.

Another counterexample is the use of ":" to form abbreviations in Swedish. (It is inserted in the word to replace the elided part.) In that use, this punctuation character is suddenly part of a "word".

To handle the full set of general cases, word recognition has to be plenty smart (and sensitive to context or environment). The basic, untailored "default" word breaking algorithm will only ever get the plain vanilla cases right.
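
To make the point concrete, here is a minimal sketch in Python. It is not the Unicode default algorithm; both patterns and the helper name `words` are illustrative assumptions, chosen only to contrast an untailored rule with one tailored for the Swedish colon.

```python
import re

# Untailored rule (assumption for illustration): a word is a run of letters.
PLAIN = re.compile(r"[^\W\d_]+")

# Hypothetical tailoring: also accept a single ':' or apostrophe
# when it sits between two runs of letters.
TAILORED = re.compile(r"[^\W\d_]+(?:[:'’][^\W\d_]+)*")


def words(text, pattern):
    """Return every substring of text that the pattern treats as a word."""
    return pattern.findall(text)


sample = "S:t Eriksgatan"        # Swedish abbreviation of "Sankt"
print(words(sample, PLAIN))      # ['S', 't', 'Eriksgatan']
print(words(sample, TAILORED))   # ['S:t', 'Eriksgatan']
```

The tailored pattern is no smarter than the plain one, though: it would equally glue together "said:yes" in carelessly typed English, which is why real word recognition needs context.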

Basing decisions about the encoding of characters on the failings of such simple-minded algorithms is really a non-starter. (The few existing exceptions just prove the rule.)

A./

On 3/9/2013 6:52 PM, Richard Wordingham wrote:
On Sat, 09 Mar 2013 16:21:17 -0700
Karl Williamson <[email protected]> wrote:

Rendering is not the only consideration.  Processing textual content
for 0387 is broken because it is considered to be an ID_Continue
character, whereas its Greek usage is equivalent to the English
semicolon, something that would never occur in the middle of a word
nor an identifier.

ID_Continue is for processing things like variable names.  How does
allowing U+0387 in variable names cause problems in the processing of
text?

How would ID_continue allow you to process English «foc’s’le» or
«co-operate»?  The default word boundary determination has been
tailored to give you the right results, and should work for Greek unless
you are working with scripta continua, in which case you have massive
problems regardless.

Note also that word boundary determination is intended to be
tailorable, which would allow one to exclude U+00B7 and U+0387 from
words or deal with miscoded accents and breathings physically at the
start of a word beginning with a capitalised vowel. One should also be
able to tailor it to deal with word final apostrophes - though doing
that in the CLDR style could be computationally excessive if the text
may contain quoting apostrophes.  One might even tailor it to allow
Greek «ὅ,τι», depending on whether one wishes to count it as a word.

Richard.




