On Sat, 09 Mar 2013 16:21:17 -0700 Karl Williamson <[email protected]> wrote:
> Rendering is not the only consideration. Processing textual content
> for 0387 is broken because it is considered to be an ID_Continue
> character, whereas its Greek usage is equivalent to the English
> semicolon, something that would never occur in the middle of a word
> nor an identifier.

ID_Continue is for processing things like variable names. How does allowing U+0387 in variable names cause problems in the processing of text? How would ID_Continue allow you to process English «foc’s’le» or «co-operate»?

The default word boundary determination has been tailored to give you the right results, and should work for Greek unless you are working with scripta continua, in which case you have massive problems regardless.

Note also that word boundary determination is intended to be tailorable, which would allow one to exclude U+00B7 and U+0387 from words, or to deal with miscoded accents and breathings physically at the start of a word beginning with a capitalised vowel. One should also be able to tailor it to deal with word-final apostrophes, though doing that in the CLDR style could be computationally excessive if the text may contain quoting apostrophes. One might even tailor it to allow Greek «ὅ,τι», depending on whether one wishes to count it as a word.

Richard.
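To illustrate the tailoring point, here is a minimal sketch in Python. It is not an implementation of the default word boundary rules; it only mimics the idea that certain characters join two letters into one word, and that the set of such characters can be changed. The function name `split_words` and both character sets are hypothetical, chosen for this example.

```python
# Simplified illustration of tailorable word boundaries.
# Assumption: a "word" is a run of alphabetic characters, optionally joined
# by a "mid-letter" character that sits directly between two letters.
# This is a toy model, not the full default word boundary algorithm.

# Hypothetical default set: middle dot, ano teleia, and both apostrophes.
DEFAULT_MID_LETTER = frozenset({"\u00B7", "\u0387", "'", "\u2019"})

# Hypothetical Greek tailoring: U+00B7 and U+0387 no longer join letters.
GREEK_MID_LETTER = DEFAULT_MID_LETTER - {"\u00B7", "\u0387"}


def split_words(text, mid_letter=DEFAULT_MID_LETTER):
    """Return the words in text, using mid_letter as the joining set."""
    words = []
    current = []
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current.append(ch)
        elif ch in mid_letter and current and next_is_letter:
            # A joining character between two letters stays inside the word.
            current.append(ch)
        else:
            # Anything else ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
    if current:
        words.append("".join(current))
    return words


sample = "καλος\u0387κακος"
print(split_words(sample))                    # one word, U+0387 kept inside
print(split_words(sample, GREEK_MID_LETTER))  # ['καλος', 'κακος']
print(split_words("foc\u2019s\u2019le"))      # apostrophes stay inside the word
```

Handling «ὅ,τι» or word-final apostrophes would need further rules, for example lookahead past the punctuation, which this sketch does not attempt.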

