Richard Wordingham wrote:

> European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used
> with variation selectors. As their primary purpose is for use with
> U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to
> recognise strings of digits with variation selectors as representing
> numbers?

Legitimate for *what*?

I suppose you could say that if a process claims to recognize strings of digits with variation selectors as representing numbers, then it would not be legitimate for that process to fail to do so. Conversely, if a process does not claim to recognize strings of digits with variation selectors as representing numbers, then it would be legitimate (and expected) for that process to fail to do so.

Recognizing "numbers" is really outside the scope of the Unicode Standard, although admittedly it is not outside the scope of LDML, which does need to recognize numeric formats for localization.

> If not, it seems that I will have to raise this as an issue for LDML,
> as it affects parsing and collation with the numeric tailoring.

By default, at least, the presence of variation selectors shouldn't affect searching or collation. It seems to me that the more significant issue here would be whether the enclosing combining marks are present, whether or not any variation selectors are present. So:

<U+0031, U+20E3, U+0032, U+20E3>

isn't much different, for this purpose, than:

<U+0031, U+FE0F, U+20E3, U+0032, U+FE0F, U+20E3>

I wouldn't really expect most processes to recognize either of those sequences as "a number" for parsing purposes.

But if your issue here is worrying about whether the presence of variation selectors would screw up collation with numeric tailoring, it seems to me that is really an extreme edge case of an edge case anyway.

My expectation would be, rather, that if you are planning to do anything really significant with numbers, you'd have to have a fully tokenizing parser anyway, at which point you assign some appropriate numeric value to your token and do something significant with it thereafter.
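
To make the token-level step concrete, here is a minimal sketch in Python of assigning a numeric value to a token of European digits, with a switch for how to handle variation selectors and other format controls. The names parse_decimal_token, on_format_control and FormatControlError are hypothetical, invented for this illustration; nothing here is taken from LDML or ICU, and a real parser would also deal with signs, separators and other digit sets.

```python
import unicodedata


class FormatControlError(ValueError):
    """Raised when a token contains a format control and the policy is 'error'."""


def is_format_or_variation_selector(ch):
    """True for general category Cf (e.g. ZWJ, LRM) and for variation selectors."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) == "Cf"
        or 0xFE00 <= cp <= 0xFE0F      # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF    # VS17..VS256
    )


def parse_decimal_token(token, on_format_control="ignore"):
    """Return the integer value of a token made of U+0030..U+0039.

    on_format_control is an assumed policy switch:
      "ignore" skips variation selectors and format controls,
      "error"  raises FormatControlError when one is present.
    Any other non-digit (for example U+20E3) makes the token a non-number.
    """
    if on_format_control not in ("ignore", "error"):
        raise ValueError("unknown policy: %r" % (on_format_control,))

    value = 0
    seen_digit = False
    for ch in token:
        if is_format_or_variation_selector(ch):
            if on_format_control == "error":
                raise FormatControlError("U+%04X in numeric token" % ord(ch))
            continue  # policy "ignore": drop the character silently
        if not ("0" <= ch <= "9"):
            raise ValueError("U+%04X is not a European digit" % ord(ch))
        value = value * 10 + (ord(ch) - 0x30)
        seen_digit = True

    if not seen_digit:
        raise ValueError("token contains no digits")
    return value
```

Under the "ignore" policy, parse_decimal_token("1\uFE0F2\uFE0F") returns 12, while the keycap sequence "1\u20E32\u20E3" is rejected with a ValueError under either policy, since the enclosing mark is neither a digit nor a format control.
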
Modifying such a parser either to a) ignore the presence of variation selectors (or any other format control characters), or to b) treat the presence of variation selectors (or any other format control characters) as an error, ought to be relatively routine.

If, on the other hand, you are doing "numeric tailoring" for collation by using ICU-style tailoring rules and expecting the string comparison routine for collation to produce numerical ordering, I would think that is unlikely to be very robust, and could be complicated by all sorts of format issues, not merely the presence or absence of variation selectors.

I don't see this as fundamentally any different for the variation selectors than it would be for other format controls. So what would you be doing, for example, with "numbers" like the following:

123456

versus

123<ZWJ>456

versus

123<LRM>456

--Ken

