Hi list,

As of Unicode 5.1, the MidNumLet Word_Break property value (apostrophe-like and dot-like characters) causes sequences like < (ALetter)+ MidNumLet (ALetter)+ > to be treated as a single word. Whilst this seems to be an improvement in handling words like "can’t" or "aujourd’hui", it also causes a regression in handling words separated by a dot, e.g. domain names (http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/63311), mistyped text ("hi.there"), or navigating through code ("struct.member"; yeah, I know this is out of scope of the default algorithm, but still), and so on.

And the worst thing is that the default algorithm now specifies a sentence break in the middle of a word. For example, "mr.Hamster" is two sentences due to rule SB8 but still a single word due to rules WB6-WB7.
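To make the word-level behaviour concrete, here is a minimal Python sketch of rules WB5-WB7 only. It is not the full UAX #29 algorithm: the names word_break_class and split_words are made up for this illustration, every letter is assumed to be ALetter, MidLetter and Extend/Format handling are left out, and everything else is lumped into "Other".

```python
# Characters that Unicode 5.1 assigns Word_Break=MidNumLet.
MID_NUM_LET = frozenset("\u0027\u002E\u2018\u2019\u2024\uFE52\uFF07\uFF0E")


def word_break_class(ch):
    """Very rough stand-in for a real Word_Break property lookup (assumption)."""
    if ch in MID_NUM_LET:
        return "MidNumLet"
    if ch.isalpha():
        return "ALetter"  # simplification: every letter counts as ALetter
    return "Other"


def split_words(text):
    """Split text using only WB5, WB6 and WB7; break at every other position."""
    if not text:
        return []
    cls = [word_break_class(ch) for ch in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        prev2 = cls[i - 2] if i >= 2 else None
        prev, cur = cls[i - 1], cls[i]
        nxt = cls[i + 1] if i + 1 < len(text) else None
        keep_together = (
            # WB5: ALetter x ALetter
            (prev == "ALetter" and cur == "ALetter")
            # WB6: ALetter x MidNumLet ALetter
            or (prev == "ALetter" and cur == "MidNumLet" and nxt == "ALetter")
            # WB7: ALetter MidNumLet x ALetter
            or (prev2 == "ALetter" and prev == "MidNumLet" and cur == "ALetter")
        )
        if not keep_together:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


for sample in ("can\u2019t", "hi.there", "mr.Hamster", "hi. there"):
    print(repr(sample), "->", split_words(sample))
```

Under these assumptions the first three samples each come back as a single piece, while "hi. there" is split at the dot and the space, because WB6 needs a letter immediately after the MidNumLet character.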
A possible (and simple) solution is to map some or all of those dot-like characters (FULL STOP, ONE DOT LEADER, SMALL FULL STOP, and FULLWIDTH FULL STOP) back to the MidNum Word_Break property value.

Another possible solution I see is to split ALetter into something like Upper, Lower, and OLetter; to map those dot-like characters to some new Term Word_Break property value (mostly the same as the Sentence_Break property values); and to extend the word breaking rules so that no breaks are allowed within sequences like < Upper x Term x Upper (Term)? > and < Lower x Term x Lower (Term)? > (possibly surrounded with < (¬(Upper | Lower | OLetter))* > ?).

What do you think?

P.S. I really wish unicode.org had a bug tracker so that one could report, search, and watch issues like this.

Kind regards,
Konstantin
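For anyone who wants to experiment with the simple option, a rough Python sketch follows. It compares the Unicode 5.1 mapping (dots as MidNumLet) with the suggested one (dots as MidNum) on a reduced rule set, WB5-WB12 without MidLetter or Extend/Format handling. The helpers make_classifier and split_words are hypothetical names for this illustration, the character classes are hand-picked rather than read from the Unicode data files, and the second option (Upper/Lower/Term) is not covered.

```python
DOTS = frozenset("\u002E\u2024\uFE52\uFF0E")         # the four dot-like characters
APOSTROPHES = frozenset("\u0027\u2018\u2019\uFF07")  # other MidNumLet characters in 5.1
ALNUM = ("ALetter", "Numeric")
LETTER_MID = ("MidNumLet",)        # MidLetter omitted in this sketch
NUMBER_MID = ("MidNum", "MidNumLet")


def make_classifier(dot_class):
    """Return a toy Word_Break lookup that maps the dots to dot_class."""
    def classify(ch):
        if ch in DOTS:
            return dot_class
        if ch in APOSTROPHES:
            return "MidNumLet"
        if ch.isdigit():
            return "Numeric"
        if ch.isalpha():
            return "ALetter"  # simplification: every letter counts as ALetter
        return "Other"
    return classify


def split_words(text, classify):
    """Split text using a reduced WB5-WB12 rule set; break everywhere else."""
    if not text:
        return []
    cls = [classify(ch) for ch in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        prev2 = cls[i - 2] if i >= 2 else None
        prev, cur = cls[i - 1], cls[i]
        nxt = cls[i + 1] if i + 1 < len(text) else None
        keep_together = (
            (prev in ALNUM and cur in ALNUM)                                        # WB5, WB8-WB10
            or (prev == "ALetter" and cur in LETTER_MID and nxt == "ALetter")       # WB6
            or (prev2 == "ALetter" and prev in LETTER_MID and cur == "ALetter")     # WB7
            or (prev2 == "Numeric" and prev in NUMBER_MID and cur == "Numeric")     # WB11
            or (prev == "Numeric" and cur in NUMBER_MID and nxt == "Numeric")       # WB12
        )
        if not keep_together:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


current = make_classifier("MidNumLet")  # Unicode 5.1 behaviour
proposed = make_classifier("MidNum")    # dots mapped back to MidNum

for sample in ("hi.there", "3.14", "can\u2019t"):
    print(repr(sample), split_words(sample, current), split_words(sample, proposed))
```

In this toy model only "hi.there" changes between the two mappings: it is one piece under the current mapping and three pieces ("hi", ".", "there") under the proposed one, while "3.14" and "can’t" stay whole either way.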

