To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. <t'> for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number <7> for glottal stop.)

Bill

On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy <[email protected]> wrote:

> On 4/Jun/2015 14:34 PM, Markus Scherer wrote:
>>
>> Looks all wrong to me.
>
> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your
> points below.
>
>> You can't use simple regular expressions to find word boundaries. That's
>> why we have UAX #29.
>
> And UAX #29 doesn't work for words which begin or end with apostrophes,
> whether represented by U+0027 or U+2019. It erroneously thinks there's a
> word boundary between the apostrophe and the rest of the word.
>
> But UAX #29 *would* work if the apostrophes were represented by U+02BC,
> which is what I'm suggesting.
>
>> Confusion between apostrophe and quoting -- blame the scribe who came up
>> with the ambiguous use, not the people who gave it a number.
>
> I'm not trying to blame anyone. I'm trying to fix the problem.
>
> I know this problem has a long history.
>
>> English is taught as that squiggle being punctuation, not a letter.
>
> I think we need to make a distinction between the colloquial usage of the
> word "punctuation" and the Unicode general category "punctuation" which has
> specific technical implications.
>
> I somewhat wish that Unicode had a separate category for "Things that look
> like punctuation but behave like letters", which might clear up this
> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF
> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are
> actually modifiers, into that category too.) But we don't. And the English
> apostrophe behaves like a letter, regardless of what your primary school
> teacher might have told you, so with the options available in Unicode, it
> needs to be classed as a letter.
>
>> "don’t" is a contraction of two words, it is not one word.
>
> This is utter nonsense. Should my spell-checker recognise "hasn't" as a
> valid word? Or should it consider "hasn't" to be the word "hasn" followed
> by the word "t", and then flag both of them as spelling errors?
>
> Is "fo'c'sle" the three separate words "fo", "c", and "sle"?
>
> The idea that words with apostrophes aren't valid words is a regrettable
> myth that exists in English, which has repeatedly led to the apostrophe
> being an afterthought in computing, leading to situations like this one.
>
>> If anything, Unicode might have made a mistake in encoding two of these
>> that look identical. How are normal users supposed to find both U+2019
>> and U+02BC on their keyboards, and how are they supposed to deal with
>> incorrect usage?
>
> Yeah, and there are fonts where I can't tell the difference between
> capital I and lower-case l. But my spell-checker will underline a word
> where I erroneously use an I instead of an l, and I imagine spell-checkers
> of the future could underline a word where I erroneously use a closing
> quote instead of an apostrophe, or vice versa.
>
> There are other possible solutions too, but I don't want to get into a
> discussion about UI design. I'll leave that to UI designers.
>
> - Ted
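
For anyone who wants to check the general-category point in the quoted message for themselves, here is a rough Python sketch. It is not an implementation of UAX #29 (the standard library does not provide one); it only prints the general category of the three candidate characters and runs a deliberately naive `\w+` tokenizer, the kind of simple regular expression the thread warns about. The helper names `general_category` and `naive_words` are made up for this illustration.

```python
import re
import unicodedata

# The three code points discussed in the thread.
APOSTROPHE_CANDIDATES = {
    "U+0027": "\u0027",  # APOSTROPHE
    "U+2019": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "U+02BC": "\u02bc",  # MODIFIER LETTER APOSTROPHE
}


def general_category(ch: str) -> str:
    """Return the Unicode general category (e.g. 'Po', 'Pf', 'Lm')."""
    return unicodedata.category(ch)


def naive_words(text: str) -> list[str]:
    """Split on runs of \\w characters.

    This is NOT UAX #29 segmentation, only a simple regex whose
    behaviour follows from whether a character counts as a letter.
    """
    return re.findall(r"\w+", text)


for label, ch in APOSTROPHE_CANDIDATES.items():
    word = f"hasn{ch}t"
    print(label, unicodedata.name(ch), general_category(ch), naive_words(word))
```

With U+0027 or U+2019 the naive tokenizer splits "hasn't" into "hasn" and "t", because both characters are in a punctuation category. With U+02BC the word stays in one piece, because that character is classed as a letter (Lm).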

