RE: New version of TR29:

Marco Cimarosti Wed, 14 Aug 2002 10:54:17 -0700

[[ I feel very ashamed to confess that I have found more errors in the
rules. :-(((
My humblest apologies: version 3 is attached. I have re-read it a few times
and found no more problems. ]]

Philipp Reichmuth wrote:
> Hello Marco,
> 
> Your definition of "LatinVowel" is problematic. Is "Y" only a vowel in
> French? In a word such as "yeux", it certainly is a consonant. Could
> this lead to problems?

I don't think so, but I'll wait for the opinion of French speakers.

What I can see is that things like "l'yaourt" [lja'ur] are normal in French
spelling, and are sometimes found in Italian as well ("l'yoghurt"
['ljogurt]).

Consonants [j] and [w] have the special status of "semivowels" in Romance
languages, which means that they often behave as vowels do, including in the
rules for elision.

In fact, I wondered whether "J" and "W" should also be included, to catch
some old Italian usages like "l'Jugoslavia" or "l'whisky".

> Defining such classes has the problem that they easily appear too
> general. The mere name "LatinVowel" looks too much like this class was
> supposed to contain all vowels of the Latin script regardless of
> language, but these wouldn't obviously be limited to your selection.
>
> You have to make this really clear. It is *so* tempting to assume that
> these are all the possible vowels that somebody is probably going to
> do it and base some completely non-apostrophe-related algorithm on it,
> just because he can easily extract this information from some Unicode
> data.

I assumed that only those few vowels would be used in French or Italian, in
combination with elision. That's why I have excluded things like Dutch "IJ"
or IPA vowels. Including these vowels would bring no benefit to French or
Italian, and would risk colliding with some unanticipated usage in another
language.

But, of course, I am aware that there are edge cases that will not be
captured in the general case. I have named one of these edge cases (the
Breton trigraph "c'h"), but it's not difficult to come up with more -- e.g.,
when the apostrophe is used as a diacritic applied to consonants (such as
the Wade-Giles romanization of Chinese "K'ang-hsi").

This is also true (and accounted for) with the current definition of the
UTR, but I found that the ubiquitous French and Italian "l'" and "d'", etc.
cannot be seen as "edge cases".

BTW, notice that I didn't include precomposed accented letters because I
understand UTR#29 works on NFD normalized text.
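
To illustrate the NFD point, here is a small Python sketch using the standard
unicodedata module (the variable names are only for illustration). It shows
that a precomposed letter such as U+00E9 decomposes into a plain base vowel
plus a combining mark, so listing the base letters is enough:

```python
import unicodedata

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, as in "l'école"
precomposed = "\u00e9"
decomposed = unicodedata.normalize("NFD", precomposed)

# List the code points of the NFD form: the base letter comes first,
# followed by the combining accent.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0065', 'U+0301']
```

The ligatures AE and OE have no canonical decomposition, which is why they
still have to be listed explicitly.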

> Better name them something less potentially misleading like
> ItalianFrenchVowel, if you need this character class - it also better
> reflects the purpose of the thing.

That's fine. I just wanted to suggest a possibility, not to substitute
the UTC's work in defining the precise wording of their documents.

However, "ItalianFrenchVowel" doesn't include Esperanto, Occitan and many
Italian and French dialects.

_ Marco

Proposal to accommodate French/Italian apostrophes in Unicode's UTR#29

author: Marco Cimarosti
date: August 14, 2002
version: 3

Rationale - The existing word-boundary rules of UTR#29 are designed to capture the meaning of apostrophes in English (and many other languages). That is, apostrophes normally occur inside a word, as in "don't" or "Marco's". The behavior of apostrophes is quite different in Italian and French (and other languages, e.g. Esperanto), where an apostrophe normally marks the deletion of the last vowel of a word that occurs before a word starting with a vowel, e.g. "d'Unicode" (d' from de = "of"), or "l'Angleterre" (l' from la = "the"). The two words are graphically joined (no space before or after the apostrophe). The apostrophe is part of the first word, and an implicit word break comes after it. Implementing this behavior in the default definition of UTR#29 is important to accommodate the needs of the large French and Italian speaking communities, as well as the needs of people writing in other languages, who often use loanwords or quotations from these popular languages.

Proposed heuristic - The present proposal is based on the observation that French-style "splitting" apostrophes are always followed by a vowel, whereas English-style "joining" apostrophes are normally followed by a consonant. The issue is complicated by the fact that both French and Italian have mute H's, which can interfere with the algorithm. The proposal defines three new character classes: LatinVowel (containing all vowels meaningful in French, Italian, and Esperanto), LatinH (containing only the letter H, in upper and lower case), and Apostrophe (containing the two characters used for the apostrophe). The characters contained in the new classes are removed from the classes where they used to belong (ALetter and MidLetter). The new classes are used to define two new rules (before current rule 6) for French-style apostrophes, which cover the cases "<consonant>'<vowel>" and "<consonant>'h<vowel>". Rules from 5 downwards are slightly changed because the former classes ALetter and MidLetter are now split into two or more classes.
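
As a rough illustration of the heuristic just described, the following Python sketch decides whether a given apostrophe is of the "splitting" kind. The function name and the use of str.isalpha() as a stand-in for ALetter are assumptions made for this example only; they are not part of the proposal, and the sketch ignores grapheme clusters and Format characters.

```python
# Character sets mirroring the proposed classes (assumed names).
VOWELS = set("AEIOUYaeiouy\u00c6\u00e6\u0152\u0153")  # LatinVowel
H_LETTERS = set("Hh")                                  # LatinH
APOSTROPHES = set("'\u2019")                           # Apostrophe


def is_splitting_apostrophe(text: str, i: int) -> bool:
    """Return True if text[i] is an apostrophe that a word break should follow."""
    if text[i] not in APOSTROPHES:
        return False
    # A consonant must come before: a letter that is not a listed vowel.
    if i == 0 or not text[i - 1].isalpha() or text[i - 1] in VOWELS:
        return False
    nxt = text[i + 1:i + 2]    # empty string at end of text
    nxt2 = text[i + 2:i + 3]
    if nxt in VOWELS:          # <consonant>'<vowel>
        return True
    # <consonant>'h<vowel>, i.e. a mute H before the vowel
    return nxt in H_LETTERS and nxt2 in VOWELS


print(is_splitting_apostrophe("l'Angleterre", 1))  # True
print(is_splitting_apostrophe("don't", 3))         # False
print(is_splitting_apostrophe("l'homme", 1))       # True
```

Note that the same function also returns True for the Breton "c'h" followed by a vowel, which is the kind of edge case that would need tailoring.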

Open issues - Although this proposal might enhance the handling of some common cases in two common languages, many edge cases remain that can only be solved by tailoring the algorithm. For instance, the "c'h" trigraph of the Breton language would unduly be split by the default definition when followed by a vowel.

Note - The proposed changes are concentrated in Table 2 (Default Word Boundaries). Proposed additions are colored in green, proposed deletions are colored in red, and existing text remains in black.

...

Table 2. Default Word Boundaries

Character Classes
Format	General_Category = Format (Cf)
Katakana	Script = KATAKANA, or Any of the following:
	U+30FC # KATAKANA-HIRAGANA PROLONGED SOUND MARK
	U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
	U+FF9E..U+FF9F # HALFWIDTH KATAKANA SOUND MARKS
LatinVowel	Any of the following:
	U+0041, U+0061 # LATIN CAPITAL/SMALL LETTER A
	U+0045, U+0065 # LATIN CAPITAL/SMALL LETTER E
	U+0049, U+0069 # LATIN CAPITAL/SMALL LETTER I
	U+004F, U+006F # LATIN CAPITAL/SMALL LETTER O
	U+0055, U+0075 # LATIN CAPITAL/SMALL LETTER U
	U+0059, U+0079 # LATIN CAPITAL/SMALL LETTER Y
	U+00C6, U+00E6 # LATIN CAPITAL/SMALL LETTER AE
	U+0152, U+0153 # LATIN CAPITAL/SMALL LIGATURE OE
LatinH	Any of the following:
	U+0048, U+0068 # LATIN CAPITAL/SMALL LETTER H
ALetter	Alphabetic = true, or Any of the following modifier letters:
	U+02B9..U+02BA # PRIME..DOUBLE PRIME
	U+02C2..U+02CF # LEFT ARROWHEAD..LOW ACUTE ACCENT
	U+02D2..U+02DF # CENTRED RIGHT HALF RING..CROSS ACCE
	U+02E5..U+02ED # EXTRA-HIGH TONE BAR..UNASPIRATED
	U+05F3 (׳) geresh
	and not Ideographic = true
	and not Katakana = true
	and not Script = Thai
	and not Script = Lao
	and not Script = Hiragana
	and not listed in LatinVowel
	and not listed in LatinH
Apostrophe	Any of the following:
	U+0027 (') apostrophe
	U+2019 (’) curly apostrophe
MidLetter	Any of the following:
	U+0027 (') apostrophe
	U+00AD () soft hyphen
	U+05F4 (״) gershayim
	U+2019 (’) curly apostrophe
MidNumLet	Any of the following:
	U+002E (.) period
	U+003A (:) colon (used in Swedish)
MidNum	Line_Break = Infix_Numeric and not MidNumLet = true
other	Other categories are from Line_Break (using the long names from PropertyAliases)

Rules
Break at the start and end of text.
sot	÷		(1)
	÷	eot	(2)
Treat a grapheme cluster as if it were a single character: the first character of the cluster.
GC	→	FB	(3)
Ignore interior Format characters. That is, ignore Format characters in all subsequent rules (except the last rule).
X Format*	→	X	(4)
Do not break between most letters.
(ALetter | LatinVowel | LatinH)	×	(ALetter | LatinVowel | LatinH)	(5)
Break after an apostrophe following a consonant and preceding a Latin vowel (possibly preceded by a mute H).
(ALetter | LatinH) Apostrophe	÷	LatinVowel	(5.a)
(ALetter | LatinH) Apostrophe	÷	LatinH LatinVowel	(5.b)
Do not break letters across certain punctuation.
(ALetter | LatinVowel | LatinH)	×	(MidLetter | MidNumLet | Apostrophe) (ALetter | LatinVowel | LatinH)	(6)
(ALetter | LatinVowel | LatinH) (MidLetter | MidNumLet | Apostrophe)	×	(ALetter | LatinH)	(7)
Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3').
Numeric	×	Numeric	(8)
(ALetter | LatinVowel | LatinH)	×	Numeric	(9)
Numeric	×	(ALetter | LatinVowel | LatinH)	(10)
Do not break within sequences like: '3.2' or '3,456.789'.
Numeric (MidNum | MidNumLet)	×	Numeric	(11)
Numeric	×	(MidNum | MidNumLet) Numeric	(12)
Do not break between Katakana.
Katakana	×	Katakana	(13)
Otherwise, break everywhere (including around ideographs).
Any	÷	Any	(14)
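
The letter and apostrophe rules above can be tried out with a minimal Python sketch. It covers only rules 5, 5.a, 5.b, 6, 7 and 14, and it makes several assumptions that are not in the proposal: the function names are invented, str.isalpha() approximates ALetter, the apostrophes are taken out of MidLetter as the rationale describes, and grapheme clusters, Format characters, digits and Katakana (rules 3, 4, 8 to 13) are ignored.

```python
VOWELS = set("AEIOUYaeiouy\u00c6\u00e6\u0152\u0153")
LETTER = {"ALetter", "LatinVowel", "LatinH"}
MID = {"MidLetter", "MidNumLet", "Apostrophe"}


def classify(ch):
    """Map one character to a (simplified) class from Table 2."""
    if ch in VOWELS:
        return "LatinVowel"
    if ch in "Hh":
        return "LatinH"
    if ch in "'\u2019":
        return "Apostrophe"
    if ch in "\u00ad\u05f4":
        return "MidLetter"
    if ch in ".:":
        return "MidNumLet"
    if ch.isalpha():          # rough stand-in for ALetter
        return "ALetter"
    return "Other"


def split_words(text):
    """Split text at the boundaries chosen by rules 5 to 7 and 14."""
    cls = [classify(ch) for ch in text]

    def at(k):
        return cls[k] if 0 <= k < len(cls) else None

    pieces, start = [], 0
    for i in range(1, len(text)):
        prev2, prev, cur, nxt = at(i - 2), at(i - 1), at(i), at(i + 1)
        after_cons_apos = prev2 in ("ALetter", "LatinH") and prev == "Apostrophe"
        if prev in LETTER and cur in LETTER:                          # (5)
            brk = False
        elif after_cons_apos and cur == "LatinVowel":                 # (5.a)
            brk = True
        elif after_cons_apos and cur == "LatinH" and nxt == "LatinVowel":  # (5.b)
            brk = True
        elif prev in LETTER and cur in MID and nxt in LETTER:         # (6)
            brk = False
        elif prev2 in LETTER and prev in MID and cur in ("ALetter", "LatinH"):  # (7)
            brk = False
        else:                                                         # (14)
            brk = True
        if brk:
            pieces.append(text[start:i])
            start = i
    if text:
        pieces.append(text[start:])
    return pieces


print(split_words("l'Angleterre"))  # ["l'", 'Angleterre']
print(split_words("don't"))         # ["don't"]
```

The rules are tested in the order listed, and the first one that matches decides the boundary, which is why 5.a and 5.b must come before rule 6.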

...
