On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham <[email protected]> wrote:
> On Thu, 17 May 2012 15:42:37 -0700
> Markus Scherer <[email protected]> wrote:
>
> > On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham <
> > [email protected]> wrote:
>
> > >> HOWEVER, you must *not* have the added contraction for 0F71+0F71.
>
> > If we don't have this prefix contraction, then we will miss a
> > discontiguous-contraction match on <0F71, 0334, 0F71, 0F72>.
>
> (a) <0F71, 0334, 0F71, 0F72> is not FCD.

Sorry, more coffee for me next time...

It's still possible to have FCD text that requires a discontiguous match
for the contraction 0F71+0F71+0F72. The text would add one more 0F71 at the
beginning, which would have to be skipped, but the match fails if the
prefix contraction is missing.

> (b) CE(<0F71, 0334, 0F71, 0F72>) = CE(0F71+0F72).CE(0334).CE(0F71).
>
> (c) Are you thinking of <0FB2, 0334, 0F71, 0F80>, with *REVERSED* I?

I wasn't specifically thinking of that...

> As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71, and
> so CE(<0FB2, 0334, 0F71, 0F80>) = CE(0FB2+0F80).CE(0334).CE(0F71), and a
> strictly non-normalising tailoring therefore needs a contraction
> for 0FB2+0334+0F71+0F80 = 0FB2+0334+0F81 to (i) strip the 0F80 from 0F81
> and (ii) prevent the contraction 0FB2+0F81.

Ok, but assuming we didn't add 0FB2+0F71, why can't we add the contraction
0FB2+0F81 and have the 0334 and any other non-starter be handled via
discontiguous matching?

And assuming we do add 0FB2+0F71 as requested in L2/12-131R, do we need
infinitely many overlap contractions?
See this spreadsheet
<https://docs.google.com/spreadsheet/pub?key=0Ag3w_MjvUEoRdFVabUR5elltX3pObXNYRnV5VWNiRGc&output=html>.

> lccc(0F73) = ccc(0F71) = 129
> rccc(0F73) = ccc(0F72) = 130
>
> However, if we do not allow 0F71,0F71,0F71,0F73 to contract as
> 0F71+0F73,0F71,0F71, we need infinitely many contractions to handle
> pure (albeit highly dubious) Tibetan. We have to treat 0F73 as not
> being blocked by 0F71.
>

This is not clear to me, but I see an issue which might be what you are
trying to say.

The DUCET has the contraction 0F71+0F72, and we should find a discontiguous
match on <0F71, 0F71, 0F71, 0F72>, skipping the two middle 0F71.
That string is equivalent to the FCD-passing string <0F71, 0F71, 0F73>, but
there is no 0F72 in sight there to complete the match if we don't modify
the string.

If we cannot find a way to handle this with a finite (actually, small)
amount of data, then we either have to decompose those three Tibetan
composite vowels before they reach the core collation code, or, frankly, we
just document a limitation for ICU and point to the fact that the use of
these three characters is "discouraged"
<http://unicode.org/charts/PDF/U0F00.pdf> and that they don't occur in any
normalized text (e.g., NFC).

The more I think about these, the more I believe I could live with such a
limitation. If we could get our code to support all of UCA, provide a dozen
runtime attributes, compare strings and return two kinds of sort keys, be
fast, and deliver correct results on all FCD input except if these three
characters are involved, I would be quite happy.

Maybe we could lobby to change these characters to be "strongly
discouraged" or "deprecated" or "too hard to implement"...

markus
--
Google Internationalization Engineering
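
P.S. To make the skipping and blocking cases in this thread concrete, here
is a minimal Python sketch of discontiguous contraction matching in the
spirit of UCA steps S2.1 to S2.1.3. It is an illustration only, not ICU
code: the function name `match`, the tiny `CCC` table, and the
representation of collation elements as tuples of code points are all
assumptions made for this example, and characters missing from the table
are simply treated as starters.

```python
# Combining classes for the code points discussed in this thread.
# Assumption: anything not listed is treated as a starter (ccc = 0).
CCC = {0x0334: 1, 0x0F71: 129, 0x0F72: 130, 0x0F80: 130,
       0x0F73: 0, 0x0F81: 0, 0x0FB2: 0}


def match(text, contractions):
    """Group code points into contraction matches (hypothetical helper).

    text is a sequence of code points; contractions is a set of tuples.
    Returns a list of tuples, one per match, instead of real CEs.
    """
    rest = list(text)
    out = []
    while rest:
        # S2.1: longest contiguous prefix that is a known contraction,
        # falling back to the single leading code point.
        n = 1
        for k in range(len(rest), 1, -1):
            if tuple(rest[:k]) in contractions:
                n = k
                break
        s = tuple(rest[:n])
        rest = rest[n:]

        # S2.1.1 to S2.1.3: try to extend s with following non-starters.
        max_skipped = 0  # highest ccc among characters skipped so far
        i = 0
        while i < len(rest):
            c = rest[i]
            cc = CCC.get(c, 0)
            if cc == 0:
                break  # a starter ends the search
            # c is unblocked only if every skipped character between
            # s and c has a strictly lower combining class.
            if cc > max_skipped and s + (c,) in contractions:
                s += (c,)
                del rest[i]  # c is consumed by the contraction
            else:
                max_skipped = max(max_skipped, cc)
                i += 1
        out.append(s)
    return out


DUCET_LIKE = {(0x0F71, 0x0F72)}  # only the contraction named above

# <0F71, 0F71, 0F71, 0F72>: the two middle 0F71 are skipped.
print(match([0x0F71, 0x0F71, 0x0F71, 0x0F72], DUCET_LIKE))
# <0F71, 0F71, 0F73>: 0F73 is a starter, so no 0F72 is ever seen.
print(match([0x0F71, 0x0F71, 0x0F73], DUCET_LIKE))
```

With only 0F71+0F72 in the table, the first call finds the contraction
followed by two lone 0F71, while the second call finds no contraction at
all, even though the two inputs are canonically equivalent.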

