On Sat, 2 Jun 2018 14:33:01 -0600 Doug Ewell via Unicode <unicode@unicode.org> wrote:
> Richard Wordingham wrote:
>
> >> What about U+200B ZWSP?
> >
> > Thanks for the suggestion, but it's not likely to work:
>
> Are you asking what schemes exist, or are you trying to call
> attention to some rendering engine and/or font that doesn't render a
> combination as it should?

I'm asking what exists, or is reasonably supposed to exist.

> This is too general for me to parse. Can you replace these
> hypotheticals with actual characters, using code points, or at least
> with actual General Categories? For example, an 'Mc' followed by ZWSP
> followed by an 'Lo' displays like such-and-so. The code points would
> be best.

On Sun, 3 Jun 2018 09:26:40 +0900
"Martin J. Dürst via Unicode" <unicode@unicode.org> wrote:

> My question goes a bit further than to Doug's: Why would you want to
> do such a thing? Are there actual scripts/languages where line breaks
> within grapheme clusters occur? If yes, what are there? Can you show
> actual examples, e.g. scans of documents,...?

Three examples are given on p230 of the dissertation "Buddhist Monks
and their Search for Knowledge: an examination of the personal
collection of manuscripts of Phra Khamchan Virachitto (1920-2007),
Abbot of Vat Saen Sukharam, Luang Prabang" by Bounleuth Sengsoulin,
available at
http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf .
The text is in Lao in the Tham script. The transcriptions in the text
are transliterated to the Lao script.

The first example, transliterated to Lao, is ເມຽ, which one could
encode as <U+0EC0 LAO VOWEL SIGN E, U+00AD SOFT HYPHEN, U+0EA1 LAO
LETTER MO, U+0EBD LAO SEMIVOWEL SIGN NYO>, provided the soft hyphen had
no visual representation beyond the line break. (Strictly, it's a
break for a hole for a string.) The third example is likewise ໄຫວ
<U+0EC4 LAO VOWEL SIGN AI, U+00AD SOFT HYPHEN, U+0EAB LAO LETTER HO
SUNG, U+0EA7 LAO LETTER WO>. (I can't make out the second example.)
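As a rough illustration of the Lao encodings just described, here is a minimal Python sketch that builds the two sequences from their code points and lists the character names. The helper names (describe, without_soft_hyphen) are invented for this sketch, and it says nothing about how any renderer would actually display the soft hyphen.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

# First and third examples, built from the code points listed above.
examples = {
    "first": "\u0ec0" + SOFT_HYPHEN + "\u0ea1\u0ebd",
    "third": "\u0ec4" + SOFT_HYPHEN + "\u0eab\u0ea7",
}

def describe(text):
    """Return 'U+XXXX NAME' for each code point in text (hypothetical helper)."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]

def without_soft_hyphen(text):
    """Drop the break opportunity, giving the unbroken spelling (hypothetical helper)."""
    return text.replace(SOFT_HYPHEN, "")

for label, text in examples.items():
    print(label, describe(text))
    print(label, "unbroken:", without_soft_hyphen(text))
```

The soft hyphen sits after the preposed vowel and before the consonant, which is where the manuscript line break falls.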
However, the text is actually in the Tham script, and without any
line-breaking controls, the first and third examples read, marking the
grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER MA,
U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E TAI
THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A4C TAI THAM LETTER LOW HA, U+1A60 TAI
THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL SIGN
AI>. The internal grapheme cluster boundaries are purely stopping
points for cursor movement; they correspond to nothing graphical and to
nothing in user conception. The natural internal boundaries are just
before the vowels, which are written on the left, and between the base
and subscript characters, i.e. before U+1A60.

There seem to be Northern Thai Pali examples in the proposal
L2/2007-007 at the end of
https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf Figure 9a Page
2 Line 3, and at the end of Figure 9b Page 1 Line 2, but I can't read
the Pali well enough to be sure that the apparent visually line-final
instances of TAI THAM VOWEL SIGN E are not just scribal blunders.

Reverting to Doug's reply:

> > Incidentally, does CLDR define the rendering of soft hyphen, or is
> > one entirely at the mercy of the application?

> Why would this be a CLDR thing?

Because the rendering is quite likely to depend on locale. I had
always understood that Thai did not mark breaks in words - and then I
discovered them in the Royal Institute Dictionary! The correct German
rendering of soft hyphens has recently changed. There are also subtle
effects when Dutch words are hyphenated. These rules are not the same
as for English, but Unicode tends not to deal in dependencies finer
than a script.

Richard.