> On 08 Feb 2016, at 20:10, Markus Scherer <[email protected]> wrote:
>
> On Mon, Feb 8, 2016 at 10:47 AM, James Tauber <[email protected]> wrote:
> Even with all this, though, my own work includes accentuation and
> syllabification algorithms, all of which are made more cumbersome by the lack
> of precomposed characters indicating vowel length. I'm currently leaning
> towards adding a layer of "character" processing on top of Python 3's
> otherwise decent support that effectively treats the relevant character
> sequences as single characters even if they aren't (and can't be precomposed).
>
> I suggest you normalize the text (NFC or NFD), and then look for "grapheme
> clusters". http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
>
> In C++ and Java, you could use an ICU BreakIterator for the latter.

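For Python 3, the normalize-then-segment suggestion quoted above might look roughly like the sketch below. It uses only the standard library, and it is an approximation: it attaches any character in a Unicode "Mark" category to the preceding character, which covers base-plus-diacritic sequences such as Greek vowels with macron and accent, but it is not the full set of UAX #29 rules. The function name `clusters` is made up for illustration. For the complete rules, something like the third-party `regex` module's `\X` or an ICU binding would be the better choice.

```python
import unicodedata


def clusters(text, form="NFC"):
    """Normalize text, then group each base character with the marks after it.

    Approximation only: a character whose general category starts with "M"
    (Mn, Mc, Me) is appended to the previous unit. This is NOT full UAX #29.
    """
    units = []
    for ch in unicodedata.normalize(form, text):
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch      # combining mark: extend the current unit
        else:
            units.append(ch)     # anything else starts a new unit
    return units


# alpha + combining macron + combining acute, then beta.
# There is no single precomposed character for alpha with macron and acute.
word = "\u03b1\u0304\u0301\u03b2"
units = clusters(word)
print(len(units), [len(u) for u in units])
```

Syllabification or accentuation code can then iterate over `units` instead of over individual code points.
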
Might I suggest looking at Rakudo Perl 6’s implementation of NFG (Normalization Form Grapheme), which generates synthetic codepoints on the fly under the hood? For an introduction, see:

http://jnthn.net/papers/2015-spw-nfg.pdf

Liz
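
P.S. To give a feel for the NFG idea in Python terms, here is a toy sketch. It is not how Rakudo implements it. It only illustrates the principle as I understand it: normalize to NFC, and give every multi-codepoint cluster its own synthetic (here: negative) integer, so that a string becomes a list with one integer per user-perceived character. The class name `SyntheticTable` and its methods are hypothetical, and the clustering is the simple "base plus combining marks" approximation rather than the full UAX #29 rules.

```python
import unicodedata


class SyntheticTable:
    """Toy model of NFG-style synthetics (illustrative only)."""

    def __init__(self):
        self._by_cluster = {}   # cluster string -> negative synthetic id
        self._by_id = []        # index (-id - 1) -> cluster string

    def _code_for(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)             # ordinary code point, kept as-is
        if cluster not in self._by_cluster:
            self._by_id.append(cluster)     # allocate a synthetic on the fly
            self._by_cluster[cluster] = -len(self._by_id)
        return self._by_cluster[cluster]

    def encode(self, text):
        """Return one integer per (approximate) grapheme cluster."""
        units = []
        for ch in unicodedata.normalize("NFC", text):
            if units and unicodedata.category(ch).startswith("M"):
                units[-1] += ch
            else:
                units.append(ch)
        return [self._code_for(u) for u in units]

    def decode(self, codes):
        """Turn a list of integers back into an NFC string."""
        return "".join(
            chr(c) if c >= 0 else self._by_id[-c - 1] for c in codes
        )


table = SyntheticTable()
codes = table.encode("\u03b1\u0304\u0301\u03b2")  # alpha+macron+acute, beta
print(len(codes), codes[0] < 0)
```

Indexing, length and slicing on `codes` then work per character, which is the property the accentuation and syllabification code wants.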

