> On 08 Feb 2016, at 20:10, Markus Scherer wrote:
> 
> On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote:
> > Even with all this, though, my own work includes accentuation and
> > syllabification algorithms, all of which are made more cumbersome by the
> > lack of precomposed characters indicating vowel length. I'm currently
> > leaning towards adding a layer of "character" processing on top of
> > Python 3's otherwise decent support that effectively treats the relevant
> > character sequences as single characters even if they aren't (and can't
> > be) precomposed.
> 
> I suggest you normalize the text (NFC or NFD), and then look for "grapheme 
> clusters".  http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> 
> In C++ and Java, you could use an ICU BreakIterator for the latter.

Might I suggest looking at Rakudo Perl 6’s implementation of NFG (Normalization 
Form Grapheme), which generates synthetic codepoints on the fly under the 
hood, so that each grapheme cluster is handled as a single character?

For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf
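
To make the idea concrete for the Python 3 case James mentions, here is a
minimal sketch of the two suggestions in this thread: normalize to NFC, group
each base character with its trailing combining marks, and hand out a
synthetic integer for every cluster that has no precomposed form. It is only
an illustration under stated assumptions and not how Rakudo does it: the names
clusters and SyntheticTable are hypothetical, the grouping rule is a rough
approximation of the TR29 grapheme cluster boundaries (it only looks at
general category M*), and using negative integers for the synthetics is just
a choice made for this sketch.

```python
import unicodedata


def clusters(text):
    """Split text into approximate grapheme clusters.

    Assumption: a cluster is one base character plus any following
    characters whose general category starts with "M" (marks). This is
    a simplification of the full UAX #29 rules.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch          # attach the mark to the current cluster
        else:
            out.append(ch)         # start a new cluster
    return out


class SyntheticTable:
    """Hypothetical table mapping multi-codepoint clusters to synthetic ids."""

    def __init__(self):
        self._ids = {}        # cluster string -> negative synthetic id
        self._clusters = []   # index (-id - 1) -> cluster string

    def encode(self, text):
        """Return one integer per cluster: a real codepoint or a synthetic."""
        codes = []
        for cluster in clusters(text):
            if len(cluster) == 1:
                codes.append(ord(cluster))
            else:
                if cluster not in self._ids:
                    self._clusters.append(cluster)
                    self._ids[cluster] = -len(self._clusters)
                codes.append(self._ids[cluster])
        return codes

    def decode(self, codes):
        """Turn the integers back into an NFC string."""
        return "".join(
            chr(n) if n >= 0 else self._clusters[-n - 1] for n in codes
        )


# Alpha + combining macron + combining acute, followed by beta.
# NFC composes alpha + macron, but the acute has nothing to compose with,
# so the first cluster still needs a synthetic.
table = SyntheticTable()
text = "\u03b1\u0304\u0301\u03b2"
codes = table.encode(text)
```

With this, len(codes) counts "characters" the way an accentuation or
syllabification algorithm would want, and indexing into codes never lands in
the middle of a vowel-plus-diacritics sequence.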



Liz