On Fri, 13 Jan 2017 01:34:48 -0800 Asmus Freytag <asm...@ix.netcom.com> wrote:

> I believe that any attempt to define a "regex" that describes *all
> legal text* in a given script is a-priori doomed to failure.
>
> Part of the problem is that writing systems work not unlike human
> grammars in a curious mixture of pretty firm rules coupled to lists
> of exceptions. (Many texts by competent authors will contain
> "ungrammatical" sentences that somehow work despite or because of not
> following the standard rules). The Khmer issue that started the
> discussion showed that there can be a single word that needs to be
> handled exceptionally.

It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp. 20-21 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner of
writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century. The Thai
Wikipedia page on the use of the script for Thai
(https://th.wikipedia.org/wiki/อักษรขอมไทย) gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង = ทั้ง).

> If you try to capture all the exceptions in the general rules, the
> set of rules gets complicated, but is also likely to be way too
> permissive to be useful.

If it is checking for proper use of code points, overgeneration is far
preferable to undergeneration.

> The Khmer LGR for the Root Zone, for example, deliberately disallows
> the exception (in the word for "give") so that it can be stated (a)
> more compactly and (b) does not allow the exceptional sequencing of
> certain characters to become applicable outside the single exception.
>
> An LGR is concerned with *single* instances of each word. Even the
> most common word in a language can only be registered once in each
> zone.

A label does not have to be a single word. For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

> Even if the BNFs did nothing more than capture succinctly the
> information presented in text and tables, they would be useful.
> For scripts where things like ZWJ and CGJ are optional, it doesn't
> make sense to run them into the standard BNF - that just messes
> things up. It is much more useful to provide generic context
> information of how to add them to existing text.
> For example, the CGJ is really intended to go between letters. So,
> describe that context.

It can be quite useful next to combining marks. For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.

Richard.
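
To illustrate the point about CGJ next to combining marks, here is a
minimal Python sketch using only the standard unicodedata module. The
base letter 'a' and the variable names are illustrative assumptions,
and the sketch does not say which of the two sequences a given
convention would assign to the umlaut and which to the diaeresis. It
only shows that a CGJ placed before U+0308 keeps the sequence distinct
from the plain one, even after normalization.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS (also used for the umlaut mark)


def codepoints(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Plain sequence: base letter followed directly by the combining mark.
plain = "a" + DIAERESIS
# Marked sequence: a CGJ sits between the base letter and the mark.
marked = "a" + CGJ + DIAERESIS

# NFC composes the plain sequence into the precomposed letter, but CGJ
# has combining class 0 and takes part in no composition, so the
# marked sequence passes through unchanged.
plain_nfc = unicodedata.normalize("NFC", plain)
marked_nfc = unicodedata.normalize("NFC", marked)

print(codepoints(plain_nfc))
print(codepoints(marked_nfc))
```

Because the two spellings stay distinct under normalization, a
checker or font could in principle treat them differently, which is
why describing the CGJ's context only as "between letters" would be
too narrow.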