On 1/13/2017 9:47 AM, Richard Wordingham wrote:
On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag <[email protected]> wrote:

I believe that any attempt to define a "regex" that describes *all
legal text* in a given script is a priori doomed to failure.

Part of the problem is that writing systems work not unlike human
grammars in a curious mixture of pretty firm rules coupled to lists
of exceptions. (Many texts by competent authors will contain
"ungrammatical" sentences that somehow work despite or because of not
following the standard rules). The Khmer issue that started the
discussion showed that there can be a single word that needs to be
handled exceptionally.
It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century.

This points to another interesting issue. A number of languages have seen orthographic reforms that affect the use of complex scripts.

Now then, a decision: do you support both the old and the new style in the same rule-set? If vestiges remain in general use, you may not have a choice, but what if the rules for old and new (or for different languages in the same script) actually conflict?

The Thai
Wikipedia page on the use of the script for Thai
(https://th.wikipedia.org/wiki/อักษรขอมไทย) gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង​ = ทั้ง).

In the case that I cited, that combination of language/script was taken as out of scope for other reasons; now, for general text, are there situations where you'd want separate sets of rules for each language?

If you try to capture all the exceptions in the general rules, the
set of rules gets complicated, but is also likely to be way too
permissive to be useful.
If it is checking for proper use of code points, overgeneration is far
preferable to undergeneration.

Agreed. For modeling general text you don't want to actually exclude anything that can occur. However, what can you exclude?

If you think of spell-checking as a scenario, overgeneration is not acceptable. Instead, you have a standard dictionary that covers the "general vocabulary", plus a well-defined mechanism that allows the user to add "exceptions".
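That dictionary-plus-exceptions model can be sketched in a few lines
(the names and the toy word list here are hypothetical, purely for
illustration):

```python
# Minimal sketch of a spell-checker's acceptance test: a fixed standard
# dictionary plus a user-editable exception list. Names are hypothetical.
STANDARD_DICTIONARY = {"give", "given", "well"}  # stands in for the shipped list
user_exceptions = set()                          # words the user has accepted

def is_accepted(word):
    """A word passes if either the standard dictionary or the user's
    exception list contains it; everything else is flagged."""
    return word in STANDARD_DICTIONARY or word in user_exceptions

# The user adds an "exception" via the well-defined mechanism:
user_exceptions.add("givewell")
```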

My point is that you cannot design a ruleset without having a very well-defined use-case. If you divide the rule sets into "building blocks" then it may be easier to address different use cases than if you simply provide a "maximally permissive" set of rules.

I'm skeptical that a one-size-fits-all set of rules can be devised that would also be useful.

For rules that strongly err on the side of overgeneration, it might make more sense to simply define the few contexts that are deemed impermissible and set the rest to "anything goes".
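That "anything goes except" approach might look like the following
sketch; the two blacklisted contexts are illustrative stand-ins, not
an actual statement of Khmer rules:

```python
import re

# Hypothetical blacklist of impermissible contexts; everything not
# matched here is permitted by default. (Illustrative only -- real
# contexts would come from the script's own rules.)
IMPERMISSIBLE = [
    re.compile("\u17D2\u17D2"),  # two KHMER SIGN COENG in a row
    re.compile("^\u17D2"),       # COENG with no preceding base consonant
]

def is_permissible(text):
    """Allow by default; reject only if a blacklisted context occurs."""
    return not any(p.search(text) for p in IMPERMISSIBLE)
```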

The Khmer LGR for the Root Zone, for example, deliberately disallows
the exception (in the word for "give") so that it (a) can be stated
more compactly and (b) does not allow the exceptional sequencing of
certain characters to become applicable outside the single exception.

An LGR is concerned with *single* instances of each word. Even the
most common word in a language can only be registered once in each
zone.
A label does not have to be a single word.  For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

Correct, but each compound can still occur only once. I cite this example only because the local body that drafted the rules decided that there was a reasonable tradeoff (complexity vs. generality) for the purpose of top level domain names (i.e. ".give*" not "give*.com").

For that application, complexity has a relatively high negative weight associated with it, and complete coverage, while desirable, is not given the same high positive weight that it would have in describing ordinary text.

Even if the BNFs did nothing more than capture succinctly the
information presented in text and tables, they would be useful.
For scripts where things like ZWJ and CGJ are optional, it doesn't
make sense to run them into the standard BNF - that just messes
things up. It is much more useful to provide generic context
information of how to add them to existing text.
For example, the CGJ is really intended to go between letters. So,
describe that context.

(Forgot to make clear that this was a bit of a hypothetical)

It can be quite useful next to combining marks.  For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.

Even if it is intended to go anywhere, even between digits, symbols and punctuation, it's much easier to describe that behavior separately rather than trying to insert it in every location in every regex. What I'm thinking of is a description that gives a "skeleton word", and you then state that this skeleton can be decorated (or whatever your preferred term is) by inserting a CGJ anywhere.
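That decoration scheme can be sketched by stripping the CGJ before
matching the skeleton; the skeleton regex below is a placeholder
(any run of letters), standing in for a real script-specific rule:

```python
import re

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Placeholder skeleton rule: one or more letters. A real rule set
# would substitute the script's actual word-shape regex here.
SKELETON = re.compile(r"^[^\W\d_]+$")

def matches_decorated(word):
    """Treat CGJ as freely insertable decoration: remove every CGJ,
    then check whether the remaining skeleton matches the rule."""
    return bool(SKELETON.match(word.replace(CGJ, "")))
```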

The same goes for ZWJ/ZWNJ in any script where they don't have a recognized specific effect in particular sequences.
