I believe that any attempt to define a "regex" that describes *all legal text* in a given script is a priori doomed to failure.

Part of the problem is that writing systems, not unlike human grammars, work as a curious mixture of fairly firm rules coupled with lists of exceptions. (Many texts by competent authors contain "ungrammatical" sentences that somehow work despite, or because of, not following the standard rules.) The Khmer issue that started this discussion showed that there can be a single word that needs to be handled exceptionally.

If you try to capture all the exceptions in the general rules, the set of rules not only gets complicated but is also likely to become far too permissive to be useful.

The Khmer LGR for the Root Zone, for example, deliberately disallows the exception (in the word for "give") so that the rule set (a) can be stated more compactly and (b) does not allow the exceptional sequencing of certain characters to become applicable outside that single exception.

An LGR is concerned with *single* instances of each word: even the most common word in a language can be registered only once in each zone. Therefore, such a drastic treatment is a perfectly good solution. For a rendering engine, you'd want to be much more permissive, perhaps even attempting to display patently "wrong" sequences. For a validation tool (spell checker) you would strive for some other sweet spot. Finally, to determine the "first word" or "first syllable" for formatting purposes (such as "drop caps") there may be yet another selection.

As a result, I believe it would be most useful if a regex or BNF could be created for the "typical" / "idealized" description of a "word" in the various scripts.

Then, depending on the use case in question, the BNF could be augmented with more or less formalized descriptions of variations, exceptions, etc.

The idea would be to provide "building blocks" that can be used to assemble rules tailored to various scenarios by the reader of the standard. (Because of that, they should be part of the description section, not a data file...)
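As a rough sketch of what such building blocks might look like: the fragments below are illustrative approximations only (the character ranges are a crude subset of Khmer, not a real LGR or an exhaustive syllable model), but they show how the same pieces can be assembled into a strict rule for one scenario and a more permissive one for another.

```python
import re

# Illustrative building blocks (hypothetical, simplified ranges):
CONSONANT = "[\u1780-\u17A2]"   # Khmer consonants
VOWEL_SIGN = "[\u17B6-\u17C5]"  # dependent vowel signs
COENG = "\u17D2"                 # subscript-consonant marker

# "Idealized" syllable: consonant, optional subscript consonant,
# optional dependent vowel.
SYLLABLE = f"{CONSONANT}(?:{COENG}{CONSONANT})?{VOWEL_SIGN}?"

# Strict "word" for a registration-style scenario: syllables only.
STRICT_WORD = re.compile(f"^(?:{SYLLABLE})+$")

# Permissive variant for a rendering-style scenario: tolerate a
# stray trailing vowel sign that the strict rule would reject.
PERMISSIVE_WORD = re.compile(f"^(?:{SYLLABLE})+{VOWEL_SIGN}?$")

print(bool(STRICT_WORD.match("\u1780\u17D2\u1798\u17C2")))      # True
print(bool(STRICT_WORD.match("\u1780\u17B6\u17B7")))            # False
print(bool(PERMISSIVE_WORD.match("\u1780\u17B6\u17B7")))        # True
```

The point is not the particular ranges but the composition: each scenario reuses the same named fragments and tightens or loosens only the assembly.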

Even if the BNFs did nothing more than capture succinctly the information presented in text and tables, they would be useful.

For scripts where things like ZWJ and CGJ are optional, it doesn't make sense to fold them into the standard BNF; that just muddies things. It is much more useful to provide generic context information describing how to add them to existing text.

For example, the CGJ is really intended to go between letters. So, describe that context.
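A minimal sketch of such a context check (the function name and the "letters only" criterion are my simplification of the idea above, not a normative rule): instead of threading CGJ through a word-level BNF, validate each occurrence against its local neighbors.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def cgj_in_letter_context(text: str) -> bool:
    """Check that every CGJ occurs strictly between two letters.

    This expresses the CGJ rule as a local context, leaving the
    word-level grammar untouched.
    """
    for i, ch in enumerate(text):
        if ch == CGJ:
            # CGJ at the start or end of text has no letter on one side.
            if i == 0 or i == len(text) - 1:
                return False
            before = unicodedata.category(text[i - 1])
            after = unicodedata.category(text[i + 1])
            # General category "L*" = letters.
            if not (before.startswith("L") and after.startswith("L")):
                return False
    return True

print(cgj_in_letter_context("ab\u034Fcd"))   # True
print(cgj_in_letter_context("\u034Fab"))     # False
```

Because the check is purely local, it composes with any word-level rule without changing it.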

Overall, describing the local contexts for a given character or class of characters has proven to be more useful in the LGR project than attempting to write global rules.


On 1/13/2017 1:02 AM, Richard Wordingham wrote:
> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕️ <m...@macchiato.com> wrote:
>
>> Latin is not a complex script,...
>
> Unlike the common script, which notably has U+2044 FRACTION SLASH.
>
> That statement is actually dubious from a typographical point of view.
>
>> ...so it was only an illustration.
>
> But it's good for looking for the non-obvious issues.
>
>> A more serious effort would look at some of the issues from
>> http://unicode.org/reports/tr29/, for example.
>
> I don't think we want to have to repeat them all for each script.
> Putting common-script punctuation and numbers in the regex will add
> obscurity, and possibly be a maintainability issue.
