I believe that any attempt to define a "regex" that describes *all legal
text* in a given script is a priori doomed to failure.
Part of the problem is that writing systems work much like human
grammars: a curious mixture of fairly firm rules coupled with lists of
exceptions. (Many texts by competent authors will contain
"ungrammatical" sentences that somehow work despite or because of not
following the standard rules). The Khmer issue that started the
discussion showed that there can be a single word that needs to be
treated as an exception.
If you try to capture all the exceptions in the general rules, the set
of rules gets complicated, but is also likely to be way too permissive
to be useful.
The Khmer LGR for the Root Zone, for example, deliberately disallows the
exception (in the word for "give") so that the rules (a) can be stated
more compactly and (b) do not allow the exceptional sequencing of certain
characters to become applicable outside the single exception.
An LGR is concerned with *single* instances of each word. Even the most
common word in a language can only be registered once in each zone.
Therefore, such a drastic treatment is a perfectly good solution. For a
rendering engine, you'd want to be much more permissive, perhaps even
attempt to display patently "wrong" sequences. For a validation tool
(spell checker) you would aim for some other sweet spot. Finally, to
determine "first word" or "first syllable" for formatting purposes (such
as "drop caps") there may yet be a different selection.
As a result, I believe it would be most useful if a regex or BNF could
be created for the "typical" / "idealized" description of a "word" in
the various scripts.
Then, depending on the facts in question, the BNF could be augmented
with more or less formalized descriptions of variations, exceptions, etc.
The idea would be to provide "building blocks" that can be used to
assemble rules tailored to various scenarios by the reader of the
standard. (Because of that, they should be part of the description
section, not a data file...)
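To make the "building blocks" idea concrete, here is a minimal sketch in
Python. It is not the Khmer LGR and not a complete description of Khmer:
the character classes are rough code point ranges picked for illustration,
and the limit of two subscript consonants, the names STRICT_SYLLABLE and
PERMISSIVE_SYLLABLE, and the helper word_pattern are all my own
assumptions. The point is only that the same few blocks can be assembled
into a tight rule for one scenario and a loose rule for another.

```python
import re

# Building blocks (simplified, illustrative ranges; not a full Khmer model).
CONSONANT = "[\u1780-\u17A2]"   # Khmer consonant letters
COENG = "\u17D2"                # KHMER SIGN COENG (subscript former)
VOWEL = "[\u17B6-\u17C5]"       # dependent vowel signs
SIGN = "[\u17C6-\u17D1]"        # various signs
SUBSCRIPT = f"(?:{COENG}{CONSONANT})"

# "Idealized" syllable: at most two subscripts (an assumption made for this
# sketch), then an optional vowel and an optional sign, in that order.
STRICT_SYLLABLE = f"{CONSONANT}{SUBSCRIPT}{{0,2}}{VOWEL}?{SIGN}?"

# Permissive variant, e.g. for a renderer: any number of subscripts,
# vowels and signs in any order after the base consonant.
PERMISSIVE_SYLLABLE = f"{CONSONANT}{SUBSCRIPT}*(?:{VOWEL}|{SIGN})*"


def word_pattern(syllable: str) -> re.Pattern:
    """Assemble a 'word' rule as one or more syllables (hypothetical helper)."""
    return re.compile(f"(?:{syllable})+")


strict = word_pattern(STRICT_SYLLABLE)
permissive = word_pattern(PERMISSIVE_SYLLABLE)

well_formed = "\u1780\u17D2\u179A\u17BB"   # consonant, coeng+consonant, vowel
doubled_vowel = "\u1780\u17B6\u17B6"       # consonant followed by two vowels

print(bool(strict.fullmatch(well_formed)))        # True
print(bool(strict.fullmatch(doubled_vowel)))      # False
print(bool(permissive.fullmatch(doubled_vowel)))  # True
```

A registry-style rule would start from the strict pattern and perhaps
tighten it further; a rendering engine would start from the permissive
one. Exceptions would be layered on top rather than folded into the blocks.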
Even if the BNFs did nothing more than capture succinctly the
information presented in text and tables, they would be useful.
For scripts where things like ZWJ and CGJ are optional, it doesn't make
sense to work them into the standard BNF - that just messes things up. It
is much more useful to provide generic context information of how to add
them to existing text.
For example, the CGJ is really intended to go between letters. So,
describe that context.
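As a rough illustration of describing such a context separately from the
main BNF, the sketch below checks only the neighbors of each CGJ. It
takes "between letters" literally and assumes that "letter" means
General_Category L*; that reading, and the function names is_letter and
misplaced_cgj, are my own choices for the example, not anything defined
by the standard.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def is_letter(ch: str) -> bool:
    """Assumption for this sketch: a 'letter' is any General_Category L*."""
    return unicodedata.category(ch).startswith("L")


def misplaced_cgj(text: str) -> list[int]:
    """Return the indices of CGJs that are not directly between two letters."""
    bad = []
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        # A CGJ at either end of the text has no neighbor on that side.
        if not (before and after and is_letter(before) and is_letter(after)):
            bad.append(i)
    return bad


print(misplaced_cgj("a\u034Fb"))   # []
print(misplaced_cgj("\u034Fab"))   # [0]
print(misplaced_cgj("a\u034F1"))   # [1]
```

The check never needs to know what a well-formed word looks like, which
is why it can be applied on top of whatever base rule a given scenario
uses.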
Overall, describing the local contexts for a given character or class of
characters has proven to be more useful in the LGR project than
attempting to write global rules.
On 1/13/2017 1:02 AM, Richard Wordingham wrote:
On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ <m...@macchiato.com> wrote:
Latin is not a complex script,...
Unlike the common script, which notably has U+2044 FRACTION SLASH.
That statement is actually dubious from a typographical point of view.
...so it was only an illustration.
But it's good for looking for the non-obvious issues.
A more serious effort would look at some of the issues from
http://unicode.org/reports/tr29/, for example.
I don't think we want to have to repeat them all for each script.
Putting common-script punctuation and numbers in the regex will add
obscurity, and possibly be a maintainability issue.