Re: Specification of Encoding of Plain Text

Asmus Freytag Fri, 13 Jan 2017 01:40:26 -0800

I believe that any attempt to define a "regex" that describes *all legal text* in a given script is a-priori doomed to failure.

Part of the problem is that writing systems work not unlike human grammars, in a curious mixture of pretty firm rules coupled to lists of exceptions. (Many texts by competent authors will contain "ungrammatical" sentences that somehow work despite, or because of, not following the standard rules.) The Khmer issue that started the discussion showed that there can be a single word that needs to be handled exceptionally.

If you try to capture all the exceptions in the general rules, the set of rules gets complicated, and is also likely to be way too permissive to be useful.

The Khmer LGR for the Root Zone, for example, deliberately disallows the exception (in the word for "give") so that it (a) can be stated more compactly and (b) does not allow the exceptional sequencing of certain characters to become applicable outside the single exception.

An LGR is concerned with *single* instances of each word. Even the most common word in a language can only be registered once in each zone. Therefore, such a drastic treatment is a perfectly good solution. For a rendering engine, you'd want to be much more permissive, perhaps even attempt to display patently "wrong" sequences. For a validation tool (spell checker) you would aim for some other sweet spot. Finally, to determine "first word" or "first syllable" for formatting purposes (such as "drop caps") there may yet be a different selection.

As a result, I believe it would be most useful if a regex or BNF could be created for the "typical" / "idealized" description of a "word" in the various scripts.

Then, depending on the facts in question, the BNF could be augmented with more or less formalized descriptions of variations, exceptions, etc.

The idea would be to provide "building blocks" that can be used to assemble rules tailored to various scenarios by the reader of the standard. (Because of that, they should be part of the description section, not a data file...)
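To make the "building blocks" idea concrete, here is a minimal sketch in Python. The character classes are a deliberately simplified reading of the Khmer block (consonants, COENG plus consonant as a subscript, dependent vowels, signs); they are my assumption for illustration and are not the Khmer LGR or any rule from the standard. The names word_pattern, STRICT and PERMISSIVE are hypothetical.

```python
import re

# Building blocks: simplified, illustrative character classes (assumptions,
# not the actual Khmer LGR or a rule taken from the Unicode Standard).
CONSONANT = "[\u1780-\u17A2]"
SUBSCRIPT = "\u17D2" + CONSONANT   # COENG followed by a consonant
VOWEL = "[\u17B6-\u17C5]"          # dependent vowel signs
SIGN = "[\u17C6-\u17D1]"           # various signs


def word_pattern(max_subscripts, max_signs):
    """Assemble an 'idealized word' rule from the blocks.

    The limits are the scenario-specific part: the same blocks give a
    tight rule or a loose one depending on the caller's needs.
    """
    syllable = (
        f"{CONSONANT}(?:{SUBSCRIPT}){{0,{max_subscripts}}}"
        f"{VOWEL}?{SIGN}{{0,{max_signs}}}"
    )
    return re.compile(f"(?:{syllable})+")


# Hypothetical scenarios: a tight rule (label-like) and a loose one
# (closer to what a rendering engine might tolerate).
STRICT = word_pattern(max_subscripts=1, max_signs=1)
PERMISSIVE = word_pattern(max_subscripts=3, max_signs=2)


def is_word(pattern, text):
    """True if the whole string matches the assembled rule."""
    return pattern.fullmatch(text) is not None
```

A sequence with two stacked subscripts would be rejected by STRICT and accepted by PERMISSIVE, without either rule having to mention the other's exceptions.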

Even if the BNFs did nothing more than capture succinctly the information presented in text and tables, they would be useful.

For scripts where things like ZWJ and CGJ are optional, it doesn't make sense to run them into the standard BNF - that just messes things up. It is much more useful to provide generic context information of how to add them to existing text.

For example, the CGJ is really intended to go between letters. So, describe that context.

Overall, describing the local contexts for a given character or class of characters has proven to be more useful in the LGR project than attempting to write global rules.
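As a rough illustration of a local context rule, the sketch below checks only the immediate neighbours of each CGJ (U+034F) and says nothing about the rest of the string. Treating "letter" as "general category starting with L" is my simplifying assumption, and the function names are hypothetical; an actual LGR context rule would be stated in its own formalism.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def is_letter(ch):
    # Assumption: "letter" is approximated by general category L*.
    return unicodedata.category(ch).startswith("L")


def cgj_contexts_ok(text):
    """True if every CGJ in text sits directly between two letters."""
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if not (before and after and is_letter(before) and is_letter(after)):
            return False
    return True
```

The point of the design is that the check is independent of any global "word" rule: it can be layered on top of whatever BNF or regex is in use for the script.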


A./

On 1/13/2017 1:02 AM, Richard Wordingham wrote:

On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ <[email protected]> wrote:

Latin is not a complex script,...

Unlike the common script, which notably has U+2044 FRACTION SLASH.

That statement is actually dubious from a typographical point of view.

...so it was only an illustration.

But it's good for looking for the non-obvious issues.

A more serious effort would look at some of the issues from
http://unicode.org/reports/tr29/, for example.

I don't think we want to have to repeat them all for each script.
Putting common-script punctuation and numbers in the regex will add
obscurity, and possibly be a maintainability issue.

Richard.
