On 1/13/2017 9:47 AM, Richard Wordingham wrote:
On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag <[email protected]> wrote:

I believe that any attempt to define a "regex" that describes *all
legal text* in a given script is a priori doomed to failure.

Part of the problem is that writing systems work not unlike human
grammars in a curious mixture of pretty firm rules coupled to lists
of exceptions. (Many texts by competent authors will contain
"ungrammatical" sentences that somehow work despite or because of not
following the standard rules). The Khmer issue that started the
discussion showed that there can be a single word that needs to be
handled exceptionally.
It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century.

This points to another interesting issue. A number of languages have seen orthographic reforms that affect the use of complex scripts.

Now then, a decision: do you support both the old and the new style in the same rule-set? If vestiges remain in general use, you may not have a choice, but what if the rules for old and new (or for different languages in the same script) actually conflict?

The Thai
Wikipedia page on the use of the script for Thai
(https://th.wikipedia.org/wiki/อักษรขอมไทย) gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង​ = ทั้ง).

In the case that I cited, that combination of language/script was taken as out of scope for other reasons; now, for general text, are there situations where you'd want separate sets of rules for each language?

If you try to capture all the exceptions in the general rules, the
set of rules gets complicated, but is also likely to be way too
permissive to be useful.
If it is checking for proper use of code points, overgeneration is far
preferable to undergeneration.

Agreed. For modeling general text you don't want to actually exclude anything that can occur. However, what can you exclude?

If you think of spell-checking as a scenario, overgeneration is not acceptable. Instead, you have a standard dictionary that covers the "general vocabulary", plus a well-defined mechanism that allows the user to add "exceptions".
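That dictionary-plus-exceptions model can be sketched in a few lines
(the names and the toy word list here are hypothetical, purely for
illustration):

```python
# Minimal sketch of a spell-checker's acceptance test: a fixed standard
# dictionary plus a user-editable exception list. Names are hypothetical.
STANDARD_DICTIONARY = {"give", "given", "well"}  # stands in for the shipped list
user_exceptions = set()                          # words the user has accepted

def is_accepted(word):
    """A word passes if either the standard dictionary or the user's
    exception list contains it; everything else is flagged."""
    return word in STANDARD_DICTIONARY or word in user_exceptions

# The user adds an "exception" via the well-defined mechanism:
user_exceptions.add("givewell")
```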

My point is that you cannot design a ruleset without having a very well-defined use-case. If you divide the rule sets into "building blocks" then it may be easier to address different use cases than if you simply provide a "maximally permissive" set of rules.

I'm skeptical that a one-size-fits-all set of rules can be devised that would also be useful.

For rules that strongly err on the side of overgeneration, it might make more sense to simply define the few contexts that are deemed impermissible and set the rest to "anything goes".
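That "anything goes except" approach might look like the following
sketch; the two blacklisted contexts are illustrative stand-ins, not
an actual statement of Khmer rules:

```python
import re

# Hypothetical blacklist of impermissible contexts; everything not
# matched here is permitted by default. (Illustrative only -- real
# contexts would come from the script's own rules.)
IMPERMISSIBLE = [
    re.compile("\u17D2\u17D2"),  # two KHMER SIGN COENG in a row
    re.compile("^\u17D2"),       # COENG with no preceding base consonant
]

def is_permissible(text):
    """Allow by default; reject only if a blacklisted context occurs."""
    return not any(p.search(text) for p in IMPERMISSIBLE)
```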

The Khmer LGR for the Root Zone, for example, deliberately disallows
the exception (in the word for "give") so that it (a) can be stated
more compactly and (b) does not allow the exceptional sequencing of
certain characters to become applicable outside the single exception.

An LGR is concerned with *single* instances of each word. Even the
most common word in a language can only be registered once in each
zone.
A label does not have to be a single word.  For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

Correct, but each compound can still occur only once. I cite this example only because the local body that drafted the rules decided that there was a reasonable tradeoff (complexity vs. generality) for the purpose of top level domain names (i.e. ".give*" not "give*.com").

For that application, complexity has a relatively high negative weight associated with it, and complete coverage, while desirable, is not given the same high positive weight that it would have in describing ordinary text.

Even if the BNFs did nothing more than capture succinctly the
information presented in text and tables, they would be useful.
For scripts where things like ZWJ and CGJ are optional, it doesn't
make sense to run them into the standard BNF - that just messes
things up. It is much more useful to provide generic context
information of how to add them to existing text.
For example, the CGJ is really intended to go between letters. So,
describe that context.

(Forgot to make clear that this was a bit of a hypothetical)

It can be quite useful next to combining marks.  For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.

Even if it is intended to go anywhere, even between digits, symbols and punctuation, it's much easier to describe that behavior separately rather than trying to insert it in every location in every regex. What I'm thinking of is a description that gives a "skeleton word", and you then state that this skeleton can be decorated (or whatever your preferred term is) by inserting a CGJ anywhere.
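That decoration scheme can be sketched by stripping the CGJ before
matching the skeleton; the skeleton regex below is a placeholder
(any run of letters), standing in for a real script-specific rule:

```python
import re

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Placeholder skeleton rule: one or more letters. A real rule set
# would substitute the script's actual word-shape regex here.
SKELETON = re.compile(r"^[^\W\d_]+$")

def matches_decorated(word):
    """Treat CGJ as freely insertable decoration: remove every CGJ,
    then check whether the remaining skeleton matches the rule."""
    return bool(SKELETON.match(word.replace(CGJ, "")))
```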

The same goes for ZWJ/ZWNJ in any script where they don't have a recognized specific effect in particular sequences.
