|
I note that the current definition of
Hangul syllable sequences uses an ambiguous regular _expression_, where two
alternates can be used to match the _expression_. Also its analysis in a top-down
parser will use backward fallbacks to try the alternates.
$Hangul_Sequence = ((($L+ $LV?) | ($L*
$LV)) $V* $T*) | ($L* $LVT $T*);
This is more visible if written like
this:
$Hangul_Sequence = ( ( ($L+
$LV?)
| ($L* $LV )
) $V*
$T*)
| ( $L* $LVT
$T*);
Some remarks about this
_expression_:
- See the first two lines for the ambiguous
alternates when trying to match a string with form ($L+ $LV).
- See how the trailing $T* can be
factorized for the first alternate (first three lines) and the second alternate
(last line).
- See how it's also impossible to
discriminate these first two alternatives after reading just a
string starting with $L+.
- It is only deterministic after
reading a string starting by $LV or $LVT.
A better regular _expression_ without this
problem (for faster performance in non-deterministic automatas) is:
$Hangul_Sequence = ($L+ ( ($LVT
$T*)
| ($LV $V* $T*)
| ( $V* $T*)
))
| ($LVT
$T*)
| ($LV $V*
$T*);This regular _expression_, which matches
exactly the same "language" becomes fully determinist.
If one wants to include the factorisation
of trailing $T* (which may just reduce the number of states in non optimizing
regular _expression_ engines, but does not benefit
really to performance), this one works as well:
| ($LV $V*) | $V* )) | $LVT | ($LV $V*) ) $T*; |

