I note that the current definition of Hangul syllable sequences uses an ambiguous regular expression: two of its alternatives can match the same string. A top-down parser analyzing it will also have to backtrack to try the other alternative.
 
$Hangul_Sequence = ((($L+ $LV?) | ($L* $LV)) $V* $T*) | ($L* $LVT $T*);
 
This is more visible if written like this:
 
$Hangul_Sequence = ( ( ($L+ $LV?)
                     | ($L* $LV )
                     )            $V* $T*)
                 | (    $L* $LVT      $T*);
 
Some remarks about this expression:
- The first two lines are the ambiguous alternatives: a string of the form ($L+ $LV) matches both of them.
- The trailing $T* is common to the first alternative (first three lines) and the second alternative (last line), so it can be factored out.
- It is also impossible to discriminate between the first two alternatives after reading only a prefix that matches $L+.
- It becomes deterministic only after reading a string starting with $LV or $LVT.
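
The first remark can be checked mechanically. The following is only a minimal sketch, assuming one-letter stand-ins that are not part of the original rules ("A" for $LV, "L" for $L), and using Python's re module only to test each branch separately:

```python
import re

# Hypothetical one-letter stand-ins for the character classes
# (an assumption of this sketch, not part of the original rules):
#   "L" stands for $L, "A" stands for $LV.
FIRST_BRANCH = re.compile(r"L+A?")   # ($L+ $LV?)
SECOND_BRANCH = re.compile(r"L*A")   # ($L* $LV)

def matching_branches(tokens):
    """Return the names of the branches that match the whole token string."""
    names = []
    if FIRST_BRANCH.fullmatch(tokens):
        names.append("L+ LV?")
    if SECOND_BRANCH.fullmatch(tokens):
        names.append("L* LV")
    return names

print(matching_branches("LA"))  # ['L+ LV?', 'L* LV'] : both branches match
print(matching_branches("L"))   # ['L+ LV?']
print(matching_branches("A"))   # ['L* LV']
```

Any string of the form ($L+ $LV) is accepted by both branches, which is exactly the ambiguity described above.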
 
A better regular expression without this problem (for faster performance in non-deterministic automata) is:
 
$Hangul_Sequence = ($L+ ( ($LVT     $T*)
                        | ($LV  $V* $T*)
                        | (     $V* $T*)
                        ))
                 |        ($LVT     $T*)
                 |        ($LV  $V* $T*);
 
This regular expression, which matches exactly the same "language", is fully deterministic.
 
If one also wants to factor out the trailing $T* (which may reduce the number of states in non-optimizing regular expression engines, but does not really benefit performance), this one works as well:
 
$Hangul_Sequence = ( ($L+ (  $LVT
                          | ($LV  $V*)
                          |       $V*
                          ))
                   |         $LVT
                   |        ($LV  $V*)
                   )                   $T*;
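
The claim that the three expressions match the same "language" can be tested by brute force on short strings. This is only a sketch under assumptions of my own: each class is replaced by a single letter (L, V, T as themselves, "A" for $LV, "B" for $LVT), the function name and the length bound are arbitrary, and an exhaustive test up to a given length is evidence, not a proof.

```python
import re
from itertools import product

# Hypothetical one-letter stand-ins (assumption of this sketch):
#   L = $L, V = $V, T = $T, A = $LV, B = $LVT.
ORIGINAL = re.compile(r"(?:(?:L+A?|L*A)V*T*)|(?:L*BT*)")
REWRITTEN = re.compile(r"L+(?:BT*|AV*T*|V*T*)|BT*|AV*T*")
FACTORED = re.compile(r"(?:L+(?:B|AV*|V*)|B|AV*)T*")

def disagreements(max_len=6):
    """List every token string up to max_len on which the three
    expressions do not all give the same accept/reject verdict."""
    found = []
    for length in range(max_len + 1):
        for symbols in product("LVTAB", repeat=length):
            tokens = "".join(symbols)
            verdicts = {
                bool(pattern.fullmatch(tokens))
                for pattern in (ORIGINAL, REWRITTEN, FACTORED)
            }
            if len(verdicts) > 1:
                found.append(tokens)
    return found

# An empty list means no counterexample was found up to max_len.
print(disagreements())
```

Note that this only compares which strings are accepted. It says nothing about determinism, since Python's re engine backtracks on all three forms.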
 
