Re: [patch] improved equivalent classes in regular expressions

Dominique Pellé Tue, 15 Jan 2013 22:41:31 -0800

Christian Brabandt <[email protected]> wrote:

> Bram,
> I recently discovered that using equivalence classes in regular
> expressions does not match all expected characters. Also, I think the
> current implementation does not work as expected, since searching for
> [[=Ä=]] matches only Ä and neither A nor any other A-like character.
>
> So I looked into the standard¹ and found that apparently not all
> characters are matched according to it.
>
> I wrote a test file² that contains all character codes that need to match
> for /[[=A=]]. If you search for /[[=A=]]$ you'll see that some
> characters are skipped.
>
> So I threw together a small vim script³, that parses the given standard
> file and generates a huge switch statement to be used in the function
> reg_equi_class() of the regexp.c in the Vim source.
>
> Using this generated code in regexp.c, I created this patch⁴, which
> successfully matches all expected characters from that testfile. It also
> adds equivalence classes for the 10 digits 0-9 (and added some missing
> equivalence classes, e.g. for 'Q')
>
> However, some characters are now missing from the equivalence classes,
> most notably U01E4 U01E5 U0149 U0166 U0167 U01B5 U01B6, since they
> are defined to have a different primary weight than their ASCII
> counterparts (G g n T t Z z), so I removed those chars from test44.
>
> regards,
> Christian
>
> ¹) ISO-14651:2012, available for free at
> http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
> If you download the zipfile below "ISO/IEC 14651:2011/Amd 1:2012", it
> contains the full reference table ISO14651_2012_TABLE1_en.txt (and
> should be equivalent to the Unicode standard)
> ²) Attached file A.txt
> ³) Attached file parse_iso14651.vim
> ⁴) Attached file new_equivalent_class.diff


Thanks Christian

I have not tried the patch yet but it looks like a nice improvement.

When using an equivalence class [[=x=]], I realized that what I
generally want is to apply it to a full string rather than to
a single character. Searching for "foobar" with...

/[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]

... works but is rather unpleasant.  I wish there was a flag
such as \q to switch on equivalence classes for the whole pattern,
which would work like \c for case insensitivity. So instead of the
above regexp, I could search for:

/\qfoobar

As far as I know \q is unused in Vim regexp, so
that should not break compatibility.
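
Until something like \q exists, the long pattern can at least be
generated rather than typed. The following is only a rough sketch in
Python (not Vim code, and not part of any patch); the function name
equivalence_pattern is made up for illustration, and it deliberately
refuses anything other than letters and digits so that no regexp
escaping is needed.

```python
def equivalence_pattern(text: str) -> str:
    """Wrap each character of text in a Vim equivalence class [[=x=]].

    Hypothetical helper: only letters and digits are accepted, so the
    result never contains characters that would need escaping.
    """
    if not text.isalnum():
        raise ValueError("this sketch only handles letters and digits")
    return "".join(f"[[={ch}=]]" for ch in text)


# Prints the search command from the example above.
print("/" + equivalence_pattern("foobar"))
```

The output is the same pattern as the hand-written one, which is
what a \q flag would effectively expand to internally.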

Maybe there could also be a function normalize({expr})
(any better name?) that, given a string with diacritics
such as "fòóbâr", returns "foobar", in a similar way to
tolower({expr}), which returns a lowercase version of the string.
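
To make the idea concrete, here is a minimal sketch in Python of what
such a function might do. It is an assumption on my side, not how Vim
would implement it: it uses Unicode canonical decomposition (NFD) and
drops combining marks, which is a different mechanism from the
ISO 14651 weights used for equivalence classes. The name
strip_diacritics is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritical marks removed.

    Sketch only: decompose to NFD, drop every character with a
    non-zero combining class, then recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("fòóbâr"))
```

This prints "foobar". Note that letters with a stroke, such as U+01E4
or U+0141, have no canonical decomposition, so this approach leaves
them unchanged; mapping them to G or L would need an explicit table.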

Before I spend time trying to do that, would it be useful
and accepted?

Regarding the few characters that are no longer equivalent,
I find it odd from a user's point of view. For example, U+01E4
(LATIN CAPITAL LETTER G WITH STROKE) was equivalent
to uppercase G but it is no longer equivalent to G.
Yet some other letters with stroke are still equivalent.
For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
is still equivalent to L. It seems inconsistent, even if that's
what the ISO standard says. The previous behavior made more
sense to me, for U+01E4 at least.

Regards
-- Dominique

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
