On Tue, 29 Dec 2020, '[email protected]' via vim_dev wrote:
> [[:upper:]]\{2,} is not correctly applied, resulting in not finding what
> is searched for...
>
> Please refer to the below text fragment:
> --------------------------------------------------------------------------
> " Version: GVim 8.2.2148
> " OS: Windows 7, 64-bit
>
> " Test pattern
> 05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
> 05. PJESNJA O GJEROJAKH mus. A. Davidjenko, M. Kovalja i B. Shjekhtjera ...
>
> " Use these as search expressions
> /\<[[:upper:]]\+\> " Finds all uppercase letters
> /\<[[:upper:]]\{2,}\> " Not finding what is searched for(!)
> /\<[А-Я]\{2,}\> " Finds the specified range of cyrillic letters
> --------------------------------------------------------------------------
I suppose the problem is that the second and fourth words in the input
are not matched?
> 05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
      ^^^^^   ^^^^^^
That is an interesting case. There are two peculiarities here.
Vim comes with two different regexp engines, which you can select
using the 'regexpengine' option (see :h 'regexpengine' and
:h two-engines).
By default, it uses automatic mode, which usually means the NFA engine;
only for some costly patterns does it fall back to the old
backtracking engine.
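For reference, a minimal sketch of how the option can be inspected and
set. These are the documented values of 'regexpengine'; nothing here is
specific to this report:

```vim
" Show the current value (0 means automatic selection, the default)
:set regexpengine?
" Force the old backtracking engine
:set regexpengine=1
" Force the NFA engine
:set regexpengine=2
" Return to automatic selection
:set regexpengine=0
```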
For some reason, the NFA engine, when used in automatic mode, fails to
compile this regex (however it doesn't mention that it switches the
engines :/). I see this in the logfile:
,----
| >>> NFA engine failed...
| Regexp: "\<[[:upper:]]\{2,}\>"
| Postfix notation (char): "NFA_BOW , NFA_START_COLL, NFA_CLASS_UPPER, NFA_CONCAT , NFA_END_COLL, "
| Postfix notation (int): -1006 -1021 -831 -1014 -1020
`----
Vim then falls back to the backtracking engine (I am not sure why,
because it doesn't call `report_re_switch()`). The way this engine
handles POSIX character classes is basically that it adds every upper
case character in the range 1-255 to one big OR branch. I believe a
character range can contain at most 256 characters, and I suppose it
stops at 256 because of old 8-bit encodings. That's why upper case
characters beyond that range, such as the Cyrillic ones here, are not
found.
However, if you manually switch to the NFA regexp engine, it starts to
work again. I am a bit puzzled why compiling the pattern works this
time.
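To compare the two engines on the test line without changing the global
option, one can prefix the pattern with `\%#=1` (backtracking) or
`\%#=2` (NFA). A rough sketch; the variable name `line` is only for
illustration, and I have not listed the results here, since they depend
on the Vim version:

```vim
" Test line from the report, shortened
let line = '05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко'

" Force the old backtracking engine for this one pattern
echo matchstr(line, '\%#=1\<[[:upper:]]\{2,}\>')

" Force the NFA engine for this one pattern
echo matchstr(line, '\%#=2\<[[:upper:]]\{2,}\>')
```

The same prefix works in an interactive search, e.g.
/\%#=2\<[[:upper:]]\{2,}\>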
I think an alternative (and faster) way would be to use the \u atom
instead of `[[:upper:]]`.
Best,
Christian