On Tue, 29 Dec 2020, '[email protected]' via vim_dev wrote:

> [[:upper:]]\{2,} is not correctly applied, resulting in not finding what 
> is searched for...
> 
> Please refer to the below text fragment:
> --------------------------------------------------------------------------
> " Version: GVim 8.2.2148
> " OS:      Windows 7, 64-bit
> 
> " Test pattern
> 05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
> 05. PJESNJA O GJEROJAKH mus. A. Davidjenko, M. Kovalja i B. Shjekhtjera ...
> 
> " Use these as search expressions
> /\<[[:upper:]]\+\>           " Finds all uppercase letters
> /\<[[:upper:]]\{2,}\>       " Not finding what is searched for(!)
> /\<[А-Я]\{2,}\>             " Finds the specified range of cyrillic letters
> --------------------------------------------------------------------------

I suppose the problem is that the second and fourth words of the input 
aren't matched?

> 05. ПЕСНЯ О ГЕРОЯХ муз. А. Давиденко, М. Коваля и Б. Шехтера ...
      ^^^^^   ^^^^^^

That is an interesting case. There are 2 peculiarities here:

Vim comes with two different regexp engines, which you can switch 
between using the 'regexpengine' option. (See :h 'regexpengine' and
:h two-engines)

By default, it uses the automatic mode. That usually means the NFA 
engine; only for some costly patterns might it fall back to the old 
backtracking engine.
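
To make the engine selection a bit more concrete, here is a small 
sketch using only what :h 'regexpengine' and :h /\%#= document. It is 
not taken from the original report, and the pattern is simply the one 
from the test case:

,----
| " Show the current setting:
| "   0 = automatic selection (the default)
| "   1 = old backtracking engine
| "   2 = NFA engine
| :set regexpengine?
|
| " Select an engine for all following searches
| :set regexpengine=2
|
| " Or force an engine for a single pattern with the \%#= prefix
| /\%#=1\<[[:upper:]]\{2,}\>
| /\%#=2\<[[:upper:]]\{2,}\>
`----

The \%#= prefix has to be the very first thing in the pattern, which 
makes it handy for comparing the engines without touching the option.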

For some reason, the NFA engine, when used in automatic mode, fails to 
compile this regex (however it doesn't mention that it switches the 
engines :/). I see this in the logfile:

,----
| >>> NFA engine failed...
| Regexp: "\<[[:upper:]]\{2,}\>"
| Postfix notation (char): "NFA_BOW , NFA_START_COLL, NFA_CLASS_UPPER, NFA_CONCAT , NFA_END_COLL, "
| Postfix notation (int): -1006 -1021 -831 -1014 -1020
`----

Vim then falls back to the backtracking engine (I am not sure why this 
happens silently, because it doesn't call `report_re_switch()`). The 
way this engine handles POSIX character classes is basically that it 
adds every character between 1 and 255 that is upper case to one big OR 
branch. I believe a character range can contain at most 256 characters, 
and I suppose it stops at 256 because of the old 8-bit encodings. 
That's why the other uppercase characters, such as the Cyrillic ones 
here, are not found.

However, if you manually switch to the NFA regexp engine, the search 
works. I am a bit puzzled why compiling the pattern succeeds this 
time.
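
One way to compare the two engines directly is to force each of them 
with the \%#= prefix and test a single word. This is only a minimal 
sketch: it assumes 'encoding' is utf-8, the variable name `word` is 
made up for illustration, and I deliberately give no expected output, 
since the result depends on the Vim version:

,----
| " One of the words from the test line
| let word = 'ПЕСНЯ'
|
| " =~# matches case-sensitively and yields 1 on a match, 0 otherwise
| echo 'backtracking: ' . (word =~# '\%#=1\<[[:upper:]]\{2,}\>')
| echo 'nfa:          ' . (word =~# '\%#=2\<[[:upper:]]\{2,}\>')
`----

If the analysis above is right, the two lines should disagree on an 
affected version.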

I think an alternative (and faster) way would be to use the \u atom 
instead of `[[:upper:]]`.
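
For reference, the search from the report rewritten with that atom 
would look like the sketch below. One caveat: :h /\u describes \u as 
[A-Z], so whether it covers the Cyrillic words in this test line is 
something to check rather than assume.

,----
| " Same search, with \u in place of the bracket expression
| /\<\u\{2,}\>
`----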

Best,
Christian