Hi Bram,

On Wed, Apr 12, 2023 at 10:36 AM Bram Moolenaar <[email protected]> wrote:
> Yegappan wrote:
> >
> > The language server protocol supports specifying offsets in text
> > documents using UTF-8 or UTF-16 or UTF-32 code units.
> > The UTF-16 code unit is the default.
> >
> > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> >
> > Different language servers have different levels of support for using
> > the different code units. Vim uses the UTF-32 code units for the
> > offsets. This makes it difficult to support different language
> > servers from a Vim LSP plugin.
> >
> > Port the strutfindex() and strbyteindex() functions from Neovim to
> > support this.
>
> I find the function names hard to read and confusing. We might be able
> to think of better names when the exact functionality is described.
>
> The terminology is confusing. "UTF-32 byte index" contradicts itself,
> since each character is four bytes. I think what is meant is "UTF-32
> encoded character index", which is equal to "character index", since
> there is no Unicode character that takes more than one UTF-32 code
> point.
>
> In Vim all Unicode characters are internally encoded with UTF-8. Thus
> the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
> is also confusing. The help should be clearer about what this means
> exactly. I'm not sure how; saying something like "the character index
> of {string} if it would be encoded with UTF-32" makes it complex. I
> think that instead of using "UTF-32 index" we can just use "character
> index", and somewhere mention that "UTF-32" can be considered the same
> (if we need to mention this at all, since the term "UTF-32" isn't widely
> used).
>
> For "UTF-16" it gets more complicated; we can't avoid mentioning that
> the index applies to "{string}" encoded as UTF-16. Looking back, UTF-16
> should never have been made a standard IMHO, but it exists and it is
> used (especially on MS-Windows), thus we need to support it.
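To make the code-unit distinction above concrete: the same position in a string gets a different index depending on whether you count characters (UTF-32 code units), UTF-16 code units, or UTF-8 bytes. A small Python sketch (Python is used only for illustration here; Vim's own behavior is in its built-in functions):

```python
# A string containing a character outside the Basic Multilingual Plane:
# 'a', U+1F600 (an emoji), then 'b'.
s = "a\U0001F600b"

# Character (UTF-32) index of 'b': every character is one UTF-32 code unit.
print(s.index("b"))                                    # 2

# UTF-8 byte index of 'b': the emoji takes 4 bytes in UTF-8.
print(s.encode("utf-8").index(b"b"))                   # 5

# UTF-16 code-unit index of 'b': the emoji is a surrogate pair (2 units).
utf16 = s.encode("utf-16-le")
print(utf16.index("b".encode("utf-16-le")) // 2)       # 3
```

This is exactly why a Vim LSP plugin cannot hand a character index to a server that expects UTF-16 offsets: the indexes diverge as soon as the line contains a character above U+FFFF.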
> Conversion between UTF-8 and character index already exists; you can use
> charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> functions to convert between UTF-8 and UTF-16 indexes? Or between
> character (UTF-32) and UTF-16 indexes? The latter makes more sense.
>
> It should also be possible to specify the handling of composing
> characters. Either as an argument, like with charidx(), or using
> separate functions, as with byteidx()/byteidxcomp().

I have updated the PR to add the utf16idx() function and introduced an
optional UTF-16 flag to the byteidx() and byteidxcomp() functions.

- Yegappan

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

--- 
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/vim_dev/CAAW7x7n8sA%3D_0%3Dd54YfpZEpk3T8%3DvSc%3DeXryPAy%3DfK97YT5t6w%40mail.gmail.com.
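The character-index-to-UTF-16-index conversion that the thread converges on can be sketched in a few lines. The helper names below (char_to_utf16, utf16_to_char) are hypothetical, chosen only for this illustration; they are not Vim or Neovim functions, and composing-character handling (which Bram raises above) is ignored here:

```python
def char_to_utf16(s: str, char_idx: int) -> int:
    """UTF-16 code units occupied by the first char_idx characters of s.

    Characters above U+FFFF need a surrogate pair (2 code units) in UTF-16;
    everything else is 1 code unit.
    """
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s[:char_idx])

def utf16_to_char(s: str, utf16_idx: int) -> int:
    """Character index corresponding to a UTF-16 code-unit offset in s."""
    units = 0
    for i, c in enumerate(s):
        if units >= utf16_idx:
            return i
        units += 2 if ord(c) > 0xFFFF else 1
    return len(s)

s = "a\U0001F600b"
print(char_to_utf16(s, 2))   # 3 (the emoji counts as a surrogate pair)
print(utf16_to_char(s, 3))   # 2
```

An LSP plugin would use the first direction when sending positions to a server that advertises UTF-16 offsets, and the second when interpreting positions received from it.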
