Hi Bram,

On Wed, Apr 12, 2023 at 10:36 AM Bram Moolenaar <[email protected]> wrote:
> Yegappan wrote:
> >
> > The language server protocol supports specifying offsets in text
> > documents using UTF-8 or UTF-16 or UTF-32 code units.
> > The UTF-16 code unit is the default.
> >
> > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
> >
> > Different language servers have different levels of support for using
> > the different code units. Vim uses the UTF-32 code units for the
> > offsets. This makes it difficult to support different language
> > servers from a Vim LSP plugin.
> >
> > Port the strutfindex() and strbyteindex() functions from Neovim to
> > support this.
>
> I find the function names hard to read and confusing. We might be able
> to think of better names when the exact functionality is described.
>
> The terminology is confusing. "UTF-32 byte index" contradicts itself,
> since each character is four bytes. I think what is meant is "UTF-32
> encoded character index", which is equal to "character index", since
> there is no Unicode character that takes more than one UTF-32 code
> point.
>
> In Vim all Unicode characters are internally encoded with UTF-8. Thus
> the "{string}" argument of strbyteindex() will be UTF-8 encoded. This
> is also confusing. The help should be clearer about what this means
> exactly. I'm not sure how; saying something like "the character index
> of {string} if it would be encoded with UTF-32" makes it complex. I
> think that instead of using "UTF-32 index" we can just use "character
> index", and somewhere mention that "UTF-32" can be considered the same
> (if we need to mention this at all, since the term "UTF-32" isn't widely
> used).
>
> For "UTF-16" it gets more complicated; we can't avoid mentioning that
> the index applies to "{string}" encoded as UTF-16. Looking back, UTF-16
> should never have been made a standard IMHO, but it exists and it is
> used (especially on MS-Windows), thus we need to support it.
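To make the code-unit distinction above concrete: the same position in a string gets a different index depending on whether you count characters (UTF-32 code units), UTF-16 code units, or UTF-8 bytes. A small Python sketch (Python is used only for illustration here; Vim's own behavior is in its built-in functions):

```python
# A string containing a character outside the Basic Multilingual Plane:
# 'a', U+1F600 (an emoji), then 'b'.
s = "a\U0001F600b"

# Character (UTF-32) index of 'b': every character is one UTF-32 code unit.
print(s.index("b"))                                    # 2

# UTF-8 byte index of 'b': the emoji takes 4 bytes in UTF-8.
print(s.encode("utf-8").index(b"b"))                   # 5

# UTF-16 code-unit index of 'b': the emoji is a surrogate pair (2 units).
utf16 = s.encode("utf-16-le")
print(utf16.index("b".encode("utf-16-le")) // 2)       # 3
```

This is exactly why a Vim LSP plugin cannot hand a character index to a server that expects UTF-16 offsets: the indexes diverge as soon as the line contains a character above U+FFFF.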
> Conversion between UTF-8 and character index already exists; you can use
> charidx() and byteidx()/byteidxcomp(). Possibly we only need to add
> functions to convert between UTF-8 and UTF-16 indexes? Or between
> character (UTF-32) and UTF-16 indexes? The latter makes more sense.
>
> It should also be possible to specify the handling of composing
> characters. Either as an argument, like with charidx(), or using
> separate functions, as with byteidx()/byteidxcomp().

I have updated the PR to add the utf16idx() function and introduced an
optional UTF-16 flag to the byteidx() and byteidxcomp() functions.

- Yegappan

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

--- 
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/vim_dev/CAAW7x7n8sA%3D_0%3Dd54YfpZEpk3T8%3DvSc%3DeXryPAy%3DfK97YT5t6w%40mail.gmail.com.
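The character-index-to-UTF-16-index conversion that the thread converges on can be sketched in a few lines. The helper names below (char_to_utf16, utf16_to_char) are hypothetical, chosen only for this illustration; they are not Vim or Neovim functions, and composing-character handling (which Bram raises above) is ignored here:

```python
def char_to_utf16(s: str, char_idx: int) -> int:
    """UTF-16 code units occupied by the first char_idx characters of s.

    Characters above U+FFFF need a surrogate pair (2 code units) in UTF-16;
    everything else is 1 code unit.
    """
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s[:char_idx])

def utf16_to_char(s: str, utf16_idx: int) -> int:
    """Character index corresponding to a UTF-16 code-unit offset in s."""
    units = 0
    for i, c in enumerate(s):
        if units >= utf16_idx:
            return i
        units += 2 if ord(c) > 0xFFFF else 1
    return len(s)

s = "a\U0001F600b"
print(char_to_utf16(s, 2))   # 3 (the emoji counts as a surrogate pair)
print(utf16_to_char(s, 3))   # 2
```

An LSP plugin would use the first direction when sending positions to a server that advertises UTF-16 offsets, and the second when interpreting positions received from it.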
