Re: Getting the byte index (column) given the character column number

Yegappan Lakshmanan Mon, 21 Nov 2022 22:14:42 -0800

Hi Bram,

On Mon, Nov 21, 2022 at 2:17 PM Bram Moolenaar <[email protected]> wrote:
>
>
> Yegappan wrote:
>
> > > > > > The language server protocol messages use character column number
> > > > > > whereas many of the built-in Vim functions (e.g. matchaddpos()) deal
> > > > > > with byte column number.
> > > > > >
> > > > > > Several built-in functions were added to convert between the
> > > > > > character and byte column numbers (byteidx(), charcol(), charidx(),
> > > > > > getcharpos(), getcursorcharpos(), etc.).
> > > > > > But these functions deal with strings, the current cursor position
> > > > > > or the position of a mark.
> > > > > >
> > > > > > We currently don't have a function to return the byte number given
> > > > > > the character number in a line in a buffer.  The workaround is to use
> > > > > > getbufline() to get the entire buffer line and then use byteidx() to
> > > > > > get the byte number from the character number.
> > > > > >
> > > > > > I am thinking of introducing a new function named charcol2bytecol()
> > > > > > that accepts a buffer number, line number and the character number
> > > > > > in the line and returns the corresponding byte number.  Any
> > > > > > suggestions/comments on this?
> > > > > >
> > > > > > We should also modify the matchaddpos() function to accept
> > > > > > character numbers in a line in addition to the byte numbers.
> > > > >
> > > > > Just to make sure we understand what we are talking about: This is
> > > > > always about text in a buffer?  Thus the buffer text is somehow passed
> > > > > through the LSP to a server, which then returns information with
> > > > > character indexes.
> > > >
> > > > Yes.  The location information returned by the LSP server is about the
> > > > text in the buffer.
> > > >
> > > > > One detail that matters: Are composing characters counted separately,
> > > > > or not counted (part of the base character)?
> > > >
> > > > I think composing characters are not counted.  But I couldn't find this
> > > > mentioned in the LSP specification:
> > > >
> > > > https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
> > >
> > > Disappointing to not mention such an important part of the interface.
> > > Since I do not see any mention of composing characters, I would guess
> > > that each utf-8 character is counted separately.
> > >
> > > > > Also, I assume a Tab is counted as just one character, not the number
> > > > > of display cells it occupies.
> > > >
> > > > Yes. Tab is counted as one character.
> > > >
> > > > > I wonder if it's really helpful to add a new function if it can
> > > > > currently be done with two.  You already mention that the text can be
> > > > > obtained with getbufline(), and then get the byte index from the
> > > > > character index with byteidx().  What is the problem with doing it
> > > > > that way?
> > > >
> > > > If the conversion has to be done too many times then it is not
> > > > efficient.
> > >
> > > How can you say that without trying?
> >
> > I used the attached Vim9 script to measure the performance of
> > getbufline() + byteidx() compared to calling the col() function.  I see
> > that the first one takes three times longer to get the column number
> > compared to the second one.
>
> This must be because getbufline() always returns a list of strings.
> Creating the list, adding a list item and then making a copy of the text
> takes longer.  Using getline() (just to try it out, wouldn't work in
> your actual code) brings the difference down to less than two times.
>
> Not storing the result of getbufline() in a variable, but passing it to
> byteidx() with "->" also helps make it faster.
>
> The range should be bigger, I used 10x to get more stable results.  As a
> rule of thumb: the profiling time should be at least 100 msec to avoid
> too much fluctuation.
>
> After making some adjustments it is now only about 16% slower.
> I'll make a patch to get getbufoneline(), since just getting the string
> for one line would be very common and it is about twice as fast.
>
> The name getbufoneline() isn't nice, couldn't come up with something
> better.  Should have called the existing function getbuflines() instead
> of getbufline(), but we can't change that now.
>
> The resulting essential line in ProfByteIdxFunction():
>
>     idx = getbufoneline('', 5344)->byteidx(77)
>
> > > Getting the buffer line means making a copy of the text, that's
> > > quite cheap.  The only added overhead is two function calls instead
> > > of one, which has really minimal impact in the context of all the
> > > other things being done.  Also, if there are multiple positions in
> > > one line then getbufline() only needs to be called once, thus
> > > performance should be very close to whatever function we would use
> > > instead.
> > >
> > > > > Other message:
> > > > >
> > > > > > Another alternative is to extend the col() function.  The col()
> > > > > > function currently accepts a list with two numbers (a line number
> > > > > > and a byte number or "$") and returns the byte number.
> > > > > > This can be modified to also accept a list with three numbers (line
> > > > > > number, column number and a boolean indicating character column or
> > > > > > byte column) and return the byte number.
> > > > >
> > > > > I don't like this, the first line for the col() help is:
> > > > >
> > > > >         The result is a Number, which is the byte index of the column
> > > > >
> > > > > When the boolean is true this would be the character index, that is
> > > > > hard to explain.  A user would have to look really hard to find this
> > > > > functionality.
> > > >
> > > > The boolean doesn't change the return value of the col() function.  It
> > > > just changes how the col() function interprets the column number in the
> > > > list.  If it is true, then the col() function will use the column number
> > > > as the character number.  If it is false or not specified, then the
> > > > col() function will use it as the byte number.  In both cases the col()
> > > > function will always return the byte index of the column.
> > >
> > > I was confused.  Currently in the [lnum, col] value of {expr} the column
> > > is the character offset.
> >
> > Currently in the [lnum, col] value of {expr}, the column is the byte offset.
> > For example, if you use multibyte characters in a line and get the column
> > number:
> >
> > =====================================================
> > new
> > call setline(1, "\u2345\u2346\u2347\u2348")
> > echo col([1, 3])
> > =====================================================
> >
> > The above script echoes 3 instead of 7.  The byte index of the third
> > character is 7.
>
> Should really update the help to avoid the term "column number", it is
> confusing.  The remark "Most useful when the column is "$"" is a hint
> that is easily missed.
>
> OK, I finally see your point, sorry it took so long.
>
> Unfortunately, adding a third argument that is a flag, indicating whether
> the second argument means bytes or characters, conflicts with other
> places where the third argument is "coloff".  This is used with
> virtcol() for example.
>
> You also still have the limitation that col() only works for the current
> buffer.
>
> Making matchaddpos() accept a character index instead of a byte index is
> going to trigger doing this in many more places.  And internally the
> conversion will have to be done anyway.  Therefore sticking to using a
> byte index in most places that deal with text avoids a lot of complexity
> in the arguments of the functions.
>
> So let's go back to making the character index to byte index conversion
> fast.  That is a generic solution and avoids changes all over the place.
> Please try out the new getbufoneline() function, as mentioned above.
>


I tested the new getbufoneline() function and the performance is much
better.  Thanks for adding this function.
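
For anyone who wants to repeat the comparison, a rough sketch follows.  It
is not the script attached earlier in this thread: the function names
TimeGetbufline() and TimeGetbufoneline() and the loop count are made up for
illustration, and it assumes a Vim that has getbufoneline().  The timings
will vary from one machine to another.

```vim
vim9script

# Hypothetical timing helpers, not the script attached earlier in this thread.
# Each one returns the elapsed time in seconds for "count" conversions.
def TimeGetbufline(count: number): float
  const start = reltime()
  for i in range(count)
    # getbufline() builds a List of lines; take the only item.
    var idx = getbufline('', 1)[0]->byteidx(2)
  endfor
  return reltimefloat(reltime(start))
enddef

def TimeGetbufoneline(count: number): float
  const start = reltime()
  for i in range(count)
    # getbufoneline() returns the line text directly as a String.
    var idx = getbufoneline('', 1)->byteidx(2)
  endfor
  return reltimefloat(reltime(start))
enddef

# Scratch buffer with one line of multibyte characters.
new
setline(1, "\u2345\u2346\u2347\u2348")
echo 'getbufline():    ' TimeGetbufline(100000)
echo 'getbufoneline(): ' TimeGetbufoneline(100000)
```

As suggested above, the loop count should be large enough that each
measurement takes at least about 100 msec.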

>
> If the performance is indeed quite bad, adding a function that converts
> a text location in a buffer specified by character index to a byte index
> could be a solution.  Perhaps:
>
>    bufcol({buf}, {expr})             {expr} a string like with col()
>    bufcol({buf}, {lnum}, {expr})     {expr} a string like with col()
>    bufcol({buf}, {lnum}, {charidx})
>

For now, I think we can use the getbufoneline() and byteidx() functions.
If another use case for this comes up in the future, we can add this.
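
A minimal sketch of how the two calls can be wrapped is below.  The helper
name CharIdxToByteIdx() is only an illustration and not an existing Vim
function.  It assumes zero-based indexes, as byteidx() uses, and that
'encoding' is utf-8.

```vim
vim9script

# Hypothetical helper (not part of Vim): convert a zero-based character
# index in a buffer line to a zero-based byte index.
# byteidx() returns -1 when the character index is past the end of the text.
def CharIdxToByteIdx(buf: any, lnum: number, charidx: number): number
  # getbufoneline() returns the text of one buffer line as a String.
  return getbufoneline(buf, lnum)->byteidx(charidx)
enddef

# Same test line as in the col() example: four characters of 3 bytes each.
new
setline(1, "\u2345\u2346\u2347\u2348")
# The third character (index 2) starts after two 3-byte characters.
echo CharIdxToByteIdx('', 1, 2)    # echoes 6
```

Functions that take a byte column, such as matchaddpos(), count from one, so
the column to pass for that character would be the returned index plus one
(7 here).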

Regards,
Yegappan

-- 
-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
