Re: Irritating column numbers with encoding=utf-8

2006-07-06 Thread Jürgen Krämer

Hi,

Bram Moolenaar wrote:

> Jürgen Krämer wrote:

>> with 'encoding' set to utf-8 there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line. I don't know what the intended meaning of column count and the
>> intended behaviour of cursor() are, but it seems they both depend on
>> the size of the encoded characters. I always thought nth column was
>> more or less a synonym for nth character on a line while nth virtual
>> column meant nth cell on a screen line.

[snipped]

>> I don't know whether the shown behaviour is a bug or just a feature I
>> don't like, but in summary I think column number should really
>> represent a character count (i.e., corresponding to what the user sees),
>> not a byte count depending on the underlying encoding.
>>
>> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
>> the code will definitely introduce an incompatibility. So the final
>> question is: What do you (Vimmers) and you (Bram) think: is there a need
>> for a change?

> I don't know why you call this a column count, in most places it's
> called a byte count.  Perhaps in some places in the docs the remark
> about this actually being a byte count is missing.

Sorry, the "column count" in the first paragraph should have been
"column number". I called it that because I have the 'statusline' option
set to

  %<%f%= [%1*%M%*%{','.fileformat}%R%Y] [%6l,%4c%V] %3b=0x%02B %P

and noticed that %4c-%V displayed two numbers instead of the one I
expected, because I knew there were no tabs or unprintable characters
on that line. Even more disturbing was the fact that the first number
(the column number) was bigger than the second one (the virtual column
number). So I checked :help statusline and it told me

c N   Column number.
v N   Virtual column number.
V N   Virtual column number as -{num}.  Not displayed if equal to 'c'.

> You could also want a character count.  But what is a character when
> using composing characters?  E.g., when the umlaut is not included in
> a character but added as a separate composing character?

I would say that a character is what the user sees. Why should he (be
forced to) know whether ä is represented internally as LATIN SMALL
LETTER A WITH DIAERESIS or as LATIN SMALL LETTER A plus COMBINING
DIAERESIS? So in my opinion column count is equivalent to character
count unless there are characters like tabs and unprintable ones that
have a special representation -- on the screen, not internally.
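The two internal representations named above can be made concrete. The following is a small Python sketch, used purely for illustration (it says nothing about how Vim stores text). It compares the precomposed and the decomposed form of ä with the standard unicodedata module. The helper visible_chars is a made-up name for this example, and skipping combining marks is only a rough approximation of "what the user sees".

```python
import unicodedata

def visible_chars(text):
    # Count code points that are not combining marks (rough approximation
    # of user-perceived characters; hypothetical helper for this example).
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# LATIN SMALL LETTER A WITH DIAERESIS as a single code point
precomposed = unicodedata.normalize("NFC", "\u00e4")
# LATIN SMALL LETTER A followed by COMBINING DIAERESIS
decomposed = unicodedata.normalize("NFD", "\u00e4")

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label,
          "code points:", len(text),
          "UTF-8 bytes:", len(text.encode("utf-8")),
          "visible:", visible_chars(text))
```

Both forms look identical on screen, yet they differ in code points (1 versus 2) and in UTF-8 bytes (2 versus 3), so a byte count and a code point count can both disagree with what is visible.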

> It's not so obvious what to do.  In these situations I'd rather keep it
> as it is.

I know it's a big change and would introduce an incompatibility with
older versions, but here is another example. Take this line (ignoring
the leading spaces)

  ääbbcc

and the following commands

  :s/\%3c../xx/
  %s/^..\zs../xx/

From my point of view they should both replace the 3rd and 4th column
with xx. When encoding is set to latin1 they do, but not when it is
set to utf-8 -- the first one replaces äb with xx. As a user I would
be really stumped and ask: why is that? It's the same text as before.

Changing these commands to

  :s/\%2c../xx/
  %s/^.\zs../xx/

makes things even more irritating. The second one works as expected, now
correctly replacing äb with xx, but the first one fails with "E486:
Pattern not found: \%2c..". Again: should I (as a user) really need to
know that \%2c depends on the number of non-ASCII letters in front of
the column I'm interested in?
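The behaviour of the \%3c and \%2c patterns on ääbbcc can be modelled outside Vim. The Python sketch below is only an illustration under the assumption that \%Nc means "a character starts at byte N of the encoded line"; the function names are invented for this example and it is not Vim's regexp engine.

```python
def char_at_byte_col(line, col, encoding):
    # Return the 0-based index of the character that starts at 1-based
    # byte column `col`, or None if no character starts exactly there.
    offset = 0
    for index, ch in enumerate(line):
        if offset + 1 == col:
            return index
        offset += len(ch.encode(encoding))
    return None

def replace_two_at_col(line, col, encoding, replacement="xx"):
    # Rough model of :s/\%{col}c../xx/ for the given encoding.
    start = char_at_byte_col(line, col, encoding)
    if start is None or start + 2 > len(line):
        return None  # stands in for "E486: Pattern not found"
    return line[:start] + replacement + line[start + 2:]

line = "ääbbcc"
print(replace_two_at_col(line, 3, "latin-1"))  # ääxxcc
print(replace_two_at_col(line, 3, "utf-8"))    # äxxbcc
print(replace_two_at_col(line, 2, "utf-8"))    # None
```

With latin1 every character is one byte, so byte column 3 is the third character. With utf-8 the first ä occupies bytes 1 and 2, so byte column 3 is the second ä, and byte column 2 falls inside a character, which is why nothing can match there.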

Regards,
Jürgen

-- 
Jürgen Krämer                Softwareentwicklung
HABEL GmbH & Co. KG          mailto:[EMAIL PROTECTED]
Hinteres Öschle 2            Tel: +49 / 74 61 / 93 53 - 15
78604 Rietheim-Weilheim      Fax: +49 / 74 61 / 93 53 - 99


RE: Irritating column numbers with encoding=utf-8

2006-07-06 Thread Zdenek Sekera

Yes, this is indeed very unexpected IMHO and as you say
mighty irritating. I find it very hard to disagree with
your arguments. This should be changed IMHO, even if 
it surely is a big change.

---Zdenek


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread James Vega
On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:
 
> Hi,
>
> with 'encoding' set to utf-8 there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.

Column number n is really the nth byte on that line.  This is described
at :help /\%c.  This description should explain all the behavior
you're seeing.  This is the intended behavior and I'm not sure of a way
off-hand to get the visual character count like you want.
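To see what "the nth byte" means in practice, here is a short Python sketch (an illustration only, not Vim code; the generator name columns is invented). It lists the 1-based byte column and character column of each character in the line äbc from the original report, assuming UTF-8.

```python
def columns(line, encoding="utf-8"):
    # Yield (character, byte_column, character_column), all 1-based.
    byte_col = 1
    for char_col, ch in enumerate(line, start=1):
        yield ch, byte_col, char_col
        byte_col += len(ch.encode(encoding))

for ch, byte_col, char_col in columns("äbc"):
    print(f"{ch}: byte column {byte_col}, character column {char_col}")
```

The byte columns come out as 1, 3 and 4 while the character columns are 1, 2 and 3, because ä takes two bytes in UTF-8.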

James
-- 
GPG Key: 1024D/61326D40 2003-09-02 James Vega [EMAIL PROTECTED]




Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Jürgen Krämer

Hi,

James Vega wrote:

> On Wed, Jul 05, 2006 at 11:50:51AM +0200, Jürgen Krämer wrote:

>> with 'encoding' set to utf-8 there is a quite confusing (to me)
>> difference between the column number and my expectations (supported by
>> the virtual column number) if there are non-ASCII characters on the
>> line.

> Column number n is really the nth byte on that line.  This is described
> at :help /\%c.  This description should explain all the behavior
> you're seeing.  This is the intended behavior and I'm not sure of a way
> off-hand to get the visual character count like you want.

yes, it does *explain* the behaviour. But it makes things even worse.
Suppose I have some lines with aligned data (just like a table) where I
want to replace certain columns with dashes, e.g.,

  Peter    Traurig irgendwo  0
  Hänschen Klein   unterwegs 1
  Jürgen   Krämer  hier      2

  :%s/\%18c.*\%27c/-/

should strike out the third column of the table, but the result is

  Peter    Traurig - 0
  Hänschen Klein  -s 1
  Jürgen   Krämer-   2

which depends on the arbitrary number of non-ASCII characters in front
of the position used, characters whose internal representation should
never be relevant for this substitution, because the user cannot know
it.
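The effect on the table can be reproduced with plain slicing. The Python sketch below is an illustration, not Vim's substitute command. It assumes the table is aligned with spaces so that the third column starts at character 18 and the fourth at character 28, and it models \%18c.*\%27c as "replace columns 18 to 26 with a dash", once counting bytes and once counting characters. The function names are invented for this example.

```python
TABLE = [
    "Peter    Traurig irgendwo  0",
    "Hänschen Klein   unterwegs 1",
    "Jürgen   Krämer  hier      2",
]

def strike_bytes(line, first_col, end_col, encoding="utf-8"):
    # Replace 1-based byte columns first_col .. end_col-1 with "-".
    # A cut inside a multibyte character would show up as U+FFFD.
    raw = line.encode(encoding)
    cut = raw[:first_col - 1] + b"-" + raw[end_col - 1:]
    return cut.decode(encoding, errors="replace")

def strike_chars(line, first_col, end_col):
    # Same replacement, but counting characters instead of bytes.
    return line[:first_col - 1] + "-" + line[end_col - 1:]

for row in TABLE:
    print(strike_bytes(row, 18, 27))
for row in TABLE:
    print(strike_chars(row, 18, 27))
```

The byte-based version yields the three damaged lines shown above. The character-based version removes the same visual column on every row, whatever umlauts precede it.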

Since it works as documented it is hard to call it a bug, but I would
really consider it a mis-feature, because it works in such a
non-predictable way.

To work around the problem in this example is not that hard -- I can use
/\%...v instead. The example in my original mail poses a bigger problem
(to me) -- I'd like to switch to encoding=utf-8 as default, but I
often need to work with text files of fixed line length. With encoding
set to latin1 the difference between column number and virtual column
number in the status line is a visual clue that there is a tab or a
control code in the line, reminding me to look for this character. With
UTF-8 encoding this hint would be rendered useless because of all those
little umlauts in German. :-(

But perhaps this is just my special problem.

Regards,
Jürgen




Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Yakov Lerner

On 7/5/06, Jürgen Krämer [EMAIL PROTECTED] wrote:

> To work around the problem in this example is not that hard -- I can use
> /\%...v instead.

Yes


> The example in my original mail poses a bigger problem
> (to me) -- I'd like to switch to encoding=utf-8 as default, but I
> often need to work with text files of fixed line length. With encoding
> set to latin1 the difference between column number and virtual column
> number in the status line is a visual clue that there is a tab or a
> control code in the line, reminding me to look for this character. With
> UTF-8 encoding this hint would be rendered useless because of all those
> little umlauts in German. :-(


There's yet another reason for col()!=virtcol().

It's unprintable characters like ^A ^@ ^[
Granted, they occur rarely in textfiles, but if they do,
they'll cause virtcol() != col().
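The three sources of a difference between the two column types (tabs, unprintable characters, multibyte characters) can be put side by side. The Python sketch below is a simplified model, not Vim's implementation: it assumes a tab stop of 8, control characters shown in two cells (like ^A), and one cell for every other character, so double-width characters are ignored. Both function names are invented for this example.

```python
def screen_col(line, index, tabstop=8):
    # 1-based screen cell where line[index] starts (simplified model).
    cell = 0  # cells used by the characters before `index`
    for ch in line[:index]:
        if ch == "\t":
            cell += tabstop - cell % tabstop  # advance to the next tab stop
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            cell += 2                         # displayed as ^A, ^[, ^? ...
        else:
            cell += 1                         # assumption: one cell each
    return cell + 1

def byte_col(line, index, encoding="utf-8"):
    # 1-based byte column where line[index] starts.
    return len(line[:index].encode(encoding)) + 1

for text in ("a\tb", "\x01b", "äb"):
    last = len(text) - 1
    print(repr(text), "byte column", byte_col(text, last),
          "screen column", screen_col(text, last))
```

For the trailing b this prints byte column 3 and screen column 9 after a tab, 2 and 3 after ^A, and 3 and 2 after ä. Only in the multibyte case is the byte column the larger of the two.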

If you stick with virtcol() and \%v, you'll
probably not feel any inconvenience. I mean, there are two types
of columns (virtual and non-virtual), and if someone
confuses the two, and uses \%c instead of \%v or col() instead of
virtcol(), or vice versa, it's inconvenient.

Once the confusion is fixed, and you use the right type
of column index, doesn't that solve the inconvenience?
(Except that there are still two types of columns, which
requires increased attention as to which one
to use in each case.)

Yakov


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Yakov Lerner

On 7/5/06, Jürgen Krämer [EMAIL PROTECTED] wrote:

> with 'encoding' set to utf-8 there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line.


An additional remark: as James noted, \%c
is not a character offset (in the case of multibyte chars),
but a byte offset.

In case you want to match
not by visual column (\%v) and not by byte
offset, but by character index in the line, you
can do this:

/^.\{22}xyz

This matches xyz at the 23rd character position,
counting each character, multibyte or
single-byte, as one position. Does this
possibly solve your matching problem?
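The same idea, counting with "any character" instead of a column number, can be checked outside Vim. The Python sketch below is only an analogy using Python's re module, not Vim's regexp engine. It applies the equivalent pattern once to text and once to the UTF-8 bytes of the same line.

```python
import re

# 22 two-byte characters followed by xyz (a made-up test line)
line = "ä" * 22 + "xyz"

# On text, "." consumes one character, so xyz is found at position 23.
text_match = re.match(r"^.{22}xyz", line)

# On the encoded bytes, "." consumes one byte, so 22 bytes cover only
# 11 of the ä characters and xyz is not found there.
byte_match = re.match(rb"^.{22}xyz", line.encode("utf-8"))

print(text_match is not None)
print(byte_match is not None)
```

The text match succeeds and the byte match fails, which mirrors the difference between a character index and the byte offset used by \%c.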

Yakov


Re: Irritating column numbers with encoding=utf-8

2006-07-05 Thread Bram Moolenaar

Jürgen Krämer wrote:

> with 'encoding' set to utf-8 there is a quite confusing (to me)
> difference between the column number and my expectations (supported by
> the virtual column number) if there are non-ASCII characters on the
> line. I don't know what the intended meaning of column count and the
> intended behaviour of cursor() are, but it seems they both depend on
> the size of the encoded characters. I always thought nth column was
> more or less a synonym for nth character on a line while nth virtual
> column meant nth cell on a screen line.
>
> Here is how to reproduce the observed behaviour. Start
>
>   vim -u NONE -U NONE
>
> and
>
>   :set encoding=utf-8
>   :set laststatus=2
>   :set statusline=[%c/%v]
>
> (The last line tells VIM to display the column and the virtual column.)
> Now enter two lines
>
>   abc
>   äbc
>
> (The first letter in the second line is a lower case A with umlaut.)
> While moving the cursor over the different characters on the first line
> the status line shows [1/1], [2/2], and [3/3], respectively,
> telling you that column and virtual column are equal. That is the
> expected behaviour as long as there are no special characters like tabs
> and non-printable characters.
>
> Now move the cursor over the characters in the second line. While the
> cursor is over the ä [1/1] is displayed, but the next characters
> result in [3/2] and [4/3], respectively. It seems as if ä (or any
> non-ASCII character, for that matter) is accounting for (at least) two
> columns while encoding is set to utf-8. Although I know that ä is
> represented by two bytes in UTF-8 encoding, I find this behaviour
> irritating because on the surface it's only one character. It even gets
> worse (IMHO) with characters that need three bytes in UTF-8 encoding,
> like LATIN CAPITAL LETTER A WITH DOT BELOW (0x1EA0), which increase the
> column number by three.
>
> Also the cursor() function shows this kind of interpretation of
> non-ASCII characters. Both
>
>   call cursor(2, 1)
>
> and
>
>   call cursor(2, 2)
>
> place the cursor on ä. To place it on b you need to
>
>   call cursor(2, 3)
>
> although I would expect that already the second example would place the
> cursor on b.
>
> I can think of two ways to circumvent this problem:
>
>   1) switching to encoding=latin1, which is not always an option
>      because of the need for characters outside the scope of latin1;
>
>   2) using only virtual column numbers in the status line, but this
>      gives different results when characters like tab or non-printables
>      are displayed in more than one screen cell (which is of course
>      reasonable).
>
> I don't know whether the shown behaviour is a bug or just a feature I
> don't like, but in summary I think column number should really
> represent a character count (i.e., corresponding to what the user sees),
> not a byte count depending on the underlying encoding.
>
> I have seen this behaviour in VIM 6.2, 6.3, 6.4, and 7.0, so changing
> the code will definitely introduce an incompatibility. So the final
> question is: What do you (Vimmers) and you (Bram) think: is there a need
> for a change?

I don't know why you call this a column count, in most places it's
called a byte count.  Perhaps in some places in the docs the remark
about this actually being a byte count is missing.

You could also want a character count.  But what is a character when
using composing characters?  E.g., when the umlaut is not included in
a character but added as a separate composing character?

It's not so obvious what to do.  In these situations I'd rather keep it
as it is.

-- 
DENNIS: Look,  strange women lying on their backs in ponds handing out
swords ... that's no basis for a system of government.  Supreme
executive power derives from a mandate from the masses, not from some
farcical aquatic ceremony.
 Monty Python and the Holy Grail PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///