Re: Request for discussion - how to make MC unicode capable

2007-03-11 Thread Pavel Tsekov
Hello Denis,

On Sun, 11 Mar 2007, Denis Vlasenko wrote:

 Side note: why are mc developer resources scarce? Think about that...

We'd better start a new thread for that discussion.

Thanks!


Re: Request for discussion - how to make MC unicode capable

2007-03-10 Thread Denis Vlasenko
On Saturday 24 February 2007 13:57, Pavel Tsekov wrote:
 Hello,
 
 I'd like to initiate a discussion on how to make MC
 Unicode capable, i.e. able to deal with multibyte
 character sets. I'd like to hear from the developers
 of the UTF-8 patch and from the ncurses maintainer.
 Anyone else who can contribute their expertise is
 also welcome. This has been a major shortcoming for
 quite some time and it needs to be addressed ASAP.

I'd say that in the long run Unicode, and its most
useful encoding UTF-8, are going to be the most
widespread. At the very minimum, mc should be usable
on pure UTF-8 systems (systems whose terminals and
filenames are in UTF-8). Currently it is not.

Regarding displaying texts in various encodings:
GNU recode is still available; don't try to be
all things to all people at once.
--
vda


Re: Request for discussion - how to make MC unicode capable

2007-03-10 Thread Denis Vlasenko
On Monday 26 February 2007 13:17, Egmont Koblinger wrote:
 On Sat, Feb 24, 2007 at 02:57:44PM +0200, Pavel Tsekov wrote:
  I'd like to initiate a discussion on how to make MC
  Unicode capable, i.e. able to deal with multibyte character sets.
...
...
   examine whether dropping support for one of these libraries would save
   noticeable developer resources or not. At this moment the resources to
   develop mc are IMHO much tighter than the resources on any site where
   either ncurses or slang has to be installed in order to install mc.

Side note: why are mc developer resources scarce? Think about that...
--
vda


Re: Request for discussion - how to make MC unicode capable

2007-02-27 Thread Pavel Roskin
On Sat, 2007-02-24 at 14:57 +0200, Pavel Tsekov wrote:
 Hello,
 
 I'd like to initiate a discussion on how to make MC
 Unicode capable, i.e. able to deal with multibyte
 character sets. I'd like to hear from the developers
 of the UTF-8 patch and from the ncurses maintainer.
 Anyone else who can contribute their expertise is
 also welcome. This has been a major shortcoming for
 quite some time and it needs to be addressed ASAP.

Yes, thank you for addressing this issue!  I just want to give you some
general advice based on my experience.

Don't try to keep backward compatibility from the beginning, no matter
how important it is.  Code for the most advanced API first, and then
backport the changes to older APIs if needed.

The main reason is that the new API introduces new concepts.  The
concepts are based on a better understanding of the issue.  Retaining
code that is not based on those concepts next to the new code would
create a maintenance nightmare.  In some cases, the new API enforces
new rules.  Don't let the offenders hide behind conditional
statements.

In the case of Unicode, the new concept is the distinction between bytes
and characters.  Many functions need to be checked to make sure they
don't mix them up.  It's totally impractical to write a preprocessor
conditional every time something is changed.  It's better to change the
code for Unicode support first and then think about how to provide
backward compatibility for the whole source tree with minimal changes
throughout the code.
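
A minimal illustration of that bytes/characters distinction, assuming a
UTF-8 locale and only the standard C multibyte APIs (this is not mc code):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");        /* pick up the terminal's encoding */

        const char *name = "na\xc3\xafve";       /* "naive" with i-umlaut, UTF-8 */
        size_t bytes = strlen(name);             /* 6 bytes       */
        size_t chars = mbstowcs(NULL, name, 0);  /* 5 characters  */

        printf("bytes: %zu, characters: %zu\n", bytes, chars);
        return 0;
    }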

Another reason is that the programmer's time is very expensive and
should be used properly.  A programmer should be testing how his code
works rather than whether it compiles against an old libc.  Very few
actual bugs (i.e. incorrect runtime behavior, as opposed to often
trivial compile issues) are discovered as a result of portability
problems.  Many more bugs are discovered on the primary development
system by the main developer.

People opposing the changes are often more vocal than those who need the
changes.  The latter category may not be using mc at all.  Perhaps they
tried mc and didn't like how it looked on a Unicode-capable terminal.
Or maybe they were affected by bugs caused by distribution patches.

Those who don't want the changes can usually be satisfied by later
changes that restore the old behavior or the old resource consumption.
Again, existing users could be asked to contribute portability fixes and
optimizations.  It's an easier job than converting the code to the new
concepts and untangling the mess of function interdependencies.

And those who threaten to switch to different software or to fork the
project are usually not very good contributors to begin with.  They
won't be missed.

In more practical terms, I suggest that mc use only one of ncurses or
S-Lang for Unicode.  Doing two ports would exhaust the already limited
resources.  I think the preference should be given to ncurses because
it's not trying to be an interpreted language or anything other than a
screen library.

-- 
Regards,
Pavel Roskin



Re: Request for discussion - how to make MC unicode capable

2007-02-27 Thread Thomas Dickey
On Tue, 27 Feb 2007, Pavel Roskin wrote:

 On Sat, 2007-02-24 at 14:57 +0200, Pavel Tsekov wrote:

 In the case of Unicode, the new concept is the distinction between bytes
 and characters.  Many functions need to be checked to make sure they
 don't mix them up.  It's totally impractical to write a preprocessor
 conditional every time something is changed.  It's better to change the
 code for Unicode support first and then think about how to provide
 backward compatibility for the whole source tree with minimal changes
 throughout the code.

That's what I did with dialog: the largest change was to make editing
of a multibyte/multicolumn string work properly.  In doing that, I added
useful functions that could be reused to make forms line up, etc.
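
For reference, a minimal sketch of the kind of width helper this needs
(not dialog's actual code; it assumes the current locale's multibyte
encoding and the standard mbstowcs()/wcswidth() APIs):

    #define _XOPEN_SOURCE 700
    #include <stdlib.h>
    #include <wchar.h>

    /* Hypothetical helper: how many terminal columns a multibyte string
     * occupies -- the quantity form layout needs instead of strlen(). */
    static int display_cols(const char *mbs)
    {
        size_t n = mbstowcs(NULL, mbs, 0);   /* wide characters needed */
        if (n == (size_t)-1)
            return -1;                       /* invalid in this locale */

        wchar_t *w = calloc(n + 1, sizeof *w);
        if (w == NULL)
            return -1;
        mbstowcs(w, mbs, n + 1);

        int cols = wcswidth(w, n);           /* -1 if non-printable */
        free(w);
        return cols;
    }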

-- 
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net


Re: Request for discussion - how to make MC unicode capable

2007-02-26 Thread Egmont Koblinger
On Sat, Feb 24, 2007 at 02:57:44PM +0200, Pavel Tsekov wrote:

 I'd like to initiate a discussion on how to make MC
 Unicode capable, i.e. able to deal with multibyte character sets.

Hi,

Here are some of my thoughts:

- First of all, before doing any work, this is a must-read for everyone:
  http://joelonsoftware.com/articles/Unicode.html
  One of the main points is: from the users' point of view it absolutely
  doesn't matter what the bytes are; the only thing that matters is that
  the users see every _letter_ correctly on the display. Byte sequences
  must always be converted accordingly. On the other hand, we'll see
  that it's often a must for mc to keep byte sequences unchanged. The
  other main point is: for _all_ byte sequences, inside mc, in the
  config and history files, in the vfs interface, everywhere, you _must_
  know which character set the string is in.

- Currently KDE has many more bugs with accented filenames than Gnome has.
  This is probably because they have different philosophies. Gnome treats
  filenames as byte sequences (as every Unix does) and only converts them
  to characters for display purposes, while KDE treats them as character
  sequences (QString or something like that). Probably due to this, KDE
  has a lot of trouble: it is absolutely unable to correctly handle
  filenames that are invalid byte sequences according to the locale, and
  it often performs extra, erroneous conversions. So I think the right
  way is to internally _think_ in byte sequences, and only convert them
  to/from characters when displaying them or doing regexp matches and so
  on.

- The same goes for file contents. Even in a UTF-8 environment, people
  want to display (read) and edit files with different encodings, and
  even if every text file used UTF-8 there would be other (non-text)
  files too. We shouldn't drop support for editing binary files, hex
  editor mode and so on.

- When the author of the well-known text editor joe began to implement
  UTF-8 support, I helped him with advice and later with bug reports. (He
  managed to implement a working version 2 weeks after he first heard of
  UTF-8 :-)) The result is IMHO a very well designed editor and I'd like
  to see something similar in mcview/mcedit. In order to help people
  migrate from 8-bit charsets to UTF-8, and in order to be able to view
  older files, it's important to support differing file encodings and
  terminal charsets. For example, it should be possible to view a Latin-1
  file inside a Latin-1 mc, to view a UTF-8 file in a Latin-1 mc
  (replacing non-representable characters with an inverted question mark
  or something like that), to view a Latin-1 file in a UTF-8 mc, and to
  view a UTF-8 file in a UTF-8 mc.

  - The terminal charset should be taken from nl_langinfo(CODESET) (that
    is, from the LANG, LC_CTYPE and LC_ALL variables) and (as opposed to
    vim) I believe there should be _no_ way to override it in mc (a
    minimal sketch of this appears at the end of this message). No one
    can expect correct behavior from any terminal application if these
    variables do not reflect the terminal's actual encoding, so it's the
    users' or software vendors' job to set them correctly; there is no
    reason why anyone would want to fix this in only one particular
    application. MC is not the place to fix it, and once it's fixed
    outside mc, mc should not provide an option to mess with it. (I have
    no experience with platforms that lack locale support; on such
    platforms it might make sense to create a terminal encoding option,
    and the need for it could be detected by the ./configure script.)

  - The file encoding should probably default to the terminal encoding,
    but should be easy to change in the viewer or editor (and in fact
    some auto-detection might be added, e.g. if the file is not valid
    UTF-8 then automatically fall back to the locale's legacy charset, or
    automatically assume UTF-8 if the file is valid -- see the second
    sketch at the end of this message. Joe has two boolean options that
    control these two ways of auto-guessing the file encoding.) This
    setting alters the way the file's content is interpreted (displayed
    on the screen, searched case-insensitively, etc.) and alters how the
    pressed keys are inserted into the file, but it does not alter the
    file itself (i.e. it does not perform iconv on it). This way the
    editor remains completely binary-safe. Obviously, displaying the file
    requires conversion from the file encoding to the terminal encoding;
    interpreting pressed keys requires conversion in the reverse
    direction.

- Currently mc with the UTF-8 patches has a bug: when you run it in a
  UTF-8 environment and copy a file whose name is invalid UTF-8 (copy
  means F5 then Enter), the file name is mangled: the invalid parts
  (characters that are _shown_ as question marks) are replaced with
  literal question marks. Care should be taken to always _think_ in bytes
  and only convert to characters for displaying and similar purposes, so
  that the byte sequences always remain unchanged.
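
As a minimal sketch of the terminal-charset detection mentioned above
(standard POSIX locale APIs only; this is not existing mc code):

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Honour LANG / LC_CTYPE / LC_ALL, then ask which character
         * encoding the terminal is supposed to use. */
        setlocale(LC_ALL, "");
        const char *codeset = nl_langinfo(CODESET);

        /* Note: some systems spell it "UTF-8", others "utf8", so a
         * real check would normalize the name first. */
        int term_is_utf8 = (strcmp(codeset, "UTF-8") == 0);

        printf("terminal codeset: %s (UTF-8: %s)\n",
               codeset, term_is_utf8 ? "yes" : "no");
        return 0;
    }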
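
And a sketch of the "fall back if the file is not valid UTF-8"
auto-detection mentioned above, using GLib's g_utf8_validate() since mc
already depends on GLib; the fallback charset here is only an example:

    #include <glib.h>

    /* Guess the encoding of a loaded buffer: if it validates as UTF-8,
     * treat it as UTF-8, otherwise fall back to a legacy charset
     * (hard-coded to ISO-8859-2 purely as an example). */
    static const char *guess_file_encoding(const char *buf, gssize len)
    {
        const gchar *end;

        if (g_utf8_validate(buf, len, &end))
            return "UTF-8";
        return "ISO-8859-2";
    }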

Re: Request for discussion - how to make MC unicode capable

2007-02-26 Thread Vladimir Nadvornik
On Sunday 25 February 2007 14:41, Leonard den Ottolander wrote:
 Hello Pavel,

 On Sat, 2007-02-24 at 14:57 +0200, Pavel Tsekov wrote:
  I'd like to initiate a discussion on how to make MC
  Unicode capable, i.e. able to deal with multibyte character sets.
 

The current utf-8 patches are based on utf-8 support in glibc.
I don't know if utf-8 is needed on other systems.

 
 Just a few thoughts:

 - Because multibyte is rather more memory hungry, I think the user should
 still have the option to toggle the use of an 8-bit code path, either in
 the interface or at compile time. This means that where the UTF-8 patches
 replace code paths we should preferably implement two paths.

The situation with the utf-8 patches is as follows:
In the editor the UTF-8 text is converted to wchar. This requires 4 times
more memory, but allows the code to be kept almost the same.
In the rest of mc the UTF-8 byte strings are used directly and the memory
requirements are more or less the same as with 8-bit charsets.
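
A minimal sketch of that editor-side conversion (standard C APIs only,
not the actual patch code); the factor of four comes from sizeof(wchar_t)
being 4 bytes on glibc:

    #include <stdlib.h>
    #include <wchar.h>

    /* Convert one line of locale-encoded (e.g. UTF-8) text into a newly
     * allocated wide-character buffer. Every character now takes
     * sizeof(wchar_t) bytes -- 4 on glibc -- hence the ~4x memory cost. */
    static wchar_t *line_to_wchar(const char *line)
    {
        size_t n = mbstowcs(NULL, line, 0);   /* number of wide chars */
        if (n == (size_t)-1)
            return NULL;                      /* invalid multibyte data */

        wchar_t *buf = malloc((n + 1) * sizeof *buf);
        if (buf != NULL)
            mbstowcs(buf, line, n + 1);
        return buf;
    }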


 - I suppose a lot of the code of the UTF-8 patch can be reused; we will
 only need to add iconv() calls in the appropriate places. libiconv is
 already expected, so there is not much trouble with the makefiles there.
 iconv should only be used for the multibyte path, not the 8-bit path.
 Using the multibyte path would still enable users to translate from one
 8-bit charset to another.
 - The substitution character for unsupported characters should be an ini
 option (with some defaults defined for all/many character sets). (I'm not
 sure the question mark is supported in all character sets.)
 - Users should be able to set the character set per directory (mount). Of
 course there should be a system-wide default taken from the environment
 (but also overridable).
 - Copy/move dialogs should have a toggle to iconv the file name or do a
 binary name copy.
 - Maybe copy/move dialogs should also have a toggle to iconv the file
 content, which could be quite usable for text files. A warning dialog on
 every copy/move (that the user explicitly has to disable) might be a good
 addition then, to help uninformed users avoid screwing up their data.


The code in charsets.c is not compatible with utf-8 and needs to be completely 
rewritten. For example, the function convert_to_display(char *str) can't be 
used for converting to utf-8 where the string actually grows.
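
A rewritten conversion routine would have to return a newly allocated
string instead of converting in place, roughly like this sketch (plain
iconv(); the function name is hypothetical, not the proposed mc API):

    #include <iconv.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical replacement for convert_to_display(): writes into a
     * newly allocated buffer, so the result is free to be longer than
     * the input (as happens when converting 8-bit text to UTF-8).
     * 'cd' comes from iconv_open(terminal_charset, file_charset). */
    static char *convert_to_display_alloc(iconv_t cd, const char *src)
    {
        size_t inleft = strlen(src);
        size_t outsize = inleft * 4 + 1;    /* enough for UTF-8 output */
        char *out = malloc(outsize);
        if (out == NULL)
            return NULL;

        char *inptr = (char *) src;
        char *outptr = out;
        size_t outleft = outsize - 1;

        while (inleft > 0 && outleft > 0) {
            if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
                /* Unconvertible or invalid byte: emit '?' and skip it. */
                *outptr++ = '?';
                outleft--;
                inptr++;
                inleft--;
            }
        }
        *outptr = '\0';
        return out;
    }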

With the current utf-8 patches charsets can't be used in utf-8 locales.


-- 
Vladimir Nadvornik
developer
-
SUSE LINUX, s. r. o.    e-mail: [EMAIL PROTECTED]
Lihovarská 1060/12      tel: +420 284 028 967
190 00 Praha 9          fax: +420 284 028 951
Czech Republic          http://www.suse.cz


Re: Request for discussion - how to make MC unicode capable

2007-02-26 Thread Egmont Koblinger
On Sun, Feb 25, 2007 at 02:41:45PM +0100, Leonard den Ottolander wrote:
 Just a few thoughts:
 
 - Because multibyte is rather more memory hungry, I think the user should
 still have the option to toggle the use of an 8-bit code path, either in
 the interface or at compile time. This means that where the UTF-8 patches
 replace code paths we should preferably implement two paths.

Multibyte is memory hungry if you use UCS-4 internally, which I don't
recommend (e.g. viewing a 10MB log file would need 40MB of memory - this
would really be awful). But if you use Latin-1, UTF-8 or whatever
internally, there's no problem. My proposal is to keep storing the
original byte sequences in memory; in that case memory consumption
doesn't grow in the 8-bit case.

On the other hand, separate execution paths should be avoided as much as
possible; I hope it's needless to explain why. Most of the glibc functions,
and the wrappers we could write around them, are perfectly able to handle
every charset, no matter whether it's 8-bit or UTF-8 or something else.
E.g. if we implement a general mbstrlen() that returns the number of
Unicode entities, and strwidth() that returns the width, they'll work both
in UTF-8 and in Latin-1. In Latin-1 they'll always return the same value,
but it's not worth branching the code and using separate code for the
8-bit cases just because of this. Just write and test one piece of code:
the general case that covers both UTF-8 and the 8-bit charsets, and
probably EUC-JP and others too.
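
A minimal sketch of two such wrappers (the names are from this mail, the
bodies are only illustrative); note the single code path for whatever
charset the locale uses:

    #define _XOPEN_SOURCE 700
    #include <string.h>
    #include <wchar.h>

    /* Number of characters (Unicode entities) in a multibyte string,
     * in whatever charset the current locale uses. */
    static size_t mbstrlen(const char *s)
    {
        mbstate_t st;
        wchar_t wc;
        size_t used, n = 0, left = strlen(s);

        memset(&st, 0, sizeof st);
        while (left > 0 && (used = mbrtowc(&wc, s, left, &st)) != 0) {
            if (used == (size_t)-1 || used == (size_t)-2) {
                used = 1;                /* count an invalid byte as one */
                memset(&st, 0, sizeof st);
            }
            s += used;
            left -= used;
            n++;
        }
        return n;
    }

    /* Number of screen columns the string occupies. */
    static int strwidth(const char *s)
    {
        mbstate_t st;
        wchar_t wc;
        size_t used, left = strlen(s);
        int w, cols = 0;

        memset(&st, 0, sizeof st);
        while (left > 0 && (used = mbrtowc(&wc, s, left, &st)) != 0) {
            if (used == (size_t)-1 || used == (size_t)-2) {
                used = 1;                /* invalid byte: assume one cell */
                w = 1;
                memset(&st, 0, sizeof st);
            } else {
                w = wcwidth(wc);
                if (w < 0)
                    w = 1;               /* non-printable: assume one cell */
            }
            cols += w;
            s += used;
            left -= used;
        }
        return cols;
    }

In a Latin-1 locale both degenerate to strlen()-like behaviour, which is
exactly the point: no separate 8-bit branch is needed.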

 - I suppose a lot of the code of the UTF-8 patch can be reused; we will
 only need to add iconv() calls in the appropriate places. libiconv is
 already expected, so there is not much trouble with the makefiles there.
 iconv should only be used for the multibyte path, not the 8-bit path.
 Using the multibyte path would still enable users to translate from one
 8-bit charset to another.

As said above, I think separate code paths should be avoided. As discussed
in my previous mail, the story is not simply black and white (8-bit vs.
UTF-8); there are mixed scenarios as well (viewing a UTF-8 file in a
Latin-1 terminal, or a Latin-1 file in a UTF-8 terminal, etc.).

 - The substitution character for unsupported characters should be an ini
 option (with some defaults defined for all/many character sets). (I'm not
 sure the question mark is supported in all character sets.)

I don't think mc should support any non-ASCII-compatible (e.g. EBCDIC)
character sets. They'd make things much, much more complicated and would
result in a feature probably no one would ever use. The question mark is
available in all other character sets.

I really don't care whether the substitute for an unsupported character
(e.g. when mc wants to display a kanji but is unable to because the
terminal is Latin-1) is configurable or not (actually a hardcoded inverted
question mark is fine for me) -- it's not an important issue at all.

If a UTF-8 terminal is used (which is now the case in most Linux
distributions, at least in every distribution that matters in my eyes),
then U+FFFD is the right character for invalid byte sequences, and it
could also be used (I think) to denote non-printable (!iswprint())
characters.
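
For illustration, a sketch of that substitution while building a display
string (assumes a UTF-8 terminal; U+FFFD is the byte sequence 0xEF 0xBF
0xBD in UTF-8; the helper is hypothetical):

    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    /* Append one character of 'in' to the display buffer 'out',
     * replacing invalid bytes and non-printable characters with U+FFFD.
     * Returns the number of input bytes consumed. Sketch only: 'out' is
     * assumed to have room for a few more bytes. */
    static size_t append_displayable(const char *in, size_t inlen,
                                     char *out, size_t *outlen,
                                     mbstate_t *st)
    {
        static const char replacement[] = "\xEF\xBF\xBD";   /* U+FFFD */
        wchar_t wc;
        size_t used = mbrtowc(&wc, in, inlen, st);

        if (used == (size_t)-1 || used == (size_t)-2 || !iswprint(wc)) {
            memcpy(out + *outlen, replacement, 3);
            *outlen += 3;
            memset(st, 0, sizeof *st);
            return 1;                    /* skip a single invalid byte */
        }
        memcpy(out + *outlen, in, used);
        *outlen += used;
        return used;
    }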

 - Users should be able to set the character set per directory (mount). Of
 course there should be a system-wide default taken from the environment
 (but also overridable).

No. MC should not try to fix what's broken outside mc. For vfat-like
file systems, there's the iocharset= mount option. For Linux file systems
there's no such option, so it really sucks if you have two file systems
using two different encodings; but if it bothers you, either use convmv to
convert these files, or patch the kernel to support iocharset=... and
filesystemcharset=... mount options for Linux file systems that do this
conversion. If there's no way for you to see the filenames correctly
throughout your system with the echo or ls commands, it's not mc's job to
fix it.

I do believe there are many, many things to do in mc, and only scarce
developer resources. I can't even see when mc will be able to properly
support UTF-8 on systems that are properly set up. Based on the experience
of converting a full distro from Latin-2 to UTF-8, I must say that mc is
the only important piece of software where UTF-8 support is necessary but
the mainstream version completely lacks it. It is quite urgent to do
something about it. Unfortunately I can't invest much time in it either.
Let's try to do no more than support properly set up systems. Let's not
try to provide workarounds for system misconfigurations and such.

I believe that if you see different filename encodings on your system then
your system is not properly set up. Go and fix it _there_ and leave mc's
developers alone. Having various encodings _inside_ text files is a
different issue, however; mc should deal with that...

And, by the way, is there any reason not to trust the environment
variables and to provide a way to override them instead? Yet again it's a
system configuration issue: if your env vars are wrong, fix them there.

Re: Request for discussion - how to make MC unicode capable

2007-02-25 Thread Leonard den Ottolander
Hello Pavel,

On Sat, 2007-02-24 at 14:57 +0200, Pavel Tsekov wrote:
 I'd like to initiate a discussion on how to make MC
 Unicode capable, i.e. able to deal with multibyte character sets.

Just a few thoughts:

- Because multibyte is rather more memory hungry, I think the user should
still have the option to toggle the use of an 8-bit code path, either in
the interface or at compile time. This means that where the UTF-8 patches
replace code paths we should preferably implement two paths.
- I suppose a lot of the code of the UTF-8 patch can be reused; we will
only need to add iconv() calls in the appropriate places. libiconv is
already expected, so there is not much trouble with the makefiles there.
iconv should only be used for the multibyte path, not the 8-bit path.
Using the multibyte path would still enable users to translate from one
8-bit charset to another.
- The substitution character for unsupported characters should be an ini
option (with some defaults defined for all/many character sets). (I'm not
sure the question mark is supported in all character sets.)
- Users should be able to set the character set per directory (mount). Of
course there should be a system-wide default taken from the environment
(but also overridable).
- Copy/move dialogs should have a toggle to iconv the file name or do a
binary name copy (a rough sketch of this follows below).
- Maybe copy/move dialogs should also have a toggle to iconv the file
content, which could be quite usable for text files. A warning dialog on
every copy/move (that the user explicitly has to disable) might be a good
addition then, to help uninformed users avoid screwing up their data.

These are the things I can come up with so far.
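
On the copy/move filename toggle above, a minimal sketch of what the
"iconv the file name" branch might look like, using GLib's g_convert()
since mc already depends on GLib (the helper and its charset arguments
are only placeholders, not a proposed mc API):

    #include <glib.h>

    /* Hypothetical helper for the copy/move dialog: either recode the
     * file name from the source charset to the destination charset, or
     * (binary toggle) keep the bytes verbatim. */
    static char *target_name(const char *src_name, gboolean recode,
                             const char *from_charset,
                             const char *to_charset)
    {
        if (!recode)
            return g_strdup(src_name);       /* binary name copy */

        gsize read_bytes, written;
        GError *err = NULL;
        char *converted = g_convert(src_name, -1, to_charset, from_charset,
                                    &read_bytes, &written, &err);
        if (converted == NULL) {
            g_clear_error(&err);
            return g_strdup(src_name);       /* fall back to raw bytes */
        }
        return converted;
    }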

Leonard.

-- 
mount -t life -o ro /dev/dna /genetic/research

