Re: NFS4 requires UTF-8

2002-02-21 Thread Gaspar Sinai



On Thu, 21 Feb 2002, Glenn Maynard wrote:

 On Thu, Feb 21, 2002 at 01:26:33PM +0900, Gaspar Sinai wrote:
  I just browsed through RFC-3010 and I found one thing that
  bothers me and it has not been discussed yet (I think).
 
  RFC says:
   The NFS version 4 protocol does not mandate the use
   of a particular  normalization form at this time.
 
  How do we mount something that contains a precomposed
  character like:
 
U+00E1 (Composed of U+0061 and U+0301)
 
  If the U+0061 U+0301 is used and our server is assuming U+00E1,
  can a malicious hacker set up another NFS server that has
  U+0061 and U+0301 to mount his NFS volume? I could even
  imagine very tricky combinations with Vietnamese text
  but that would be another question...
 
  Forgive my ignorance if this was discussed - I did not see it
  in the archives.
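The collision being described can be demonstrated in a few lines (Python's unicodedata module is used here purely for illustration; nothing in the thread depends on Python):

```python
import unicodedata

# The same visible character in the two encodings discussed above.
precomposed = "\u00E1"     # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"     # U+0061 + U+0301 COMBINING ACUTE ACCENT

# They render identically, but as the byte strings a server or
# filesystem actually compares, they are different names.
assert precomposed != decomposed
assert precomposed.encode("utf-8") == b"\xc3\xa1"
assert decomposed.encode("utf-8") == b"a\xcc\x81"

# A mandatory normalization form (NFC here) would make them agree.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```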

 One thing that's bound to be lost in the transition to UTF-8 filenames:
 the ability to reference any file on the filesystem with a pure CLI.
 If I see a file with a pi symbol in it, I simply can't type that; I have
 to copy and paste it or wildcard it.  If I have a filename with all
 Kanji, I can only use wildcards.

 A normalization form would help a lot, though. It'd guarantee that in
 all cases where I *do* know how to enter a character in a filename,
 I can always manipulate the file.  (If I see cár, I'd be able to cat
 cár and see it, reliably.)

 I don't know who would actually normalize filenames, though--a shell
 can't just normalize all args (not all args are filenames) and doing it
 in all tools would be unreliable.

 A mandatory normalization form would also eliminate visibly duplicate
 filenames.  Of course, it can't be enforced, but tools that escape
 filenames for output could change unnormalized text to \u/\U.
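Such escaping of unnormalized names could be sketched as follows (a hypothetical helper, not any existing tool; `is_normalized` exists in Python 3.8+):

```python
import unicodedata

def escape_if_unnormalized(name: str) -> str:
    """Return the name unchanged if it is already NFC; otherwise
    escape every non-ASCII code point as \\uXXXX / \\UXXXXXXXX."""
    if unicodedata.is_normalized("NFC", name):
        return name
    return "".join(
        ch if ord(ch) < 128
        else ("\\u%04X" % ord(ch) if ord(ch) <= 0xFFFF else "\\U%08X" % ord(ch))
        for ch in name
    )

# A normalized name passes through; a decomposed one is made visible.
assert escape_if_unnormalized("c\u00E1r") == "c\u00E1r"
assert escape_if_unnormalized("ca\u0301r") == "ca\\u0301r"
```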

 I don't quite understand the scenario you're trying to describe, though.

What I was thinking is this:

An NFS server may export something that is meant to be the
same but that, for lack of mandatory normalization, is in fact
different from what the client tries to mount. Is it possible
for someone to use the same machine and export a different
volume under the same name as the one the client expects?

It may be a different question, but can the machine name
be played with? Can this affect the name of the
machine itself, or only directories and filenames?

Thank you,
gaspar


--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Pablo Saratxaga

Kaixo!

On Thu, Feb 21, 2002 at 03:10:32AM -0500, Glenn Maynard wrote:

 One thing that's bound to be lost in the transition to UTF-8 filenames:
 the ability to reference any file on the filesystem with a pure CLI.
 If I see a file with a pi symbol in it, I simply can't type that; I have
 to copy and paste it or wildcard it.  If I have a filename with all
 Kanji, I can only use wildcards.

Well, it won't happen often that you will have to manipulate files with
names including characters you cannot type.
Usually you manage your own files, and it is you who typed their filenames.

Kanji or the letter pi can very well be typed in a CLI environment,
using a Japanese XIM or a Greek keyboard respectively.

It isn't that much of a problem.
 
 A normalization form would help a lot, though. It'd guarantee that in

That however is indeed a problem.

A problem similar to the case-insensitivity in Windows, where you could,
at least with old versions, load a file named one way and save it
another way; if you were using a case-sensitive fs (eg a fs on a Unix mounted
by SMB on the Windows machine) you ended up with different files and a
real mess.

The same thing could happen here; well, not as bad, as I don't think any
program will purposely *change* the chars composing a filename previously
selected (eg when doing open then save there wouldn't be any name
change); but when a user types a filename manually it could happen
that the system will tell him no such filename and he will be puzzled
as he sees there is; as there is no visual difference between a precomposed
character like aacute and two characters a and composing acute accent.

This reminds me of a discussion in pango and the ability to have different
view and edit modes: normal (with text showing as expected), and another
mode where composing chars are de-composed, and invisible control characters
(such as zwj, etc) are made visible.

 I don't know who would actually normalize filenames, though--a shell
 can't just normalize all args (not all args are filenames) and doing it
 in all tools would be unreliable.

The normalization should be done at the input method layer; that way it will
be transparent and hopefully, if all OSes do the same, the potential problem
of duplicates will never happen.
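Applied at the input-method layer, that normalization is a one-line transform on whatever the dead keys or compose sequences produce (a minimal sketch):

```python
import unicodedata

def im_commit(keystrokes: str) -> str:
    # Normalize to NFC before delivering text to the application, so
    # a dead-key sequence (a + combining acute) and a precomposed
    # keysym both arrive as the same code point, U+00E1.
    return unicodedata.normalize("NFC", keystrokes)

assert im_commit("a\u0301") == "\u00E1"    # decomposed input
assert im_commit("\u00E1") == "\u00E1"     # precomposed input
assert im_commit("plain ascii") == "plain ascii"
```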


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/
PGP Key available, key ID: 0x8F0E4975

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Glenn Maynard

On Thu, Feb 21, 2002 at 11:08:24AM +0100, Radovan Garabik wrote:
  One thing that's bound to be lost in the transition to UTF-8 filenames:
  the ability to reference any file on the filesystem with a pure CLI.
  If I see a file with a pi symbol in it, I simply can't type that; I have
  to copy and paste it or wildcard it.  If I have a filename with all
  Kanji, I can only use wildcards.

(Er, meant copy and paste for the last; wildcards aren't useful for
selecting a filename where you can't enter *any* of the characters,
unless the length is unique.)

 sorry, but that is just plain impossible. For one thing, the c can
 quite well be U+0441, CYRILLIC SMALL LETTER ES, ditto for other
 letters. But I agree that normalization can save us a lot of headache.

Normalization would catch the cases where it's impossible to tell from
context what it's likely to be.

 Input method should produce normalized characters. Since most
 filenames are somehow produced via human operation, it would 
 catch most of the pathological cases.

Not just at the input method.  I'm in Windows; my input method produces wide
characters, which my terminal emulator catches and converts to UTF-8, so my
terminal would need to follow the same normalization as input methods in X.

Terminal compose keys and real keybindings (actual non-English
keyboards) are other things an IM isn't involved in; terminals and GUI
apps (or at least widget sets) would need to handle it directly.

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Radovan Garabik

On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote:
 
 I'm not even convinced that it's a good idea to force file names to be
 in UTF-8. Perhaps it would be simpler and more robust to let file
 names be any null-terminated string of octets and just recommend that
 people use (some normalisation form of) UTF-8. That way you won't have
 the problem of some files (with ill-formed names) being visible
 locally but not remotely because the server or the client is either
 blocking the names or normalising them in some weird and unexpected
 way.

Certainly, this kind of normalization is evil and should be avoided.
The normalization I am thinking about should ensure the filenames are stored
on the server in as sane a way as possible.

Once the filename is written to the fs, it should remain there and
transparently _without any change_ be exported to clients (be it
just a program doing open() or a remote network client). It could be
changed via mount option, like current linux NLS implementation, 
but in no other way.

-- 
 ---
| Radovan Garabik http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__garabik @ melkor.dnp.fmph.uniba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Glenn Maynard

On Thu, Feb 21, 2002 at 11:59:14AM +0100, Pablo Saratxaga wrote:
 It isn't that much of a problem.

I think it's not a completely trivial loss, compared to an ASCII environment
where filenames were completely unambiguous (invalid characters being
escaped.)  There doesn't seem to be any obvious fix, so I suppose it's
just a price paid.

 The same thing could happen here; well, not as bad, as I don't think any
 program will purposely *change* the chars composing a filename previously
 selected (eg when doing open then save there wouldn't be any name
 change); but when a user types a filename manually it could happen

If a program wants to operate in a normalized form internally, it might,
but that's probably asking for trouble anyway.

 that the system will tell him no such filename and he will be puzzled
 as he sees there is; as there is no visual difference between a precomposed
 character like aacute and two characters a and composing acute accent.

Should control characters ever end up in filenames?  I'd be surprised if
many terminal emulators handled copy and paste with control characters
well, if at all.  (They don't need to be drawn, so I'd expect most that
don't use them would just discard them.)

06:29am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBF`;'
06:29am [EMAIL PROTECTED]/2 [~/testing] ls

06:29am [EMAIL PROTECTED]/2 [~/testing] ls -l
total 0
-rw-r--r--   1 glenn    users           0 Feb 21 06:29

(rm)

06:31am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBFfile`;'
06:31am [EMAIL PROTECTED]/2 [~/testing] ls
file
06:31am [EMAIL PROTECTED]/2 [~/testing] cat file
cat: file: No such file or directory

I can't copy and paste it.  Wildcards wouldn't help much if I'd stuck BOMs
between letters (and *f*i*l*e* isn't very obvious, especially if you
don't know what's going on, or if one's not really the letter it looks
like), and tab completion may or may not help, depending on the shell.
(Someone mentioned moving everything out of the directory and rm -f'ing;
I should never have to do that.)

Are control characters (and all non-printing characters) useful in filenames
at all?  If not, they should be escaped, too, to avoid this kind of problem.

(Another one, perhaps: a character with a ton of combining characters on
top of it.  Most terminal emulators won't deal with an arbitrary number
of them.)
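One way a tool could escape such characters (a sketch; the category test is a rough stand-in for wcwidth(c)==0, which Python's standard library does not expose):

```python
import unicodedata

def is_zero_width(ch: str) -> bool:
    # Combining marks (Mn, Me) and format characters (Cf, e.g. the
    # BOM/ZWNBSP and zwj) typically occupy zero terminal columns.
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")

def escape_invisible(name: str) -> str:
    """Escape characters a terminal would draw in zero columns."""
    return "".join(
        "\\u%04X" % ord(ch) if is_zero_width(ch) else ch
        for ch in name
    )

# The BOM-prefixed name from the session above becomes visible:
assert escape_invisible("\uFEFFfile") == "\\uFEFFfile"
# An ordinary name is left alone:
assert escape_invisible("file") == "file"
```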

 This reminds me of a discussion in pango and the ability to have different
 view and edit modes: normal (with text showing as expected), and another
 mode where composing chars are de-composed, and invisible control characters
 (such as zwj, etc) are made visible.

Reveal codes for filenames? :)

  I don't know who would actually normalize filenames, though--a shell
  can't just normalize all args (not all args are filenames) and doing it
  in all tools would be unreliable.
 
 The normalization should be done at the input method layer; that way it will
 be transparent and hopefully, if all OSes do the same, the potential problem
 of duplicates will never happen.

See my other response: characters are often entered in other ways than a
nice modularized input method; terminal emulators will need to behave in
the same way as IMs for this to work, as well as GUIs at some layer.

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Glenn Maynard

On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote:
 People are advocating normalisation as a solution for various kinds of
 file name confusion, but I can imagine normalisation making things
 worse.
 
 For example, file names with a trailing space can certainly be
 confusing, but would life be any simpler if some programmer decided to
 strip trailing white space at some point in the processing of a file
 name? I don't think so. You would then potentially have files that are
 not just hard to delete, but impossible to delete.

If I have two computers, one sending precomposed and one not, I can't
access my câr file created on one on the other.  If terminal emulators,
IMs, etc. send normalized characters, this isn't a problem.  (It doesn't
fix all problems, but it would help fix up some of the major ones.)

Then, if ls displays a filename which doesn't fit the
normalization form expected for filenames, it can display it in a way that
shows what it really is.  (c\u00E2r.)  (Optional, of course.)  This is less
useful with the other unavoidable glyph ambiguities, though.

cat certainly shouldn't normalize its arguments.

 I'm not even convinced that it's a good idea to force file names to be
 in UTF-8. Perhaps it would be simpler and more robust to let file
 names be any null-terminated string of octets and just recommend that
 people use (some normalisation form of) UTF-8. That way you won't have
 the problem of some files (with ill-formed names) being visible
 locally but not remotely because the server or the client is either
 blocking the names or normalising them in some weird and unexpected
 way.

I'm not suggesting NFS normalize anything; this is just as important on
a single system being accessed from multiple terminals.

Sorry, the switch from NFS to filenames in general wasn't clear.

 What's so bad about just being 8-bit clean?

Oh, network protocols *should* be 8-bit clean for filenames (minus nul).
If I have a remote file with an invalid filename (overlong UTF-8
sequence or just plain garbage), I'd better be able to access it over
NFS.  I don't think the FS (NFS, local filesystem, FTP, whatever) should
touch filenames at all.  (Mandating that they be UTF-8 in the standard
is a good thing; enforcing it at the FS layer is not.)

Related: I frequently can't touch filenames with non-English characters
over Samba, or filenames with characters Windows bans from filenames.
Windows displays them as some random-looking series of characters, and it
doesn't always map back correctly.  This doesn't really have anything to do
with the network protocol--though the actual implementation problem might
be in there--it's that it doesn't deal with invalid filenames properly.

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Pablo Saratxaga

Kaixo!

On Thu, Feb 21, 2002 at 06:50:27AM -0500, Glenn Maynard wrote:
 On Thu, Feb 21, 2002 at 11:59:14AM +0100, Pablo Saratxaga wrote:
  It isn't that much of a problem.
 
 I think it's not a completely trivial loss, compared to an ASCII environment
 where filenames were completely unambiguous 

I don't know; I have never used an ascii environment; I need at the
very least iso-8859-1 :)

  that the system will tell him no such filename and he will be puzzled
  as he sees there is; as there is no visual difference between a precomposed
  character like aacute and two characters a and composing acute accent.
 
 Should control characters ever end up in filenames?  I'd be surprised if
 many terminal emulators handled copy and paste with control characters
 well, if at all.

Well, it sometimes happens to me that I hit Ctrl-V by accident, then
another key, and end up with a filename containing escape and other ctrl sequences.

  The normalization should be done at the input method layer; that way it will
  be transparent and hopefully, if all OSes do the same, the potential problem
  of duplicates will never happen.
 
 See my other response: characters are often entered in other ways than a
 nice modularized input method; terminal emulators will need to behave in
 the same way as IMs for this to work, as well as GUIs at some layer.

I consider the code that allows typing dead keys for accents, the
compose key, etc. to be an input method too.

A terminal emulator doesn't need to do anything; it doesn't handle input
itself (a real terminal does, but a terminal emulator is just another
window on the screen, like any other program, from the input perspective).

So, what should be addressed is an agreement on what input methods,
keyboards, compose, etc. should produce.

IMHO it should be normalized in a known and predictable way, and if possible
using the same normalization across systems and different operating systems,
so the same keystroke will produce the same result.


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/
PGP Key available, key ID: 0x8F0E4975

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Radovan Garabik

On Fri, Feb 22, 2002 at 02:24:31AM +0900, Tomohiro KUBOTA wrote:
 Hi,
 
 At Thu, 21 Feb 2002 17:36:57 +0100,
 Keld Jørn Simonsen wrote:
 
  I can type ¦ and ð directly from the keyboard with my standard
  X danish keyboard, just as easily as I can type @. Can't you?
  
  If this is still a problem with some X keyboards, I would say that we
  should try then to enhance them. I did it for Danish, Norwegian,
  Swedish and Finnish X keyboards, and it should be done for others too.

Are Swedish and Finnish keyboards different? I thought they used the same
layout (Finns giving up š and ž in favour of å).

  I do not know the status right now, but maybe we could make
  an overview of X keyboards in this respect. 
 
 I (Japanese) cannot.  Though I may be able to input them by some

neither can I (for most ISO-8859-1 characters). I usually just
hit Compose key and some combination vaguely resembling the char and
hope for the best - often it takes several tries to get the correct one.
I can enter Slovak characters easily, but I had to write my own xkb
map (the standard one included in XFree86 was just unusable).

Btw is it possible (with xkb) to do something like a per-map dead-key
compose?
E.g. when I hit the dead key (dead_acute) with a vowel, I get the
accented vowel correctly, but I want (e.g.) the combination dead_acute + s
to yield LATIN SMALL LETTER S WITH CARON. And similarly for other
combinations.
I know I can hack up my own compose map, but:
1) that would mess up other keyboard layouts
2) I want to retain Compose + s + acute to yield
   LATIN SMALL LETTER S WITH ACUTE


 settings, I don't know how.  It is just as average European people
 don't know how to input Kanji.

I would love to. Perhaps it would not be bad to write a compose
map providing Compose + k + t + a to yield KATAKANA LETTER TA
and Compose + h + t + a to yield HIRAGANA LETTER TA. Or something.


-- 
 ---
| Radovan Garabik http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__garabik @ melkor.dnp.fmph.uniba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: brocken bar and UCS keyboard

2002-02-21 Thread Keld Jørn Simonsen

On Thu, Feb 21, 2002 at 05:34:29PM +, Markus Kuhn wrote:
 Keld wrote on 2002-02-21 16:36 UTC:
  I can type ¦ and ð directly from the keyboard with my standard
  X danish keyboard
 
 I'm glad to hear that you are one of the ~12 people in Europe who know
 how to enter ¦ under XFree86 directly from the keyboard. [Even though
 the GB keyboard has an extra key location for ¦, it normally leads to
 the entry of |, because that is what 99.9997% of all people pressing
 this key actually wanted to enter (for shell pipe, C or, etc.)].

Well, I designed how to get there so I should know.
You are probably right that very few know. But it is on top of
the | character so it would be easy to guess.

 Perhaps you are even one of the 5 people in Europe who know what this
 character is good for and why it was needed in addition to |? (The
 standard excuse EBCDIC compatibility does not count here ... ;-)

I don't know either :-)

 If we update the keyboard mappings, please do not give any special
 priority to ISO 8859-1 characters. There are far more important
 characters in UCS than full ISO 8859-1 coverage.

Probably, but when I did that, the 8859 characters were the ones that
were useful.

 In particular, very urgently missing on English keyboards is the EN
 DASH. I am fed up with seeing hyphen signs being used everywhere as
 dashes. It hurts my typographic eye and this abuse proves every day
 again that the historic keyboard layouts that were developed originally
 for monospaced ASCII/Latin-1 typewriters are utterly inadequate for
 contemporary word processing needs; the massive abuse of the hyphen
 as a dash and minus (for which there are no officially designated keys)
 is the most significant worry.

So where should it go? Alt-minus?
 
 Something has to be done by the keyboard standards community urgently.
 The application and printing community has fixed the problem long ago
 with the use of CP1252 and UCS, but users still have no clue about how
 to enter a dash or minus sign on their keyboard, and even under
 platforms such as Win32, each application has its own conventions. Most
 national variants of ISO 9995 cover today only the repertoire of MES-1
 (ISO 6937 plus the EURO SIGN), which lack
 
   EN SPACE
   EM SPACE
   MINUS
 
 and other essential typographic characters. Nobody uses ISO 6937 and
 western keyboards really should cover the CP1252 subset of UCS properly,
 because that is what word processing files are encoded in today, and
 that reflects actual needs.
 
 How do we fix this in the keyboard standards and how do we get the fix
 onto the market? Any suggestions?

It is really hard to get something done. What we can do is something
with X. Getting the physical layout changed is much harder, unless you
want to split the keyboard, take off the keys and rearrange them.
Could be done. Costs some money. But you can do it on a small
scale and then try to pull it off in the big. But think
of Dvorak keyboards; they never took off. I have tried to persuade
Cherry to introduce some plug-and-play identification so the
keyboard could identify itself when asked, but without luck yet.
Everything else nowadays identifies itself on a system.

We can make em space happen in X, and en space. And minus.
With current keyboards. As I almost exclusively run Linux, that
would make me happy.

Keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: brocken bar and UCS keyboard

2002-02-21 Thread Markus Kuhn

Keld Simonsen wrote:
  How do we fix this in the keyboard standards and how do we get the fix
  onto the market? Any suggestions?
 
 It is really hard to get something done. What we can do is something
 with X. Getting the physical layout changed is much harder, unless you
 want to split the keyboard, take off the keys and rearrange them.
 Could be done. Costs some money. But you can do it on a small
 scale and then try to pull it off in the big. But think
 of Dvorak keyboards; they never took off.

Try to think of the Windows keys on the other hand ...

 I have tried to persuade
 Cherry to introduce some plug-and-play identification so the
 keyboard could identify itself when asked, but without luck yet.
 Everything else nowadays identifies itself on a system.

I'm typing this on a USB keyboard, which identifies its layout (well,
actually more a sort of keyboard-specific country code, nothing really
well-engineered; complaints to [EMAIL PROTECTED]) to the operating
system.

http://www.usb.org/developers/docs.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: http://www.cl.cam.ac.uk/~mgk25/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Glenn Maynard

By the way, to all of the people in this thread discussing input of other-language
text: I was showing a loss from ASCII--you can't type all filenames
because some of them will have characters you can't necessarily type.
This was a minor point, since (as I've said) it can't really be fixed.

(Well, it could be fixed, but not cleanly.)

OTOH, the unprinting character problem is important.  Would it be
reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output,
ie ls -b), or is there some reasonable use of them in filenames?

Combining characters at the beginning of a filename probably shouldn't be
output literally, either.

On Thu, Feb 21, 2002 at 03:33:40PM +, Markus Kuhn wrote:
  One thing that's bound to be lost in the transition to UTF-8 filenames:
  the ability to reference any file on the filesystem with a pure CLI.
 
 I can generate plenty of file names with ISO 8859-1 that you will have
 troubles typing in. Try a file name that starts with CR or NBSP just to
 warm up. Nothing new with UTF-8 here. Keep it simple.

02:01pm [EMAIL PROTECTED]/5 [~/testing] touch "
dquote> hello"
02:01pm [EMAIL PROTECTED]/5 [~/testing] ls
\nhello

ls escapes the control character.  If I'm not in escape mode, it outputs
a question mark; it never outputs it literally.  It doesn't do this for
Unicode unprinting characters.

(NBSP isn't a problem here, since it can be copy-and-pasted.)

 Just like with the file £¤¥¦§¨©ª« I guess. Has that been a problem
 in practice so far?

That can still be copy-and-pasted; the control character examples can not.
Overly combined characters probably couldn't, either.

 We agreed already ages ago here that Normalization Form C should be
 considered to be recommended practice under Linux and on the Web. But

Then we're in agreement.

 nothing should prevent you in the future from using arbitrary opaque
 byte strings as POSIX file names. In particular, POSIX forbids that the
 file system applies any sort of normalization automatically. All the URL
 security issues that IIS on NTFS had demonstrates, what a wise decision
 that was.

 Please do not even think about automatically normalizing file names
 anywhere. There is absolutely no need for introducing such nonsense, and
 deviating from the POSIX requirement that filenames be opaque byte
 strings is a Bad Idea[TM] (also known as NTFS).

Nobody's disagreeing on any of this.

 No, it won't. Unicode normalization will not eliminate homoglyphs and
 can't possibly. You try to apply the wrong tool to the wrong problem.
 Again nothing new here. We have lived happily for over a decade with the
 homoglyphs SP and NBSP in ISO 8859-1 in POSIX file systems. Security
 problems have arousen in file systems that attempted to do case
 invariant matching and other forms of normalization and now we know that
 that was a bad idea (see the web attack log I posted here 2002-02-14
 as one example).

(this has been said already)

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: brocken bar and UCS keyboard

2002-02-21 Thread Keld Jørn Simonsen

On Thu, Feb 21, 2002 at 09:54:24PM +, Markus Kuhn wrote:
 Keld Simonsen wrote:
   How do we fix this in the keyboard standards and how do we get the fix
   onto the market? Any suggestions?
  
  It is really hard to get something done. What we can do is something
  with X. Getting the physical layout changed is much harder, unless you
  want to split the keyboard, take off the keys and rearrange them.
  Could be done. Costs some money. But you can do it on a small
  scale and then try to pull it off in the big. But think
  of Dvorak keyboards; they never took off.
 
 Try to think of the Windows keys on the other hand ...

Yes, but we are not Microsoft. Anyway, we could come close to
that position. But is it not a lot more that you want than
what Microsoft, one of the biggest and most powerful companies
in our business, could accomplish? What do you have in mind?

Or maybe some point-and-click input method is what we want for
inputting 10646?

  I have tried to persuade
  Cherry to introduce some plug-and-play identification so the
  keyboard could identify itself when asked, but without luck yet.
  Everything else nowadays identifies itself on a system.
 
 I'm typing this on a USB keyboard, which identifies its layout (well,
 actually more a sort of keyboard-specific country code, nothing really
 well-engineered; complaints to [EMAIL PROTECTED]) to the operating
 system.

Is this a general feature for all USB keyboards?
Is this something we are employing for X?
A kind of kbdsuperprobe?

Kind regards
keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Keld Jørn Simonsen

On Thu, Feb 21, 2002 at 05:36:44PM -0500, Glenn Maynard wrote:
 By the way, to all of the people in this thread discussing input of other-language
 text: I was showing a loss from ASCII--you can't type all filenames
 because some of them will have characters you can't necessarily type.
 This was a minor point, since (as I've said) it can't really be fixed.
 
 (Well, it could be fixed, but not cleanly.)

I think the compose way is pretty clean.
A point-and-click method would also be clean,
and the 9995 UCS method is pretty clean too, or what?

 (NBSP isn't a problem here, since it can be copy-and-pasted.)

or just typed in as alt-gr-space
 
  Just like with the file £¤¥¦§¨©ª« I guess. Has that been a problem
  in practice so far?
 
 That can still be copy-and-pasted; the control character examples can not.
 Overly combined characters probably couldn't, either.

I have typed in most control characters with
ctrl-v ctrl-<letter in question> - no big deal.

Kind regards
keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Thoughts on keyboard layout input

2002-02-21 Thread Keld Jørn Simonsen

I see some requirements on X in Radovans posting:

We need some general assignments of control keys across the
different keyboards, such as what is meta on a 101 keyboard, 104,
105. And is it doable? I think it is with current X architecture.
Are the keys bound in the standard configuration? probably not.

How does MS do it? (I seldom use their OSes.)
If we really want Linux and X to be a major OS, I think
there is no need to invent things for changing between
windows if MS already has a convenient way of doing it.
Or maybe this is already taken care of in X window managers
such as sawmill.

But I would really like if X could have defaults for the
standard keyboards, capable of generating 10646.

Keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: NFS4 requires UTF-8

2002-02-21 Thread Pablo Saratxaga

Kaixo!

On Thu, Feb 21, 2002 at 05:36:23PM -0500, Glenn Maynard wrote:
 
 OTOH, the unprinting character problem is important.  Would it be
 reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output,
 ie ls -b), or is there some reasonable use of them in filenames?

There are reasonable uses of zwj and zwnj and similar; they are needed
for proper writing in some languages.

In fact, all the trouble comes from the xterm, not from ls.

I would say that ls should not escape them, only invalid utf-8 and
control chars.

Then, another command-line switch should be added to escape all but
printable ascii.

More complex options are not to be done on the command line in an xterm;
a graphical toolkit is more suited for that.
The reason is that with ls/xterm the rendering and the tool handling the
filenames are dissociated, so you cannot easily do interesting things.
You can, however, on an open or save etc dialog box have a way to
set the properties of the text box that shows the file name, and have
it display as normal, display zero-width chars (in a better way than
the ugly \ notation, like squares with the hexa value or mnemonic, as
in the yudit editor), or a mode to dis-shape (useful to see the difference
between precomposed or not letters, and the ambiguous ones with several
composing chars, as could happen in vietnamese or thai, etc).

So, the only change worth making for the use of UTF-8 in filenames
would be an extra switch to ls to quote everything but ASCII, and to
ensure it quotes incorrect UTF-8 when the locale is in UTF-8 mode.
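A minimal sketch of those two modes (the default, which quotes only
invalid UTF-8, and the extra "everything but ASCII" switch), written in
Python for brevity; the function names are made up, and handling of
ASCII control characters is omitted:

```python
def escape_invalid_utf8(raw: bytes) -> str:
    """Default mode: pass valid UTF-8 through untouched, but turn
    bytes that are not valid UTF-8 into \\xHH escapes."""
    return raw.decode("utf-8", errors="backslashreplace")

def escape_non_ascii(raw: bytes) -> str:
    """The proposed extra switch: quote everything outside ASCII,
    whether it was valid UTF-8 or not."""
    return (escape_invalid_utf8(raw)
            .encode("ascii", errors="backslashreplace")
            .decode("ascii"))
```

So a Latin-1 filename like b"caf\xe9" gets its stray byte escaped in
both modes, while a valid UTF-8 name survives the default mode intact.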

As for the special viewing modes in graphical toolkits, that is a
general-purpose feature, useful for all widgets dealing with text
display (and for use by power users; but that is also the case for the
bizarre filenames we are talking about. The average user will never be
faced with those strange cases, and if it happens some day he will
just turn to the man or woman he usually turns to for problems of
similar complexity).


-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/
PGP Key available, key ID: 0x8F0E4975





Re: NFS4 requires UTF-8

2002-02-21 Thread Glenn Maynard
On Fri, Feb 22, 2002 at 12:55:31AM +0100, Pablo Saratxaga wrote:
  OTOH, the unprinting character problem is important.  Would it be
  reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output,
  ie ls -b), or is there some reasonable use of them in filenames?
 
 There are reasonable uses of ZWJ and ZWNJ and similar characters; they
 are needed for proper writing in some languages.
 
 In fact, all the trouble comes from the xterm, not from "ls".

If a filename is a BOM followed by "hello", how can I enter it?  I
don't expect my terminal emulator to remember all control characters
sent at any cursor position and paste them along with other characters,
so I'd end up pasting "hello" alone.  It's worse when the filename is
*only* unprinting characters, and there's nothing on screen to copy at
all.  (That's just plain confusing, too.)
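For what it's worth, bash's $'...' quoting already provides one escape
hatch here, assuming you know (or can guess) which invisible character
you are dealing with; a sketch assuming bash and GNU coreutils:

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Create a file named U+FEFF (BOM) + "hello" by spelling out the
# three UTF-8 bytes of the BOM directly:
touch $'\xef\xbb\xbf'hello

# GNU "ls -b" escapes nongraphic bytes (what counts as nongraphic is
# locale-dependent), so the BOM at least becomes visible:
ls -b

# The file can be removed the same way, with no copy-and-paste:
rm $'\xef\xbb\xbf'hello
```

Of course this only helps once you already know which invisible
character is there, which is exactly what ls's escaping is supposed to
tell you.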

We can't blame the terminal for not being able to copy and paste
arbitrary sequences of bytes.  It's not ls's "fault" either, per se (it's
inherent), but that doesn't mean it can't help.

 I would say that ls should not escape them, only invalid UTF-8 and
 control chars.
 
 Then, another command-line switch should be added to "escape all but
 printable ASCII".

Well, I'd like all nonprinting characters escaped, but not, say, 日本語.
That means I can copy and paste the filename, and characters that *can*
be copied and pasted aren't escaped.  (but see below)
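To make that concrete, here is a rough sketch of the policy in Python,
using the Unicode general categories Cc (control) and Cf (format) as a
stand-in for the wcwidth(c)==0 test mentioned earlier; the helper name
is hypothetical, and a real ls would do this in C:

```python
import unicodedata

def escape_nonprinting(name: str) -> str:
    """Escape control (Cc) and format (Cf) characters, such as ZWJ,
    ZWNJ, BOM, and language tags, as \\uXXXX / \\UXXXXXXXX, while
    leaving printable characters, ASCII or not, alone."""
    out = []
    for ch in name:
        if unicodedata.category(ch) in ("Cc", "Cf"):
            cp = ord(ch)
            out.append("\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

Note this is only an approximation: wcwidth()==0 would also catch
combining marks, which Cc/Cf deliberately leaves untouched.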

 more complex options are not to be done in the command line on an xterm,
 a graphical toolkit is more suited for that.

It's acceptable to go from "able to type all filenames with the
keyboard" to "need to copy and paste filenames which I can't type
directly".  That's reasonable (if only because it's unavoidable).  (As
has been pointed out, it's already there in ISO-8859-1.)

It's not acceptable to have filenames that I can't access from a CLI
(with C+P) reliably at all (or that I need to switch to a special ls mode
that escapes *everything* over ASCII to access.)  Wildcards are a useful
fallback, but they don't stand alone--it still wouldn't help me target a
file consisting only of control characters, for example.  Telling me to
"use a GUI" is simply no good.  (I'm not installing X on a 486 running
FTP to delete a file someone dumped in my /incoming.)

Files are an extremely fundamental part of a Unix system, and all fundamental
parts of Unix are accessible from a CLI.  That's always been one of its
greatest strengths, and we can't throw that away for filenames.  This is
why GNU ls supports escaping.

 the reason is that with ls/xterm the rendering and the tool handling the
 filenames are dissociated, so you cannot easily do interesting things,

ls supports escaping that matches bash's (\ooo, \xHH, \n, etc.). If
that escape syntax is extended to include \u and \U, then ls can be
extended to display nonprinting characters in that form (optionally,
for the sake of compatibility).

(I think that extension is useful, whether or not ls uses it.)

Just because the tools aren't maintained by the same person doesn't mean
there can't be cooperation.  (Though, considering how difficult it's
proving to be to get UTF-8 support at all in bash, I don't expect *all*
shells to support this.)

This doesn't involve xterm (or any terminal) at all, just the shell and
tools.

 So, the only change worth making for the use of UTF-8 in filenames
 would be an extra switch to ls to quote everything but ASCII, and to
 ensure it quotes incorrect UTF-8 when the locale is in UTF-8 mode.

I disagree; I think it's interesting, useful and practical to escape
certain other cases.  Leading combining characters, probably, and any
characters not useful in filenames.  (Of course, it's not necessarily
easy to determine what's useful.  I don't see BIDI support in filenames
as useful--that seems to be a property of whatever text is displaying
the filenames, not the filename themselves--but I'm not a BIDI user, so
I can only guess.)

I'm unclear on how control characters that change state behave in
filenames at all.  To pick a simple example, what if a filename contains
the language code "zh"?  I can no longer do a simple C program that
outputs "The first file is %s.  The second file is %s. [...]" as the
text after the first %s is marked Chinese.  (This probably won't break
anything, but other control characters probably would.)  Invalidate all
state after outputting a filename?  Complicated.  (I don't know what zwj
and zwnj do; perhaps a more practical example could be made with them.)
Anyone feel like filling me in here?

This would be like embedding ANSI color sequences in filenames and ls
letting them through: the color would bleed onto the next line unless
ls knew to reset the color after each filename.
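That failure mode is easy to reproduce; a sketch assuming bash and
GNU ls:

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# A filename beginning with the ANSI "red" escape sequence ESC [ 3 1 m:
touch $'\e[31m'trouble

# Printed raw (e.g. by a naive script), the ESC byte reaches the
# terminal and everything after it turns red.  "ls -b" instead shows
# the control byte as an octal escape, keeping the terminal sane:
ls -b

rm $'\e[31m'trouble
```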

-- 
Glenn Maynard


Re: brocken bar and UCS keyboard

2002-02-21 Thread Henry Spencer

On Thu, 21 Feb 2002, David Starner wrote:
 Software being too smart is usually a pain, unless they've got the
 read-my-mind code working right. Especially here - how do you
 distinguish between the hyphen, the em-dash, the minus and the soft
 hyphen? Any sort of software-smarts is going to have to be heavily
 backed up by user-smarts.

No question there, but I think you have missed my point.  The most crucial
step is simply to get people to realize that there is more than one symbol
involved and that the choice matters.  So long as hitting the - key always
gets them hyphen, that's not going to happen.  Having them grumble that
the stupid software keeps picking the wrong one would be an *IMPROVEMENT*. 

 There is a step between shift-alt-meta and printed on the keycaps. An
 English (non-programmers) keyboard could be designed and distributed
 in software. It's not impossible that Microsoft could support such a
 thing and keyboard manufacturers start making the things, meaning the
 next generation actually reliably gets it right.

You're still dodging the crucial problem, which is getting people to
change their touch-typing habits to actually *use* the new symbols.

  Henry Spencer
   [EMAIL PROTECTED]





Re: brocken bar and UCS keyboard

2002-02-21 Thread Glenn Maynard

On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote:
 No question there, but I think you have missed my point.  The most crucial
 step is simply to get people to realize that there is more than one symbol
 involved and that the choice matters.  So long as hitting the - key always
 gets them hyphen, that's not going to happen.  Having them grumble that
 the stupid software keeps picking the wrong one would be an *IMPROVEMENT*. 

When they're visibly very similar, do you think most users are going to
use them right, no matter how accessible they are?  Hyphen and dash are
distinct (most people who use dashes also know that you need two hyphens
to act as a dash, not one), but a single hyphen looks reasonable as a
minus sign in most fonts.  A real minus sign usually looks better, but
I doubt most people will care enough to want to learn the difference
between *four* different characters on their keyboard that generate a
horizontal line--hyphen, dash, minus and underscore.

If they won't do that, they won't even consider changing their typing
habits.

Would you add separate open double quote, close double quote,
open single quote, close single quote, neutral single and double quotes,
apostrophe and backtick keys, too?  They're all useful, but
that's one heck of a keyboard.  :)

-- 
Glenn Maynard




Re: brocken bar and UCS keyboard

2002-02-21 Thread David Starner

On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote:
  There is a step between shift-alt-meta and printed on the keycaps. An
  English (non-programmers) keyboard could be designed and distributed
  in software. It's not impossible that Microsoft could support such a
  thing and keyboard manufacturers start making the things, meaning the
  next generation actually reliably gets it right.
 
 You're still dodging the crucial problem, which is getting people to
 change their touch-typing habits to actually *use* the new symbols.

Why is that crucial? You can lead a horse to water, but you can't make
it drink. People will use whatever orthographies they want. Make it
reasonable and feasible for people to do the right thing, and let time
and social pressure move them in the right direction. Look at where
we've gone on the whole `quote' issue. Hopefully in another 10 years,
a lot of people will be using curved quotes - another thing it's bloody
impossible to get from the keyboard.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: brocken bar and UCS keyboard

2002-02-21 Thread David Starner

On Thu, Feb 21, 2002 at 10:09:20PM -0500, Glenn Maynard wrote:
 Would you add separate open double quote, close double quote,
 open single quote, close single quote, neutral single and double quotes,
 apostrophe and backtick keys, too?  They're all useful, but
 that's one heck of a keyboard.  :)

No. I'd get rid of the neutral quotes, the apostrophe and backtick. I
don't know about everyone else, but I could live with switching between
a programmer's/Unix keyboard, with #'`~^*_\/| on it and one that has,
say, curved quotes, Euro, dead keys for French and German, and daggers.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, Peace and Love, Inc.




Re: brocken bar and UCS keyboard

2002-02-21 Thread Henry Spencer

On Thu, 21 Feb 2002, Glenn Maynard wrote:
  ...Having them grumble that
  the stupid software keeps picking the wrong one would be an *IMPROVEMENT*. 
 
 When they're visibly very similar, do you think most users are going to
 use them right, no matter how accessible they are?

Possibly not.  But teaching people to make this distinction was exactly
what was originally asked for, at the start of this branch of the
discussion.  The issue *wasn't* how a handful of cognoscenti could more
easily type the symbols in question. 

I think there is some small hope that proper usage could *eventually*
become a well-known sign of careful composition, in the same way that
proper use of uppercase and lowercase letters is now.  Note that I say
some small hope, not a near certainty.  But I do not think there is
any chance at all if people see only hyphens in their output; that
encourages them to believe that there is no distinction to be made, that
hyphen is proper for all purposes. 

  Henry Spencer
   [EMAIL PROTECTED]
