Re: NFS4 requires UTF-8
On Thu, 21 Feb 2002, Glenn Maynard wrote:

> On Thu, Feb 21, 2002 at 01:26:33PM +0900, Gaspar Sinai wrote:
> > I just browsed through RFC 3010 and found one thing that bothers me
> > and that has not been discussed yet (I think). The RFC says:
> >
> >   The NFS version 4 protocol does not mandate the use of a particular
> >   normalization form at this time.
> >
> > How do we mount something that contains a precomposed character like
> > U+00E1 (composed of U+0061 and U+0301)? If U+0061 U+0301 is used and
> > our server is assuming U+00E1, can a malicious hacker set up another
> > NFS server that has U+0061 and U+0301 and get us to mount his NFS
> > volume? I could even imagine very tricky combinations with Vietnamese
> > text, but that would be another question... Forgive my ignorance if
> > this was discussed - I did not see it in the archives.
>
> One thing that's bound to be lost in the transition to UTF-8 filenames:
> the ability to reference any file on the filesystem with a pure CLI. If
> I see a file with a pi symbol in it, I simply can't type that; I have to
> copy and paste it or wildcard it. If I have a filename that is all
> kanji, I can only use wildcards.
>
> A normalization form would help a lot, though. It'd guarantee that in
> all cases where I *do* know how to enter a character in a filename, I
> can always manipulate the file. (If I see "cár", I'd be able to
> "cat cár" and see it, reliably.) I don't know who would actually
> normalize filenames, though--a shell can't just normalize all args (not
> all args are filenames), and doing it in all tools would be unreliable.
>
> A mandatory normalization form would also eliminate visibly duplicate
> filenames. Of course, it can't be enforced, but tools that escape
> filenames for output could change unnormalized text to \u/\U.
>
> I don't quite understand the scenario you're trying to describe, though.

What I was thinking is this: an NFS server may export something that is meant to be the same but that, for lack of a mandatory normalization form, differs from what the client tries to mount.

Is it possible for someone to use the same machine and export a different volume with the same name the client expects? It may be a different question, but can the machine name be played with? Can this have an effect on the name of the machine itself, or only on directories and filenames?

Thank you,
gaspar
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
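[Editor's note] The equivalence problem Gaspar raises can be made concrete with a short sketch. Python is used here purely for illustration; its standard `unicodedata` module implements the Unicode normalization forms, and any NFC implementation behaves the same way:

```python
import unicodedata

# Two spellings of the same visible name "cár":
precomposed = "c\u00e1r"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "ca\u0301r"   # 'a' + U+0301 COMBINING ACUTE ACCENT

# As opaque byte strings (i.e. as POSIX filenames) they are different,
# so a server storing one and a client sending the other will not match:
assert precomposed != decomposed
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# An agreed normalization form (NFC here) makes them identical:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFC", precomposed) == precomposed
```

Without an agreed form, both names can exist side by side in one directory, visually indistinguishable.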
Re: NFS4 requires UTF-8
Kaixo!

On Thu, Feb 21, 2002 at 03:10:32AM -0500, Glenn Maynard wrote:

> One thing that's bound to be lost in the transition to UTF-8 filenames:
> the ability to reference any file on the filesystem with a pure CLI. If
> I see a file with a pi symbol in it, I simply can't type that; I have to
> copy and paste it or wildcard it. If I have a filename that is all
> kanji, I can only use wildcards.

Well, it won't happen often that you have to manipulate files with names including characters you cannot type. Usually you manage your own files, and it is you who typed their filenames. Kanji or a pi letter can very well be typed in a CLI environment; well, using a Japanese XIM and a Greek keyboard respectively. It isn't that much of a problem.

> A normalization form would help a lot, though.

That, however, is indeed a problem. A problem similar to case-insensitivity in Windows, where you could, at least with old versions, load a file named one way and save it another way; if you were using a case-sensitive fs (eg a fs on a unix mounted by SMB on the Windows machine), you ended up with different files and a real mess. The same thing could happen here; well, not as bad, as I don't think any program will purposely *change* the chars composing a previously selected filename (eg when doing open then save there wouldn't be any name change); but when a user types a filename manually, it could happen that the system tells him there is no such filename and he is puzzled, as he sees there is - there is no visual difference between a precomposed character like aacute and the two characters a and combining acute accent.

This reminds me of a discussion in pango about the ability to have different view and edit modes: normal (with text showing as expected), and another mode where composing chars are de-composed and invisible control characters (such as zwj, etc) are made visible.

> I don't know who would actually normalize filenames, though--a shell
> can't just normalize all args (not all args are filenames), and doing it
> in all tools would be unreliable.

The normalization should be done at the input method layer; that way it will be transparent and hopefully, if all OSes do the same, the potential problem of duplicates will never happen.

-- 
Ki ça vos våye bén,
Pablo Saratxaga
http://www.srtxg.easynet.be/  PGP Key available, key ID: 0x8F0E4975
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:08:24AM +0100, Radovan Garabik wrote:

> > One thing that's bound to be lost in the transition to UTF-8
> > filenames: the ability to reference any file on the filesystem with a
> > pure CLI. If I see a file with a pi symbol in it, I simply can't type
> > that; I have to copy and paste it or wildcard it. If I have a filename
> > that is all kanji, I can only use wildcards.

(Er, I meant copy and paste for the last; wildcards aren't useful for selecting a filename where you can't enter *any* of the characters, unless the length is unique.)

> sorry, but that is just plain impossible. For one thing, the c can quite
> well be U+0441, CYRILLIC SMALL LETTER ES, ditto for other letters. But I
> agree that normalization can save us a lot of headache.

Normalization would catch the cases where it's impossible to tell from context what it's likely to be.

> Input method should produce normalized characters. Since most filenames
> are somehow produced via human operation, it would catch most of the
> pathological cases.

Not just at the input method. I'm in Windows; my input method produces wide characters, which my terminal emulator catches and converts to UTF-8, so my terminal would need to follow the same normalization as input methods in X. Terminal compose keys and real keybindings (actual non-English keyboards) are other things an IM isn't involved in; terminals and GUI apps (or at least widget sets) would need to handle it directly.

-- 
Glenn Maynard
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote:

> I'm not even convinced that it's a good idea to force file names to be
> in UTF-8. Perhaps it would be simpler and more robust to let file names
> be any null-terminated string of octets and just recommend that people
> use (some normalisation form of) UTF-8. That way you won't have the
> problem of some files (with ill-formed names) being visible locally but
> not remotely because the server or the client is either blocking the
> names or normalising them in some weird and unexpected way.

Certainly, this kind of normalization is evil and should be avoided. The normalization I am thinking about should ensure the filenames are stored on the server in as sane a way as possible. Once the filename is written to the fs, it should remain there and transparently, _without any change_, be exported to clients (be it just a program doing open() or a remote network client). It could be changed via a mount option, like the current linux NLS implementation, but in no other way.

-- 
Radovan Garabik  http://melkor.dnp.fmph.uniba.sk/~garabik/
garabik @ melkor.dnp.fmph.uniba.sk
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:59:14AM +0100, Pablo Saratxaga wrote:

> It isn't that much of a problem.

I think it's not a completely trivial loss, compared to an ASCII environment where filenames were completely unambiguous (invalid characters being escaped). There doesn't seem to be any obvious fix, so I suppose it's just a price paid.

> The same thing could happen here; well, not as bad, as I don't think any
> program will purposely *change* the chars composing a previously
> selected filename (eg when doing open then save there wouldn't be any
> name change);

If a program wants to operate on a normalized form internally, it might, but that's probably asking for trouble anyway.

> but when a user types a filename manually, it could happen that the
> system tells him there is no such filename and he is puzzled, as he sees
> there is - there is no visual difference between a precomposed character
> like aacute and the two characters a and combining acute accent.

Should control characters ever end up in filenames? I'd be surprised if many terminal emulators handled copy and paste with control characters well, if at all. (They don't need to be drawn, so I'd expect most that don't use them would just discard them.)

  06:29am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBF`;'
  06:29am [EMAIL PROTECTED]/2 [~/testing] ls
  06:29am [EMAIL PROTECTED]/2 [~/testing] ls -l
  total 0
  -rw-r--r--    1 glenn    users    0 Feb 21 06:29 (rm)
  06:31am [EMAIL PROTECTED]/2 [~/testing] perl -e '`touch \xEF\xBB\xBFfile`;'
  06:31am [EMAIL PROTECTED]/2 [~/testing] ls
  file
  06:31am [EMAIL PROTECTED]/2 [~/testing] cat file
  cat: file: No such file or directory

I can't copy and paste it. Wildcards wouldn't help much if I'd stuck BOMs between letters (and *f*i*l*e* isn't very obvious, especially if you don't know what's going on, or if one letter isn't really the letter it looks like), and tab completion may or may not help, depending on the shell. (Someone mentioned moving everything out of the directory and rm -f'ing; I should never have to do that.)

Are control characters (and all non-printing characters) useful in filenames at all? If not, they should be escaped, too, to avoid this kind of problem. (Another one, perhaps: a character with a ton of combining characters on top of it. Most terminal emulators won't deal with an arbitrary number of them.)

> This reminds me of a discussion in pango about the ability to have
> different view and edit modes: normal (with text showing as expected),
> and another mode where composing chars are de-composed and invisible
> control characters (such as zwj, etc) are made visible.

Reveal codes for filenames? :)

> > I don't know who would actually normalize filenames, though--a shell
> > can't just normalize all args (not all args are filenames), and doing
> > it in all tools would be unreliable.
>
> The normalization should be done at the input method layer; that way it
> will be transparent and hopefully, if all OSes do the same, the
> potential problem of duplicates will never happen.

See my other response: characters are often entered in ways other than a nice modularized input method; terminal emulators will need to behave in the same way as IMs for this to work, as well as GUIs at some layer.

-- 
Glenn Maynard
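[Editor's note] Glenn's perl experiment can be reproduced in a few lines. This is a sketch in Python (the directory and names are made up for the demonstration):

```python
import os
import tempfile

# Create a file whose name starts with U+FEFF (the UTF-8 BOM,
# a zero-width character) followed by "file":
d = tempfile.mkdtemp()
bom_name = "\ufefffile"
open(os.path.join(d, bom_name), "w").close()

# The directory listing contains the BOM-prefixed name...
assert bom_name in os.listdir(d)

# ...but the name you *see* on screen ("file") does not exist, which is
# exactly the "cat: file: No such file or directory" failure above:
assert not os.path.exists(os.path.join(d, "file"))
```

The invisible character survives in the directory entry but not in what the terminal lets you copy, so the name on screen and the name on disk silently diverge.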
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 11:23:20AM +, Edmund GRIMLEY EVANS wrote:

> People are advocating normalisation as a solution for various kinds of
> file name confusion, but I can imagine normalisation making things
> worse. For example, file names with a trailing space can certainly be
> confusing, but would life be any simpler if some programmer decided to
> strip trailing white space at some point in the processing of a file
> name? I don't think so. You would then potentially have files that are
> not just hard to delete, but impossible to delete.

If I have two computers, one sending precomposed characters and one not, I can't access my "câr" file created on one from the other. If terminal emulators, IMs, etc. send normalized characters, this isn't a problem. (It doesn't fix all problems, but it would help fix some of the major ones.) Then, if ls displays a filename that doesn't fit the normalization form expected in filenames, it can display it in a way that shows what it really is ("c\u00E2r"). (Optionally, of course.) This is less useful for the other unavoidable glyph ambiguities, though. cat certainly shouldn't normalize its arguments.

> I'm not even convinced that it's a good idea to force file names to be
> in UTF-8. Perhaps it would be simpler and more robust to let file names
> be any null-terminated string of octets and just recommend that people
> use (some normalisation form of) UTF-8. That way you won't have the
> problem of some files (with ill-formed names) being visible locally but
> not remotely because the server or the client is either blocking the
> names or normalising them in some weird and unexpected way.

I'm not suggesting NFS normalize anything; this is just as important on a single system being accessed from multiple terminals. Sorry, the switch from NFS to filenames in general wasn't clear.

> What's so bad about just being 8-bit clean?

Oh, network protocols *should* be 8-bit clean for filenames (minus NUL). If I have a remote file with an invalid filename (an overlong UTF-8 sequence, or just plain garbage), I'd better be able to access it over NFS. I don't think the FS (NFS, local filesystem, FTP, whatever) should touch filenames at all. (Mandating that they be UTF-8 in the standard is a good thing; enforcing it at the FS layer is not.)

Related: I frequently can't touch filenames with non-English characters over Samba, nor filenames with characters Windows bans from filenames. Windows displays them as some random-looking series of characters, and it doesn't always map back correctly. This doesn't really have anything to do with the network protocol--though the actual implementation problem might be in there--it's that it doesn't deal with invalid filenames properly.

-- 
Glenn Maynard
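[Editor's note] The "opaque byte strings, with UTF-8 recommended" position Glenn and Edmund converge on is, as an aside, what later systems adopted. A sketch of the idea using modern Python's surrogateescape convention (an illustration only; nothing like this existed in 2002):

```python
import os

# A filename that is NOT valid UTF-8 -- plain garbage bytes:
raw = b"report\xff\xfe.txt"

# The bytes can still be carried through a text API unharmed: invalid
# bytes are smuggled as lone surrogates and restored on re-encoding,
# so nothing is blocked or rewritten along the way.
name = raw.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "surrogateescape") == raw

# os.fsdecode()/os.fsencode() apply the same convention to filenames,
# so the filesystem layer never has to reject or normalize a name:
assert os.fsencode(os.fsdecode(raw)) == raw
```

The design choice matches the thread's conclusion: the FS layer passes names through untouched, and any prettifying (escaping, normalized display) happens in the tools that show them to humans.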
Re: NFS4 requires UTF-8
Kaixo!

On Thu, Feb 21, 2002 at 06:50:27AM -0500, Glenn Maynard wrote:

> I think it's not a completely trivial loss, compared to an ASCII
> environment where filenames were completely unambiguous

I don't know; I have never used an ASCII environment; I need at the very least iso-8859-1 :)

> Should control characters ever end up in filenames? I'd be surprised if
> many terminal emulators handled copy and paste with control characters
> well, if at all.

Well, it sometimes happens to me that I hit Ctrl-V by accident, then another key, and end up with a filename containing escape and other ctrl sequences.

> See my other response: characters are often entered in ways other than a
> nice modularized input method; terminal emulators will need to behave in
> the same way as IMs for this to work, as well as GUIs at some layer.

I consider as an input method, too, whatever code allows one to type dead keys and get accents, the compose key, etc. A terminal emulator doesn't need to do anything; terminal emulators don't handle input themselves (a real terminal does, but a terminal emulator is just another window on the screen, like any other program, from the input perspective).

So, what should be addressed is an agreement on what input methods, keyboards, compose, etc should produce. IMHO it should be normalized in a known and predictable way, and if possible using the same normalization across systems and different operating systems, so that the same keystroke produces the same result.

-- 
Ki ça vos våye bén,
Pablo Saratxaga
http://www.srtxg.easynet.be/  PGP Key available, key ID: 0x8F0E4975
Re: NFS4 requires UTF-8
On Fri, Feb 22, 2002 at 02:24:31AM +0900, Tomohiro KUBOTA wrote:

> Hi,
>
> At Thu, 21 Feb 2002 17:36:57 +0100, Keld Jørn Simonsen wrote:
> > I can type ¦ and ð directly from the keyboard with my standard X
> > danish keyboard, just as easily as I can type @. Can't you? If this is
> > still a problem with some X keyboards, I would say that we should try
> > to enhance them. I did it for the Danish, Norwegian, Swedish and
> > Finnish X keyboards, and it should be done for others too.

Are Swedish and Finnish keyboards different? I thought they used the same layout (with the Finns giving up š and ž in favour of å).

> > I do not know the status right now, but maybe we could make an
> > overview of X keyboards in this respect.
>
> I (Japanese) cannot. Though I may be able to input them by some

Neither can I (for most ISO-8859-1 characters). I usually just hit the Compose key and some combination vaguely resembling the char, and hope for the best - often it takes several tries to get the correct one. I can enter Slovak characters easily, but I had to write my own xkb map (the standard one included in XFree86 was just unusable).

Btw, is it possible (with xkb) to do something like per-map dead-key compose? E.g. when I hit the dead key (dead_acute) with a vowel, I get the accented vowel correctly, but I want (e.g.) the combination dead_acute s to yield LATIN SMALL LETTER S WITH CARON, and similarly for other combinations. I know I can hack up my own compose map, but: 1) that would mess with other keyboard layouts; 2) I want to retain Compose s acute yielding LATIN SMALL LETTER S WITH ACUTE.

> settings, I don't know how. It is just as average European people don't
> know how to input kanji.

I would love to. Perhaps it would not be bad to write a compose map providing Compose k t a to yield KATAKANA LETTER TA, and Compose h t a HIRAGANA LETTER TA. Or something.

-- 
Radovan Garabik  http://melkor.dnp.fmph.uniba.sk/~garabik/
garabik @ melkor.dnp.fmph.uniba.sk
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Re: broken bar and UCS keyboard
On Thu, Feb 21, 2002 at 05:34:29PM +, Markus Kuhn wrote:

> Keld wrote on 2002-02-21 16:36 UTC:
> > I can type ¦ and ð directly from the keyboard with my standard X
> > danish keyboard
>
> I'm glad to hear that you are one of the ~12 people in Europe who know
> how to enter ¦ under XFree86 directly from the keyboard. [Even though
> the GB keyboard has an extra key location for ¦, it normally leads to
> the entry of |, because that is what 99.9997% of all people pressing
> this key actually wanted to enter (for shell pipe, C or, etc.).]

Well, I designed how to get there, so I should know. You are probably right that very few know. But it is on top of the | character, so it would be easy to guess.

> Perhaps you are even one of the 5 people in Europe who know what this
> character is good for and why it was needed in addition to |? (The
> standard excuse "EBCDIC compatibility" does not count here ... ;-)

I don't know either :-)

> If we update the keyboard mappings, please do not give any special
> priority to ISO 8859-1 characters. There are far more important
> characters in UCS than full ISO 8859-1 coverage.

Probably, but when I did it, the 8859 characters were the ones that were useful.

> In particular, very urgently missing on English keyboards is the EN
> DASH. I am fed up with seeing hyphen signs used everywhere as dashes. It
> hurts my typographic eye, and this abuse proves every day again that the
> historic keyboard layouts originally developed for monospaced
> ASCII/Latin-1 typewriters are utterly inadequate for contemporary word
> processing needs; the massive abuse of the hyphen as a dash and minus
> (for which there are no officially designated keys) is the most
> significant worry.

So where should it go? alt-minus?

> Something has to be done by the keyboard standards community urgently.
> The application and printing community fixed the problem long ago with
> the use of CP1252 and UCS, but users still have no clue how to enter a
> dash or minus sign on their keyboard, and even on platforms such as
> Win32, each application has its own conventions. Most national variants
> of ISO 9995 cover today only the repertoire of MES-1 (ISO 6937 plus the
> EURO SIGN), which lacks EN SPACE, EM SPACE, MINUS and other essential
> typographic characters. Nobody uses ISO 6937, and western keyboards
> really should cover the CP1252 subset of UCS properly, because that is
> what word processing files are encoded in today, and that reflects
> actual needs. How do we fix this in the keyboard standards, and how do
> we get the fix onto the market? Any suggestions?

It is really hard to get something done. What we can do is something with X. Getting the physical layout changed is much harder, unless you want to split the keyboard, take off the keys and rearrange them. It could be done; it costs some money. But you could do it on a small scale and then try to pull it off in the large. But think of Dvorak keyboards: they never took off.

I have tried to persuade Cherry to introduce some plug-and-play identification so the keyboard could identify itself when asked, but without luck yet. Everything else nowadays identifies itself on a system.

We can make EM SPACE happen in X, and EN SPACE. And MINUS. With current keyboards. As I almost exclusively run linux, that would make me happy.

Keld
Re: broken bar and UCS keyboard
Keld Simonsen wrote:

> > How do we fix this in the keyboard standards, and how do we get the
> > fix onto the market? Any suggestions?
>
> It is really hard to get something done. What we can do is something
> with X. Getting the physical layout changed is much harder, unless you
> want to split the keyboard, take off the keys and rearrange them. It
> could be done; it costs some money. But you could do it on a small scale
> and then try to pull it off in the large. But think of Dvorak keyboards:
> they never took off.

Try to think of the Windows keys, on the other hand ...

> I have tried to persuade Cherry to introduce some plug-and-play
> identification so the keyboard could identify itself when asked, but
> without luck yet. Everything else nowadays identifies itself on a
> system.

I'm typing this on a USB keyboard, which identifies its layout (well, actually more a sort of keyboard-specific country code, nothing really well-engineered; complaints to [EMAIL PROTECTED]) to the operating system.

http://www.usb.org/developers/docs.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: http://www.cl.cam.ac.uk/~mgk25/
Re: NFS4 requires UTF-8
By the way, to all of the people in the thread on inputting other-language text: I was pointing out a loss relative to ASCII--you can't type all filenames, because some of them will have characters you can't necessarily type. This was a minor point, since (as I've said) it can't really be fixed. (Well, it could be fixed, but not cleanly.)

OTOH, the non-printing character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? Combining characters at the beginning of a filename probably shouldn't be output literally, either.

On Thu, Feb 21, 2002 at 03:33:40PM +, Markus Kuhn wrote:

> > One thing that's bound to be lost in the transition to UTF-8
> > filenames: the ability to reference any file on the filesystem with a
> > pure CLI.
>
> I can generate plenty of file names with ISO 8859-1 that you will have
> trouble typing in. Try a file name that starts with CR or NBSP, just to
> warm up. Nothing new with UTF-8 here. Keep it simple.

  02:01pm [EMAIL PROTECTED]/5 [~/testing] touch dquote hello
  02:01pm [EMAIL PROTECTED]/5 [~/testing] ls
  \nhello

ls escapes the control character. If I'm not in escape mode, it outputs a question mark; it never outputs it literally. It doesn't do this for Unicode non-printing characters. (NBSP isn't a problem here, since it can be copied and pasted.)

> Just like with the file £¤¥¦§¨©ª«, I guess. Has that been a problem in
> practice so far?

That can still be copied and pasted; the control-character examples can not. Overly combined characters probably couldn't, either.

> We agreed already ages ago here that Normalization Form C should be
> considered recommended practice under Linux and on the Web.

Then we're in agreement.

> But nothing should prevent you in the future from using arbitrary opaque
> byte strings as POSIX file names. In particular, POSIX forbids that the
> file system apply any sort of normalization automatically. All the URL
> security issues that IIS on NTFS had demonstrate what a wise decision
> that was. Please do not even think about automatically normalizing file
> names anywhere. There is absolutely no need for introducing such
> nonsense, and deviating from the POSIX requirement that filenames be
> opaque byte strings is a Bad Idea[TM] (also known as NTFS).

Nobody's disagreeing with any of this.

> No, it won't. Unicode normalization will not eliminate homoglyphs and
> can't possibly. You try to apply the wrong tool to the wrong problem.
> Again, nothing new here. We have lived happily for over a decade with
> the homoglyphs SP and NBSP in ISO 8859-1 in POSIX file systems. Security
> problems have arisen in file systems that attempted case-invariant
> matching and other forms of normalization, and now we know that that was
> a bad idea (see the web attack log I posted here 2002-02-14 as one
> example).

(This has been said already.)

-- 
Glenn Maynard
Re: broken bar and UCS keyboard
On Thu, Feb 21, 2002 at 09:54:24PM +, Markus Kuhn wrote:

> Keld Simonsen wrote:
> > It is really hard to get something done. What we can do is something
> > with X. Getting the physical layout changed is much harder, unless you
> > want to split the keyboard, take off the keys and rearrange them. It
> > could be done; it costs some money. But you could do it on a small
> > scale and then try to pull it off in the large. But think of Dvorak
> > keyboards: they never took off.
>
> Try to think of the Windows keys, on the other hand ...

Yes, but we are not Microsoft. Anyway, we could come close to that position. But is it not a lot more that you want than what Microsoft, one of the biggest and most powerful companies in our business, could accomplish? What do you have in mind? Or maybe some point-and-click input is what we want for entering ISO 10646?

> I'm typing this on a USB keyboard, which identifies its layout (well,
> actually more a sort of keyboard-specific country code, nothing really
> well-engineered) to the operating system.

Is this a general feature of all USB keyboards? Is this something we are employing in X? A kind of kbd-superprobe?

Kind regards
keld
Re: NFS4 requires UTF-8
On Thu, Feb 21, 2002 at 05:36:44PM -0500, Glenn Maynard wrote:

> By the way, to all of the people in the thread on inputting
> other-language text: I was pointing out a loss relative to ASCII--you
> can't type all filenames, because some of them will have characters you
> can't necessarily type. This was a minor point, since (as I've said) it
> can't really be fixed. (Well, it could be fixed, but not cleanly.)

I think the compose way is pretty clean. A point-and-click method would also be clean, and the ISO 9995 UCS method is pretty clean too, or what?

> (NBSP isn't a problem here, since it can be copied and pasted.)

Or just typed in as altgr-space.

> > Just like with the file £¤¥¦§¨©ª«, I guess. Has that been a problem in
> > practice so far?
>
> That can still be copied and pasted; the control-character examples can
> not. Overly combined characters probably couldn't, either.

I have typed in most control characters with ctrl-v followed by ctrl-<the letter in question> - no big deal.

Kind regards
keld
Re: Thoughts on keyboard layout input
I see some requirements on X in Radovan's posting:

We need some general assignments of control keys across the different keyboards, such as what is Meta on a 101 keyboard, a 104, a 105. And is it doable? I think it is with the current X architecture. Are the keys bound in the standard configuration? Probably not.

How does MS do it? (I seldom use their OSes.) If we really want Linux and X to be a major OS, I think there is no need to invent things for changing between windows if MS already has a convenient way of doing it. Or maybe this is already taken care of in X window managers such as sawmill.

But I would really like X to have defaults for the standard keyboards, capable of generating ISO 10646.

Keld
Re: NFS4 requires UTF-8
Kaixo!

On Thu, Feb 21, 2002 at 05:36:23PM -0500, Glenn Maynard wrote:

> OTOH, the non-printing character problem is important. Would it be
> reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output,
> ie ls -b), or is there some reasonable use of them in filenames?

There are reasonable uses of zwj and zwnj and similar; they are needed for proper writing in some languages. In fact, all the trouble comes from the xterm, not from ls.

I would say that ls should not escape them, only invalid utf-8 and control chars. Then another command-line switch should be added to "escape all but printable ascii". More complex options are not to be done on the command line in an xterm; a graphical toolkit is better suited for that. The reason is that with ls/xterm the rendering and the tool handling the filenames are dissociated, so you cannot easily do interesting things. You can, however, in an open or save dialog box, have a way to set the properties of the text box that shows the file name, and have it display as normal; or display zero-width chars (in a better way than the ugly \ notation - like squares with the hex value or a mnemonic, as in the yudit editor); or a mode that un-shapes the text (useful to see the difference between a precomposed letter and a decomposed one, and the ambiguous cases with several composing chars, as could happen in Vietnamese or Thai, etc).

So the only interesting change worth doing for the use of utf-8 in filenames would be an extra switch to ls to quote everything but ascii, and to ensure it quotes incorrect utf-8 when the locale is in utf-8 mode. As for the special viewing modes in graphical toolkits, that is a general-purpose feature, useful for all widgets dealing with text display (and for use by power users; but that is also the case with the bizarre filenames we are talking about - the standard user will never be faced with those strange cases, and if it happens some day, he will just turn to the man or woman he usually turns to for similarly complex problems).

-- 
Ki ça vos våye bén,
Pablo Saratxaga
http://www.srtxg.easynet.be/  PGP Key available, key ID: 0x8F0E4975
Re: NFS4 requires UTF-8
On Fri, Feb 22, 2002 at 12:55:31AM +0100, Pablo Saratxaga wrote: OTOH, the unprinting character problem is important. Would it be reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output, ie ls -b), or is there some reasonable use of them in filenames? There are reasonable use of zwj and zwnj and similar, they are needed for proper writing in some languages. In fact, all the trouble comes from the xterm, not from "ls". If a filename is a BOM followed by "hello", how can I enter it? I don't expect my terminal emulator to remember all control characters sent at any cursor position and paste them along with other characters, so I'd end up pasting "hello" alone. It's worse when the filename is *only* unprinting characters, and there's nothing on screen to copy at all. (That's just plain confusing, too.) We can't blame the terminal for not being able to copy and paste arbitrary sequences of bytes. It's not ls's "fault" either, per se (it's inherent), but that doesn't mean it can't help. I would say that ls should not escape them, only invalid utf-8 and control chars. then, another command line switch should be added to "escape all but printable ascii". Well, I'd like all nonprinting characters escaped, but not, say, $BF|K\8l(B. That means I can copy and paste the filename, and characters that *can* be copied and pasted aren't escaped. (but see below) more complex options are not to be done in the command line on an xterm, a graphical toolkit is more suited for that. It's acceptable to go from "able to type all filenames with the keyboard" to "need to copy and paste filenames which I can't type directly". That's reasonable (if only because it's unavoidable). (As has been pointed out, it's already there in ISO-8859-1.) It's not acceptable to have filenames that I can't access from a CLI (with C+P) reliably at all (or that I need to switch to a special ls mode that escapes *everything* over ASCII to access.) 
Wildcards are a useful fallback, but they don't stand alone--they still wouldn't help me target a file consisting only of control characters, for example.

Telling me to "use a GUI" is simply no good. (I'm not installing X on a 486 running FTP to delete a file someone dumped in my /incoming.) Files are an extremely fundamental part of a Unix system, and all fundamental parts of Unix are accessible from a CLI. That's always been one of its greatest strengths, and we can't throw that away for filenames. This is why GNU ls supports escaping.

> The reason is that with ls/xterm the rendering and the tool handling
> the filenames are dissociated, so you cannot easily do interesting
> things.

ls supports escaping that matches bash's (\ooo, \xHH, \n, etc.). If this is extended to include \u and \U, then ls can be extended to allow (optionally, for the sake of compatibility) displaying escape characters, etc. in that form. (I think that extension is useful, whether or not ls uses it.) Just because the tools aren't maintained by the same person doesn't mean there can't be cooperation. (Though, considering how difficult it's proving to be to get UTF-8 support at all in bash, I don't expect *all* shells to support this.) This doesn't involve xterm (or any terminal) at all, just the shell and tools.

> So, the only interesting change that would be worth doing for the use
> of utf-8 in filenames will be an extra switch to ls to quote
> everything but ascii, and to ensure it quotes incorrect utf-8 when
> the locale is in utf-8 mode.

I disagree; I think it's interesting, useful and practical to escape certain other cases: leading combining characters, probably, and any characters not useful in filenames. (Of course, it's not necessarily easy to determine what's useful. I don't see BIDI support in filenames as useful--that seems to be a property of whatever text is displaying the filenames, not the filenames themselves--but I'm not a BIDI user, so I can only guess.)
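The proposed \u/\U extension to bash-compatible quoting can be sketched as follows. `shell_quote_utf8` is a hypothetical helper emitting bash's `$'...'` (ANSI-C quoting) form; bash did eventually grow exactly these \uXXXX/\UXXXXXXXX escapes, in version 4.2:

```python
def shell_quote_utf8(name: str) -> str:
    """Quote a filename in bash $'...' form, escaping everything
    outside printable ASCII as \\uXXXX or \\UXXXXXXXX so the result
    can be typed back into a shell that understands those escapes."""
    parts = []
    for ch in name:
        cp = ord(ch)
        if ch in ("'", '\\'):
            parts.append('\\' + ch)       # quote bash's own metacharacters
        elif 0x20 <= cp < 0x7F:
            parts.append(ch)              # printable ASCII: verbatim
        elif cp <= 0xFFFF:
            parts.append('\\u%04X' % cp)  # BMP character: 4-digit escape
        else:
            parts.append('\\U%08X' % cp)  # beyond the BMP: 8-digit escape
    return "$'" + ''.join(parts) + "'"
```

So the pi-filename example from earlier in the thread would be printed as `$'\u03C0.txt'`, which can be typed back in verbatim with no copy and paste at all.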
I'm unclear on how control characters that change state behave in filenames at all. To pick a simple example, what if a filename contains the language code "zh"? I can no longer write a simple C program that outputs "The first file is %s. The second file is %s. [...]", as the text after the first %s is marked Chinese. (This probably won't break anything, but other control characters probably would.) Invalidate all state after outputting a filename? Complicated. (I don't know what ZWJ and ZWNJ do; perhaps a more practical example could be made with them.) Anyone feel like filling me in here?

This would be like embedding ANSI color sequences in filenames and ls letting them through: the color would bleed onto the next line unless ls knew to reset the color after each filename.

-- 
Glenn Maynard
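The color-bleed comparison above is easy to demonstrate: a filename carrying a raw ESC[31m sequence turns all subsequent terminal output red if printed verbatim. A minimal sketch of the defensive option (`print_filename` is a hypothetical helper):

```python
# A filename containing a raw ANSI color escape (ESC [ 3 1 m = red).
name = '\x1b[31mimportant.txt'

# Printed verbatim -- print('The first file is %s.' % name) -- the
# escape reaches the terminal, and everything after it renders red
# until something emits a reset (ESC [ 0 m).

def print_filename(name: str) -> str:
    """Return the message with unprintable characters \\uXXXX-escaped,
    so no terminal state can leak out of the filename."""
    visible = ''.join(
        ch if ch.isprintable() else '\\u%04X' % ord(ch)
        for ch in name
    )
    return 'The first file is %s.' % visible
```

With the escape character neutralized, the remaining `[31m` is harmless printable text and the terminal's state never changes.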
Re: broken bar and UCS keyboard
On Thu, 21 Feb 2002, David Starner wrote:
> Software being too smart is usually a pain, unless they've got the
> read-my-mind code working right. Especially here - how do you
> distinguish between the hyphen, the em-dash, the minus and the soft
> hyphen? Any sort of software-smarts is going to have to be heavily
> backed up by user-smarts.

No question there, but I think you have missed my point. The most crucial step is simply to get people to realize that there is more than one symbol involved and that the choice matters. So long as hitting the - key always gets them hyphen, that's not going to happen. Having them grumble that the stupid software keeps picking the wrong one would be an *IMPROVEMENT*.

> There is a step between shift-alt-meta and printed on the keycaps. An
> English (non-programmers') keyboard could be designed and distributed
> in software. It's not impossible that Microsoft could support such a
> thing and keyboard manufacturers start making the things, meaning the
> next generation actually reliably gets it right.

You're still dodging the crucial problem, which is getting people to change their touch-typing habits to actually *use* the new symbols.

Henry Spencer
[EMAIL PROTECTED]
Re: broken bar and UCS keyboard
On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote:
> No question there, but I think you have missed my point. The most
> crucial step is simply to get people to realize that there is more
> than one symbol involved and that the choice matters. So long as
> hitting the - key always gets them hyphen, that's not going to
> happen. Having them grumble that the stupid software keeps picking
> the wrong one would be an *IMPROVEMENT*.

When they're visibly very similar, do you think most users are going to use them right, no matter how accessible they are? Hyphen and dash are distinct (most people who use dashes also know that you need two hyphens to act as a dash, not one), but a single hyphen looks reasonable as a minus sign in most fonts. A real minus sign usually looks better, but I doubt most people will care enough to want to learn the difference between *four* different characters on their keyboard that generate a horizontal line--hyphen, dash, minus and underscore. If they won't do that, they won't even consider changing their typing habits.

Would you add separate open double quote, close double quote, open single quote, close single quote, neutral single and double quote, apostrophe and backtick keys, too? They're all useful, but that's one heck of a keyboard. :)

-- 
Glenn Maynard
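For reference, the look-alike "horizontal line" symbols in this exchange really are distinct codepoints, which a quick check with Python's unicodedata module confirms:

```python
import unicodedata

# The look-alike "horizontal line" characters, plus underscore:
for ch in '\u002D\u2010\u2013\u2014\u2212\u005F':
    print('U+%04X  %s' % (ord(ch), unicodedata.name(ch)))

# U+002D  HYPHEN-MINUS   (the key on every keyboard)
# U+2010  HYPHEN
# U+2013  EN DASH
# U+2014  EM DASH
# U+2212  MINUS SIGN
# U+005F  LOW LINE       (underscore)
```

The quote characters mentioned are likewise all separate: U+2018/U+2019 (single) and U+201C/U+201D (double) curved quotes, versus the neutral ASCII U+0027 and U+0022.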
Re: broken bar and UCS keyboard
On Thu, Feb 21, 2002 at 09:49:01PM -0500, Henry Spencer wrote:
> > There is a step between shift-alt-meta and printed on the keycaps.
> > An English (non-programmers') keyboard could be designed and
> > distributed in software. It's not impossible that Microsoft could
> > support such a thing and keyboard manufacturers start making the
> > things, meaning the next generation actually reliably gets it
> > right.
> You're still dodging the crucial problem, which is getting people to
> change their touch-typing habits to actually *use* the new symbols.

Why is that crucial? You can lead a horse to water, but you can't make it drink. People will use whatever orthographies they want. Make it reasonable and feasible for people to do the right thing, and let time and social pressure move them in the right direction. Look at where we've gone on the whole `quote' issue. Hopefully in another 10 years, a lot of people will be using curved quotes - another thing it's bloody impossible to get from the keyboard.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
What we've got is a blue-light special on truth. It's the hottest thing
with the youth. -- Information Society, Peace and Love, Inc.
Re: broken bar and UCS keyboard
On Thu, Feb 21, 2002 at 10:09:20PM -0500, Glenn Maynard wrote:
> Would you add separate open double quote, close double quote, open
> single quote, close single quote, neutral single and double quotes,
> apostrophe and backtick keys, too? They're all useful, but that's
> one heck of a keyboard. :)

No. I'd get rid of the neutral quotes, the apostrophe and the backtick. I don't know about everyone else, but I could live with switching between a programmer's/Unix keyboard, with #'`~^*_\/| on it, and one that has, say, curved quotes, Euro, dead keys for French and German, and daggers.

-- 
David Starner / Давид Старнэр - [EMAIL PROTECTED]
What we've got is a blue-light special on truth. It's the hottest thing
with the youth. -- Information Society, Peace and Love, Inc.
Re: broken bar and UCS keyboard
On Thu, 21 Feb 2002, Glenn Maynard wrote:
> > ...Having them grumble that the stupid software keeps picking the
> > wrong one would be an *IMPROVEMENT*.
> When they're visibly very similar, do you think most users are going
> to use them right, no matter how accessible they are?

Possibly not. But teaching people to make this distinction was exactly what was originally asked for, at the start of this branch of the discussion. The issue *wasn't* how a handful of cognoscenti could more easily type the symbols in question.

I think there is some small hope that proper usage could *eventually* become a well-known sign of careful composition, in the same way that proper use of uppercase and lowercase letters is now. Note that I say some small hope, not a near certainty. But I do not think there is any chance at all if people see only hyphens in their output; that encourages them to believe that there is no distinction to be made, that hyphen is proper for all purposes.

Henry Spencer
[EMAIL PROTECTED]