Re: ffs and utf8
Dmitrij had some questions about my intent; I'll try to clarify.

2014/12/02 18:57 Joel Rees <joel.r...@gmail.com>:
> (apologies for the html.)
> 2014/12/02 9:52 Dmitrij D. Czarkoff <czark...@gmail.com>:

[... and others. Snipped context: There was some discussion of what kind of file names should be allowed to be stored. There was something I read as a suggestion for using a normal form based in Unicode as a target for enforced file name conversion. There were some attempts to discuss reasons why file names should not be forcibly converted. And then communication seemed to really break down when I tried to present a semi-obvious example of why seemingly innocuous conversions turn out to be not so innocuous after all. And, since that didn't work, I tried with an example closer to the suggested normal form:]

> > Joel Rees said:
> > > Now, what would you do with this?
> > >
> > > ジョエル
> > >
> > > Why not decompose it to the following?
> > >
> > > ｼﾞｮｴﾙ

Which didn't communicate the problem, either.

> > Because it is not what Unicode normalization is.
>
> Well, it definitely isn't Unicode normalization. And there is a reason
> it isn't, even though there were many who thought the Unicode standard
> shouldn't include code points for wide-form glyphs.
>
> Let's try one more. I think you have said enough that I can infer that
> your preferred normal form is the decomposed form. So, given that your
> normalization has resulted in a file named ジョエルの歌, and given the
> necessity to send it back where it came from, how do you know whether
> or not it should be restored to ジョエルの歌 before you send it back?

[...]

But normalization is a red herring in this context. You may personally have no problems with filename conversions improperly done, but I am not willing to take them lightly where my data is concerned. I may have a NAS device that I'm using for backup without compression/amalgamation (i.e., tar/zip), and if I have a file with a decomposed name backed up on the NAS, I don't want it automatically converted to the composed form when it is restored, the existence of normal forms notwithstanding.

Unix file names can handle UTF-8 encoded Unicode file names without losing data because no conversion is necessary. There may be issues with displaying them, but the file name itself is safe, because '/' is always '/' and '\0' is always '\0'. You can even handle broken UTF-8, and unconverted UTF-16/32 of whatever byte order spit into the file name, as a sequence of bytes, if and only if you escape NUL, slash, and your escape character properly, restoring the escaped characters when putting the file names on the network.

Normalization alone does not know how to restore a potentially normalized name. It needs some sort of flag character that says this name was normalized, and a way to choose between de-normalized forms when more than one denormalized form maps to one particular normal form. The last time I looked, the Unicode standard itself stated that this was the case, and that normalized forms were not recommended for such purposes. The craziness currently infecting the entire industry leaves me with no confidence that such is still the case.

I haven't used Apple OSes since around 10.4, but Mac OS X was doing a thing where certain well-known directory names were aliased according to the current locale. For instance, the user's music directory was shown as 「音楽」 when the locale was set to ja_JP.UTF-8. This is useful to desktop users, but is sometimes confusing when you log in via ssh from a terminal that does not display Japanese and fails to declare itself as such. It's convenient, but even this can cause problems when backing up the entire home or user directory, if the backup software doesn't know to ask for the OS canonical name.

Again, apologies for using my (erk) Android device and spitting html at the list.
Joel Rees Computer memory is just fancy paper, CPUs just fancy pens. All is a stream of text flowing from the past into the future.
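[Editorial aside: Joel's byte-preserving escape idea can be sketched in a few lines. This is a hypothetical scheme, not anything specified in the thread: the '%' escape byte, the hex encoding, and the function names are all invented for illustration. The point is only that NUL, '/', and the escape byte itself are reversibly encoded, so even broken UTF-8 survives a round trip onto the network and back.]

```python
# Hypothetical byte-preserving escape for shipping raw Unix filenames
# over a protocol: NUL, '/', and the escape byte '%' are hex-escaped;
# every other byte passes through untouched, so invalid UTF-8 survives.
ESCAPE = ord('%')
SPECIAL = {0x00, ord('/'), ESCAPE}

def escape_name(raw: bytes) -> bytes:
    out = bytearray()
    for b in raw:
        if b in SPECIAL:
            out += b'%%%02X' % b      # e.g. '/' becomes b'%2F'
        else:
            out.append(b)
    return bytes(out)

def unescape_name(esc: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(esc):
        if esc[i] == ESCAPE:
            out.append(int(esc[i + 1:i + 3], 16))  # decode the two hex digits
            i += 3
        else:
            out.append(esc[i])
            i += 1
    return bytes(out)
```

The escaped form is safe to embed in a path or a NUL-terminated string, and unescaping restores the original bytes exactly.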
Re: ffs and utf8
Joel Rees writes:
> You can even handle broken UTF-8 and unconverted UTF-16/32 of whatever
> byte order spit into the file name as a sequence of bytes if and only
> if you escape NUL, slash, and your escape character properly, restoring
> the escaped characters when putting the file names on the network.

This is just asking for security issues. It's the same kind of thinking that caused the designers of Java to allow embedding NUL in strings as 0xc0 0x80, or CESU-8, where you can encode astral characters with surrogate pairs instead of just writing the character directly: the kinds of things that make people think Unicode is complex and prone to security issues, even though neither is allowed by the UTF-8 spec!

> Normalization alone does not know how to restore a potentially
> normalized name. It needs some sort of flag character that says this
> name was normalized, and a way to choose between de-normalized forms
> when more than one denormalized form maps to one particular normal
> form.

Once you start stacking multiple accents this becomes unworkable.

> I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> thing where certain well-known directory names were aliased according
> to the current locale. For instance, the user's music directory was
> shown as 「音楽」 when the locale was set to ja_JP.UTF-8.

IMO this is totally crazy behavior and unrelated to the Unicode issue.

--
Anthony J. Bentley
Re: ffs and utf8
Anthony J. Bentley said:
> I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> thing where certain well-known directory names were aliased according
> to the current locale. For instance, the user's music directory was
> shown as 「音楽」 when the locale was set to ja_JP.UTF-8.
>
> IMO this is totally crazy behavior and unrelated to the Unicode issue.

GNOME does this too. It goes even further: it proposes to rename the XDG directories if the locale changes. Most amusingly, if you happen to run GNOME and Firefox with an English locale and then switch to a non-English locale, GNOME will rename the XDG directories to the new locale's defaults, and Firefox will re-create ~/Desktop.

I rarely have to deal with systems with non-English locales, but each and every time I have to, I am terrified by the changes since the last time.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
On Wed, Dec 3, 2014 at 9:09 PM, Dmitrij D. Czarkoff <czark...@gmail.com> wrote:
> Anthony J. Bentley said:
> > I haven't used Apple OSes since around 10.4, but Mac OS X was doing a
> > thing where certain well-known directory names were aliased according
> > to the current locale. For instance, the user's music directory was
> > shown as 「音楽」 when the locale was set to ja_JP.UTF-8.
> >
> > IMO this is totally crazy behavior and unrelated to the Unicode issue.
>
> GNOME does this too. It goes even further: it proposes to rename the
> XDG directories if the locale changes. Most amusingly, if you happen to
> run GNOME and Firefox with an English locale and then switch to a
> non-English locale, GNOME will rename the XDG directories to the new
> locale's defaults, and Firefox will re-create ~/Desktop.
>
> I rarely have to deal with systems with non-English locales, but each
> and every time I have to, I am terrified by the changes since the last
> time.

8-/

One of the reasons I quit using GNOME.

If there were a way of specifying the initial locale when you create a new login id, that locale could specify the language to create these directory names in, and then they should never change. My memory is that you have to log in once to do that, however.

Maybe it would be better just to not make those directories until they are needed by an application, and then ask the user to name them instead of providing standard names.

--
Joel Rees

Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
First of all, I really don't believe that preservation of non-canonical form should be a consideration for any software. There is no single reason to allow non-canonical forms to exist at all, while there are several reasons to avoid them. More so for foreign encodings in filenames: if you are trying to store UTF-16 names on a system with a UTF-8 locale, you should be converting, not escaping. Doing otherwise is just asking for trouble.

Next, I assume that the ability to enter filenames trumps the ability to preserve the original filename on Unix-like systems. In most cases right now these two values don't clash, because user input is normalized from the very beginning in the IME. That said, there may be exceptions. E.g. several mail clients won't normalize the filename if the input encoding matches the encoding of the attachment. Thus, having received a file with a non-ASCII filename from a Mac, you'll end up being unable to address it from the shell even if it was typed using exactly the same keyboard layout you use. I don't see how this situation may be justified. The rare cases when original filenames must be preserved byte for byte warrant some special handling (e.g. storing the filenames separately elsewhere, or preserving the whole files with names and attributes in some archive or other form of special database).

Finally, provided that both ends of a network communication use canonical forms for Unicode, the matter of storing a file remotely and then receiving it back with the filename intact is simply a matter of normalization on the receiver's side. That is: if you prefer your local files in NFD, and your NAS uses NFC, you should simply normalize filenames when you receive files back. The only potential problem here is compatibility normalizations, but these are already problematic enough to be avoided in all cases where NFD or NFC do the job.

--
Dmitrij D. Czarkoff
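[Editorial aside: Dmitrij's last point, that receiving a file back intact is "simply a matter of normalization on the receiver's side", is easy to demonstrate with Python's unicodedata. This is a sketch of the principle only; the filename is invented for illustration.]

```python
import unicodedata

# An NFC-preferring client stores a file on an NFD-preferring NAS.
local = unicodedata.normalize('NFC', 'cafe\u0301')   # 'café', é is one codepoint (U+00E9)
stored = unicodedata.normalize('NFD', local)         # NAS spelling: 'e' + combining U+0301

assert local != stored                               # the codepoint sequences differ,
assert unicodedata.normalize('NFC', stored) == local # but renormalizing restores the local name
```

The round trip works precisely because NFC and NFD are canonical forms: each is a deterministic respelling of the same abstract name, so the receiver needs no extra information to convert back.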
Re: ffs and utf8
2014/12/03 22:23 Dmitrij D. Czarkoff <czark...@gmail.com>:
> First of all, I really don't believe that preservation of non-canonical
> form should be a consideration for any software.

There is no particular canonical form for some kinds of software. Unix, in particular, happens to have file name limitations that are compatible with all versions of Unicode past 2.0, at least, in UTF-8, but it has no native encoding. Most of the tools support ASCII, and many now support Unicode. But there is no native encoding. That's one of the strengths of Unix.

> There is no single reason to allow non-canonical forms to exist at all,

Non-canonical forms in what context?

> while there are several reasons to avoid them.

Which non-canonical forms?

> More so for foreign encodings in filenames -

Define "foreign encoding", too. Make sure your definition works for my context. Now, if you don't mind keeping my data away from your machine, maybe it's okay if your definition doesn't work for my context. For some 7 billion definitions of "me".

> if you are trying to store UTF-16 names on a system with a UTF-8
> locale, you should be converting, not escaping.

Not much argument with that. Many things that can be done should not necessarily be done. Most of the time, anyway. There may be some special cases, but you are talking about file names, and I can't think of any, right off the bat.

> Doing otherwise is just asking for trouble.

Oh, I just thought of a couple of exceptions. Theoretical at this point, but definitely exceptions. There's no rule that an OS has to use byte-string file names. (And you don't have to do the stupid things a certain well-known OS does, which uses UTF-16 as its native transform and Unicode as its native encoding.) But you know that.

> Next, I assume that the ability to enter filenames trumps the ability
> to preserve the original filename on Unix-like systems.

Entering file names is a function of the tools, not of the OS. And if you want tools that are limited to NFD, you are free to build and use them.

> In most cases right now these two values don't clash, because user
> input is normalized from the very beginning in the IME.

Choice, function, and construction of the input stack (and output stack) is nearly completely independent of the OS (for any decent OS).

> That said, there may be exceptions. E.g. several mail clients won't
> normalize the filename if the input encoding matches the encoding of
> the attachment.

Mail clients are also pretty independent of the OS.

> Thus, having received a file with a non-ASCII filename from a Mac,
> you'll end up being unable to address it from the shell even if it was
> typed using exactly the same keyboard layout you use.

Keyboard layout is independent of the OS. And it is actually possible to set up an OpenBSD keyboard and input method that closely mimics a Macintosh.

> I don't see how this situation may be justified.

It doesn't need to be. It only needs to be worked around.

> The rare cases when original filenames must be preserved byte for byte
> warrant some special handling (e.g. storing the filenames separately
> elsewhere, or preserving the whole files with names and attributes in
> some archive or other form of special database).

Actually, the contexts in which data handling should be orthogonal to filename encodings are the more common ones. The OS has to do a lot that the user never sees, and those internal functions just start fighting each other when they start making assumptions like encodings.

> Finally, provided that both ends of a network communication use
> canonical forms for Unicode, the matter of storing a file remotely and
> then receiving it back with the filename intact is simply a matter of
> normalization on the receiver's side.

As long as you don't drop bytes somehow on the way from here to there.

> That is: if you prefer your local files in NFD, and your NAS uses NFC,
> you should simply normalize filenames when you receive files back.

Not OS issues. Application issues. Maybe tool issues, for a limited subset of tools.

> The only potential problem here is compatibility normalizations, but
> these are already problematic enough to be avoided in all cases where
> NFD or NFC do the job.

Broken compatibility normalizations get invented precisely because OS architects think an OS needs a native encoding. Remember, the Universal Transformation Formats were invented independently of Unicode. They were adopted by the Unicode Consortium about the time the Consortium finally became convinced that there really are more than 65,536 character-like objects that need a code point in a modern information encoding scheme. UTF-8 and Unicode are not equivalent.

Joel Rees

Computer memory is just fancy paper, CPUs just fancy pens. All is a stream of text flowing from the past into the future.
Re: ffs and utf8
Joel Rees writes:
> 2014/12/03 22:23 Dmitrij D. Czarkoff <czark...@gmail.com>:
> > First of all, I really don't believe that preservation of
> > non-canonical form should be a consideration for any software.
>
> There is no particular canonical form for some kinds of software. Unix,
> in particular, happens to have file name limitations that are
> compatible with all versions of Unicode past 2.0, at least, in UTF-8,
> but it has no native encoding.

To me, the current state of affairs--where filenames can contain anything and the same filename can and does get interpreted differently by different programs--feels extremely dangerous. Moving to a single, well-defined encoding for filenames would make things simpler and safer. Well, it *might*. That's why we're discussing this carefully, to figure out whether something like this is actually workable.

There are two kinds of features being discussed:

1) Unicode normalization. This is analogous to case insensitivity: multiple filenames map to the same (normalized) filename.

2) Disallowing particular characters. Bytes 1-31 and invalid UTF-8 sequences are popular examples.

Maybe one is workable. Maybe both are, or neither.

Say I have a hypothetical machine with the above two features (normalizing to NFC, disallowing 1-31/invalid UTF-8). Now I log into a typical Unix "anything but \0 or /" machine, via SFTP or whatever. What are the failure modes?

The first kind is that I could type "get x" followed by "get y", where x and y are canonically the same in Unicode but represented differently because they're not normalized on the remote host. I would expect this to work smoothly: first I download x to NFC(x), and then y overwrites it.

The second kind is that I could type "get z", where z contains an invalid character. How should my system handle this? Error as if I had asked for a filename that's too long? Come up with a new errno? I don't know, but on this hypothetical machine it should fail somehow.

But creating new files is only part of the problem. If we still allow such names in existing files, we lose all the security/robustness benefits and just annoy ourselves by adding restrictions with no point. So say I mount a filesystem containing the same files x, y, and z. What happens?

- Fail to mount? (Simultaneously simplest, safest, and least useful)
- Hide the files? (Seems potentially unsafe)
- Try to escape the filenames? (Seems crazy)

Is it currently possible to take a hex editor and add / to a filename (as opposed to a pathname) inside a disk image? If that's possible, how do systems currently deal with it? Because it's the same problem.

FAT32 has both case insensitivity and disallowed characters. How well does OpenBSD handle those restrictions? If not optimally, then how can they be made better? If it already handles them with aplomb, then is it applicable to the above scenarios?

--
Anthony J. Bentley
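[Editorial aside: the "get x followed by get y" collision is concrete: two remote names that differ codepoint-for-codepoint collapse to one local name on a normalizing client. A sketch with Python's unicodedata; the filenames are invented for illustration.]

```python
import unicodedata

x = '\u00c5.txt'    # 'Å.txt' with precomposed U+00C5
y = 'A\u030a.txt'   # 'A' followed by combining ring above (U+030A)

assert x != y       # two distinct directory entries on the lax remote host

# On the NFC-normalizing client, both names collapse to the same filename,
# so downloading y silently overwrites the earlier download of x.
assert unicodedata.normalize('NFC', x) == unicodedata.normalize('NFC', y)
```

This is exactly the case-insensitivity analogy from the message: the remote host distinguishes the names, the normalizing client cannot.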
Re: ffs and utf8
> Joel Rees writes:
> > There is no particular canonical form for some kinds of software.
> > Unix, in particular, happens to have file name limitations that are
> > compatible with all versions of Unicode past 2.0, at least, in UTF-8,
> > but it has no native encoding.
>
> To me, the current state of affairs--where filenames can contain
> anything and the same filename can and does get interpreted differently
> by different programs--feels extremely dangerous. Moving to a single,
> well-defined encoding for filenames would make things simpler and
> safer. Well, it *might*. That's why we're discussing this carefully, to
> figure out whether something like this is actually workable.
>
> There are two kinds of features being discussed:
>
> 1) Unicode normalization. This is analogous to case insensitivity:
> multiple filenames map to the same (normalized) filename.
>
> 2) Disallowing particular characters. Bytes 1-31 and invalid UTF-8
> sequences are popular examples.
>
> Maybe one is workable. Maybe both are, or neither.
>
> Say I have a hypothetical machine with the above two features
> (normalizing to NFC, disallowing 1-31/invalid UTF-8). Now I log into a
> typical Unix "anything but \0 or /" machine, via SFTP or whatever. What
> are the failure modes?
>
> The first kind is that I could type "get x" followed by "get y", where
> x and y are canonically the same in Unicode but represented differently
> because they're not normalized on the remote host. I would expect this
> to work smoothly: first I download x to NFC(x), and then y overwrites
> it.
>
> The second kind is that I could type "get z", where z contains an
> invalid character. How should my system handle this? Error as if I had
> asked for a filename that's too long? Come up with a new errno? I don't
> know, but on this hypothetical machine it should fail somehow.
>
> But creating new files is only part of the problem. If we still allow
> such names in existing files, we lose all the security/robustness
> benefits and just annoy ourselves by adding restrictions with no point.
> So say I mount a filesystem containing the same files x, y, and z. What
> happens?
>
> - Fail to mount? (Simultaneously simplest, safest, and least useful)
> - Hide the files? (Seems potentially unsafe)
> - Try to escape the filenames? (Seems crazy)
>
> Is it currently possible to take a hex editor and add / to a filename
> (as opposed to a pathname) inside a disk image? If that's possible, how
> do systems currently deal with it? Because it's the same problem. FAT32
> has both case insensitivity and disallowed characters. How well does
> OpenBSD handle those restrictions? If not optimally, then how can they
> be made better? If it already handles them with aplomb, then is it
> applicable to the above scenarios?

http://en.wikipedia.org/wiki/Where%27s_the_beef%3F

I mean, where are the diffs for all these issues? Oh. There is no beef. This is idle chatter hoping someone supplies some secret sauce that makes a disparate audience with different demands all happy.

Why don't you guys go write some code and prove your points? Maybe this is simply a very hard problem, and not one that is going to be solved by people who simply talk about it?
Re: ffs and utf8
Joel Rees said:
> Maybe it would be better just to not make those directories until they
> are needed by an application, and then ask the user to name them
> instead of providing standard names.

Actually, it is still workable if you carry your ~/.config/user-dirs.dirs around, so that you can install it before you first log into GNOME. I used this approach to sanitize the structure of my home directory when I needed a working GNOME desktop.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
(apologies for the html.)

2014/12/02 9:52 Dmitrij D. Czarkoff <czark...@gmail.com>:
> Joel Rees said:
> > Now, what would you do with this?
> >
> > ジョエル
> >
> > Why not decompose it to the following?
> >
> > ｼﾞｮｴﾙ
>
> Because it is not what Unicode normalization is.

Well, it definitely isn't Unicode normalization. And there is a reason it isn't, even though there were many who thought the Unicode standard shouldn't include code points for wide-form glyphs.

Let's try one more. I think you have said enough that I can infer that your preferred normal form is the decomposed form. So, given that your normalization has resulted in a file named ジョエルの歌, and the necessity to send it back where it came from, how do you know whether or not it should be restored to ジョエルの歌 before you send it back?

[...]

--
Joel Rees
Re: ffs and utf8
Hi Ingo,

Ingo Schwarze writes:
> While the article is old, the essence of what Schneier said here still
> stands, and it is not likely to fall in the future:
>
>   https://www.schneier.com/crypto-gram-0007.html#9
>
> The most interesting sentence here is: "Unicode is just too complex to
> ever be secure."

This is sort of valid, and it's why the only sane way to handle UTF-8 is to ignore the complexities and escape methods he alluded to. Codepoints should be represented with the shortest possible sequence. Surrogate pairs should not be encoded in UTF-8. Byte order marks should not exist in UTF-8. UTF-8 parsers should handle encoding errors in the same well-defined way: abort decoding on an invalid sequence and retry starting with the second byte.

I like how Plan 9 handled Unicode. Aside from inventing UTF-8 (an encoding scheme that actually makes sense with C strings, unlike the disastrous designs-by-committee that were UCS-2 and UTF-16), they basically used it as just a way to have more than 256 characters. Most parts of Unicode proper, like collation or canonical equivalence, were simply dropped. Noncompliant? Sure, but it made things dramatically simpler. In other words, divorce UTF-8 the encoding from Unicode the standard.

Homograph attacks are a real concern with any large character set. But:

1) I've been tricked by... well, not attacks, but simply badly written filenames with plain old ASCII: e instead of a, spaces instead of underscores, 0/O or l/I/1. It's easy to fool the human mind by feeding it something that sort of looks like what's expected.

2) Given that filenames can contain literally anything except / and \0, there are so many other attacks that enforcing valid UTF-8 in filenames would be a hypothetical improvement (not that I'm necessarily advocating doing that in OpenBSD). Spaces are bad enough. How many shell scripts handle *newlines* correctly? What about VT100 escape sequences? This whole thing is a security nightmare already.
I happily use UTF-8 filenames on OpenBSD, and have done so for years. -- Anthony J. Bentley
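[Editorial aside: the strict rules Anthony lists (shortest-form only, no encoded surrogates) are exactly what a conforming UTF-8 decoder enforces. Python's codec, for instance, rejects all of the following; the byte sequences are standard examples, not anything from the thread.]

```python
# Sequences a strict UTF-8 decoder must reject:
bad = [
    b'\xc0\x80',          # overlong encoding of NUL (Java's "modified UTF-8")
    b'\xed\xa0\x80',      # UTF-8-encoded surrogate U+D800 (the CESU-8 trick)
    b'\xf0\x82\x82\xac',  # overlong encoding of U+20AC (euro sign)
    b'\xff',              # a byte that can never appear in UTF-8
]

for seq in bad:
    try:
        seq.decode('utf-8')
        raise AssertionError('decoder accepted invalid sequence %r' % (seq,))
    except UnicodeDecodeError:
        pass  # rejected, as the spec requires
```

A decoder that accepts any of these is exactly the kind of "escape method" that turns Unicode handling into a security problem; the spec-conforming behavior is to fail.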
Re: ffs and utf8
On Sat, Nov 29, 2014 at 09:48:53PM +0100, Dmitrij D. Czarkoff wrote:
> That said, the standard provides just enough facilities to make
> filesystem-related aspects of Unicode work nicely, particularly in the
> case of utf-8. E.g. the ability to enforce NFD for all operations on
> file names could actually make several things more secure by preventing
> homograph attacks.

How do you 'enforce' NFD? Let the kernel normalize (ie /destructively/ transform) the file names behind the user's back, so that a file will be listed with a different name than that with which it was created? That's very nice and secure, indeed.

Reject file names that are not in NFD? But if you're into preventing people from using file names they want to use and have used without problems until now, why not just go all the way back to uppercase + the dot?

And btw, normalization won't do much about 'homographs':

$ echo ∕еtс∕раsswd
$ rm ∕еtс∕раsswd
$
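[Editorial aside: the shell example above works because normalization never touches confusables. They are distinct characters, not alternate encodings of the same characters. A quick check with Python's unicodedata, spelling the lookalike path with explicit escapes:]

```python
import unicodedata

# DIVISION SLASH plus Cyrillic letters that render like their Latin lookalikes:
spoof = '\u2215\u0435t\u0441\u2215\u0440\u0430sswd'   # looks like /etc/passwd
real = '/etc/passwd'

assert spoof != real
assert unicodedata.name(spoof[0]) == 'DIVISION SLASH'
assert 'CYRILLIC' in unicodedata.name(spoof[1])        # U+0435, not Latin 'e'

# No normalization form folds the confusables into their ASCII lookalikes:
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    assert unicodedata.normalize(form, spoof) != real
```

Detecting this kind of spoofing requires a confusables table (Unicode's "skeleton" mapping), which is an entirely separate mechanism from canonical or compatibility normalization.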
Re: ffs and utf8
pizdel...@gmail.com said:
> How do you 'enforce' NFD? Let the kernel normalize (ie /destructively/
> transform) the file names behind the user's back, so that a file will
> be listed with a different name than that with which it was created?
> That's very nice and secure, indeed.

I would enforce normalization at filename access time (open(), fopen(), readdir(), etc.). Yes, destructively transform. I would reject filenames that won't decode. If this is documented, I just don't see how it is behind the user's back, and it at least partially solves the problem of accessing the right files.

FWIW, I stopped using Unicode filenames after I found that I couldn't type in the name of a file containing only glyphs that I could type, just because at that time I used a keyboard layout with combining diacritical marks instead of dead keys, so my input was NFD, while the name of the file I had gotten from somewhere was NFC.

> And btw, normalization won't do much about 'homographs':
>
> $ echo ∕еtс∕раsswd
> $ rm ∕еtс∕раsswd
> $

This is a separate problem. My suggestion does not help here, which does not render it useless for other cases.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
2014-12-01 10:20 GMT+01:00 Dmitrij D. Czarkoff <czark...@gmail.com>:
> pizdel...@gmail.com said:
> > How do you 'enforce' NFD? Let the kernel normalize (ie
> > /destructively/ transform) the file names behind the user's back, so
> > that a file will be listed with a different name than that with which
> > it was created? That's very nice and secure, indeed.
>
> I would enforce normalization at filename access time (open(), fopen(),
> readdir(), etc.). Yes, destructively transform. I would reject
> filenames that won't decode. If this is documented, I just don't see
> how it is behind the user's back, and it at least partially solves the
> problem of accessing the right files.

I don't know if I read this wrong, but a new list of rules on how a filename can and must look would override the currently allowed charsets and mangle names that my programs would want to write? No please.

--
May the most significant bit of your life be positive.
Re: ffs and utf8
On Mon, Dec 01, 2014 at 10:38:40AM +0200, pizdel...@gmail.com wrote:
> On Sat, Nov 29, 2014 at 09:48:53PM +0100, Dmitrij D. Czarkoff wrote:
> > That said, the standard provides just enough facilities to make
> > filesystem-related aspects of Unicode work nicely, particularly in
> > the case of utf-8. E.g. the ability to enforce NFD for all operations
> > on file names could actually make several things more secure by
> > preventing homograph attacks.
>
> How do you 'enforce' NFD?

Anything that stores filenames outside the filesystem (e.g. in a database) for later use will have problems if NFD or NFC is enforced by the filesystem. Version control systems are particularly prone to this issue.

Apple HFS+ does such normalisation behind the application's back. Put a file with a funky name on disk, read the containing directory back, and you might not find any directory entry matching the byte sequence you wrote. Not a smart idea if you ask me, since it breaks applications which weren't written with normalization in mind.

Example: http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Git suffers from the same problem (and ended up committing a patch that simply ignores compatibility with existing repositories from version 1.7.12 onwards?!?):
http://mail-archives.apache.org/mod_mbox/subversion-users/201208.mbox/%3C501D29CF.6000308%40web.de%3E

The only VCS I know of which normalized from day one is Veracity. Not because the developers were experts on Unicode, but because they had the benefit of hindsight; they link to the above SVN bug from a comment in their code.

Design-by-committee giving us standards that ignore existing realities.
Re: ffs and utf8
On Mon, Dec 01, 2014 at 10:20:08AM +0100, Dmitrij D. Czarkoff wrote:
> I would enforce normalization at filename access time (open(), fopen(),
> readdir(), etc.). Yes, destructively transform. I would reject
> filenames that won't decode. If this is documented, I just don't see
> how it is behind the user's back, and it at least partially solves the
> problem of accessing the right files.

Bad idea. See my other post. Apple did this and broke existing applications.
Re: ffs and utf8
Stefan Sperling said:
> Bad idea. See my other post. Apple did this and broke existing
> applications.

OpenBSD changed time_t and broke existing applications, but hardly anyone thinks that was a bad idea. Fancy filenames have long been known to be problematic, so filename policy enforcement is a breakage of the same sort. Apple have taken the lead here, and they may eventually do the same thing to the industry as OpenBSD did by changing time_t. FWIW, it is rather safe to normalize filenames now, as the related problems are already being solved due to the breakage on OS X.

Although I might be missing something, an additional function which takes a desired filename and outputs the normalized filename could probably solve this problem on the applications' side. Such a function, if implemented in libc, could even allow system administrators to enforce a local file naming preference as system-wide policy.

P.S.: I don't actually propose to implement filename normalization in OpenBSD right now. I've merely thrown this idea out to generate a potentially fruitful discussion. Don't mistake it for a feature request or a demand of some kind.

--
Dmitrij D. Czarkoff
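[Editorial aside: the helper Dmitrij gestures at, one that takes a desired filename and returns the policy-normalized spelling while rejecting undecodable input, might look like the sketch below. The function name, the NFD default, and the bytes-in/bytes-out interface are all assumptions for illustration, not anything actually proposed for libc.]

```python
import unicodedata

def normalize_filename(name: bytes, form: str = 'NFD') -> bytes:
    """Return the filename respelled in the given Unicode normalization
    form; reject names that are not valid UTF-8.  The normalization form
    stands in for a hypothetical system-wide naming policy."""
    text = name.decode('utf-8')   # raises UnicodeDecodeError on invalid input
    return unicodedata.normalize(form, text).encode('utf-8')
```

For example, normalize_filename('café'.encode()) returns the decomposed spelling b'cafe\xcc\x81', and a name containing an invalid byte such as 0xff is rejected with an exception rather than silently mangled.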
Re: ffs and utf8
2014-12-01 12:05 GMT+01:00 Dmitrij D. Czarkoff <czark...@gmail.com>:
> Stefan Sperling said:
> > Bad idea. See my other post. Apple did this and broke existing
> > applications.
>
> OpenBSD changed time_t and broke existing applications, but hardly
> anyone thinks that was a bad idea. Fancy filenames have long been known
> to be problematic, so filename policy enforcement is a breakage of the
> same sort.

Well, even if the implementation broke old assumptions about how large the storage of a time_t is, it did not make previously valid dates invalid. There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work.

--
May the most significant bit of your life be positive.
Re: ffs and utf8
Janne Johansson said:
> There is quite a bit of difference between changing the storage format
> and making some dates impossible that previously did work.

I don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble? I think it is, although the unanimous negative reaction hints that I am probably missing something important.

--
Dmitrij D. Czarkoff
Re: ffs and utf8
On Mon, Dec 1, 2014 at 8:43 PM, Dmitrij D. Czarkoff <czark...@gmail.com> wrote:
> Janne Johansson said:
> > There is quite a bit of difference between changing the storage
> > format and making some dates impossible that previously did work.
>
> I don't think so. Something got changed, things got broken and need to
> be fixed. The only real question is: is the change worth the trouble? I
> think it is, although the unanimous negative reaction hints that I am
> probably missing something important.

Hmm. What would you suggest doing with the following file name?

／ｅｔｃ

(You may need a Japanese font to display it.)

If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that?

Maybe your company has a set of normalization rules that works okay for your company. Maybe my company doesn't work well with those rules. That's the problem.

--
Joel Rees

Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
Joel Rees said: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that? I am not sure I get you. I proposed using NFD for filenames. In a system implementing that, the filename you provide as an example above would be stored as-is, as it is already NFD and can't be further decomposed. I never suggested NFKD, as your message implies. -- Dmitrij D. Czarkoff
Re: ffs and utf8
On Mon, Dec 01, 2014 at 12:43, Dmitrij D. Czarkoff wrote: Janne Johansson said: There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work. Don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble. I think it is, although unanimous negative reaction hints that I am probably missing something important. Fixing time_t did not suddenly make OpenBSD systems unable to communicate with other systems with other time_t sizes. It was an implementation detail, but the various protocols and formats that embed dates and times in them were not changed. Your proposed change changes an important protocol: the one that lets me save files I receive from others to my filesystem. When I can no longer save web pages or email attachments and send them back to the sender with the same name, you have broken the protocol. You may also think of it this way. 64-bit time_t permitted more times to be represented. Long after the tar format itself cannot handle the current date, I will still be able to unpack old existing archives. You are proposing that *fewer* filenames be represented. My existing archives with forbidden filenames will no longer work.
Re: ffs and utf8
Joel Rees, 01 Dec 2014 22:04: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? this example has potential to confuse, of course; however, NFD normalizing will not turn it into '/etc', as these are not composite characters: '／': unicode cat=Po 'ｅ': unicode cat=Ll 'ｔ': unicode cat=Ll 'ｃ': unicode cat=Ll (Po=other punctuation, Ll=lowercase letter) -f -- i quit drinking/smoking/ sex once. very boring 15 minutes.
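The distinction frantisek draws here can be checked mechanically. A minimal Python sketch using the standard unicodedata module (the fullwidth string "／ｅｔｃ" is the example under discussion; the assertions are mine, not from the thread):

```python
import unicodedata

# "／ｅｔｃ": FULLWIDTH SOLIDUS plus fullwidth e, t, c
fullwidth = "\uff0f\uff45\uff54\uff43"

# NFD applies only canonical decompositions; the fullwidth forms have
# none, so an NFD-enforcing filesystem would store this name unchanged.
assert unicodedata.normalize("NFD", fullwidth) == fullwidth

# NFKD additionally applies compatibility decompositions, which fold
# the fullwidth forms into plain ASCII -- including a path separator,
# which is exactly the collision Joel was worried about.
assert unicodedata.normalize("NFKD", fullwidth) == "/etc"
```

So the example is only dangerous under NFKD/NFKC, which is why Czarkoff stresses that he proposed NFD.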
Re: ffs and utf8
Stefan Sperling, 29 Nov 2014 18:17: Are you aware of 'detox' package? There's also converters/convmv $ touch »´ÁÉǑÄ« $ convmv * wrong/unknown from encoding! $ convmv -f utf8 -t latin1 * Starting a dry run without changes... iso-8859-1 doesn't cover all needed characters for: ./»´ÁÉǑÄ« To prevent damage to your files, we won't continue. First fix this or correct options! convmv is a precision tool. it needs to know the source and target encoding, and is very careful. in contrast my tool is much more blunt, because i dont care about an exact 1:1 mapping as on my disks punctuation and diacritics are not welcome. -f -- life is like... an analogy.
Re: ffs and utf8
On Mon, Dec 1, 2014 at 11:13 PM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Joel Rees said: Hmm. What would you suggest doing with the following file name? ／ｅｔｃ (You may need a Japanese font to display it.) If you try to normalize it on a *nix box, it will hopefully conflict with your system file permissions. But, then what do you do with it? If you throw it away because it's non-normal, and it happens to have the parts of the new marketing plan that didn't fit under some other category, will the boss be okay with that? I am not sure I get you. I proposed using NFD for filenames. In a system implementing that, the filename you provide as an example above would be stored as-is, as it is already NFD and can't be further decomposed. I never suggested NFKD, as your message implies. Very good. Now, what would you do with this? ジョエル Why not decompose it to the following? ｼﾞｮｴﾙ I know what the Unicode rules say, but my boss says, if I'm going to play with file names, he wants it done his way. And the company across the hall has a policy just a little different, but still not matching Unicode rules. You have to keep rules about making file names for internal use separate from rules about storing filenames received, or the internal system loses its meaning. What use are systems if you have to resort to meaninglessness to use them? ;-P -- Joel Rees All truth is independent in that sphere in which God has placed it, to act for itself, as all intelligence also; otherwise there is no existence.
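Joel's example turns on the difference between fullwidth katakana (ジョエル) and their halfwidth compatibility forms (ｼﾞｮｴﾙ) -- my reading of the decomposition he had in mind. A short Python sketch with the standard unicodedata module, showing that this folding is outside the scope of canonical normalization:

```python
import unicodedata

composed  = "\u30b8\u30e7\u30a8\u30eb"        # ジョエル, fullwidth katakana
halfwidth = "\uff7c\uff9e\uff6e\uff74\uff99"  # ｼﾞｮｴﾙ, halfwidth + voiced mark

# Canonical normalization (NFD/NFC) never touches the halfwidth forms...
assert unicodedata.normalize("NFD", halfwidth) == halfwidth

# ...but compatibility normalization (NFKC) folds them into the
# fullwidth name. After folding, nothing records that the original
# was halfwidth, so the name cannot be restored for the sender.
assert unicodedata.normalize("NFKC", halfwidth) == composed
```

This is the round-trip problem in miniature: more than one de-normalized form maps onto one normal form.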
Re: ffs and utf8
Ted Unangst writes: On Mon, Dec 01, 2014 at 12:43, Dmitrij D. Czarkoff wrote: Janne Johansson said: There is quite a bit of difference between changing the storage format and making some dates impossible that previously did work. Don't think so. Something got changed, things got broken and need to be fixed. The only real question is: is the change worth the trouble. I think it is, although unanimous negative reaction hints that I am probably missing something important. Fixing time_t did not suddenly make OpenBSD systems unable to communicate with other systems with other time_t sizes. It was an implementation detail, but the various protocols and formats that embed dates and times in them were not changed. Your proposed change changes an important protocol: the one that lets me save files I receive from others to my filesystem. When I can no longer save web pages or email attachments and send them back to the sender with the same name, you have broken the protocol. Should I be able to save web pages or email attachments with filenames containing newlines? How about backspaces? What about terminal escape sequences, or ASCII control codes? Yes, these have been possible in Unix since time immemorial. And the fact that to this day there's no way for me to sanitize them terrifies me. -- Anthony J. Bentley
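Anthony's point is easy to demonstrate: the filesystem accepts control characters that a terminal will happily interpret. The filename below is hypothetical, and the sanitize() helper is only a sketch of the '?'-substitution that ls(1) performs with -q:

```python
import os
import re
import tempfile

# A hypothetical attachment name containing a newline, a backspace,
# and the start of an ANSI escape sequence -- all legal in a Unix name.
name = "report\n\x08\x1b[31mfinal.txt"

d = tempfile.mkdtemp()
open(os.path.join(d, name), "w").close()   # the filesystem accepts it
assert name in os.listdir(d)

def sanitize(s):
    # Replace C0 controls and DEL with '?', as ls -q does for display.
    return re.sub(r"[\x00-\x1f\x7f]", "?", s)

assert sanitize(name) == "report???[31mfinal.txt"
```

Without some such filter, piping raw names to a terminal hands the sender control of your escape sequences.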
Re: ffs and utf8
Joel Rees said: Now, what would you do with this? ジョエル Why not decompose it to the following? ｼﾞｮｴﾙ Because that is not what Unicode normalization is. I know what the Unicode rules say, but my boss says, if I'm going to play with file names, he wants it done his way. And now you suggest that the idea of enforcing a local filename policy is bad because the local policy might not be sane. Ok. First, let's decouple the NFD suggestion from local policy. Again, no problems with NFD here. I don't really see any sense in a local policy that demands this conversion, but if your boss needs it, it is not my business. I can't see why you mention it, though: it is a completely unrelated problem. You have to keep rules about making file names for internal use separate from rules about storing filenames received, or the internal system loses its meaning. Are you now speaking of normalization, or of local policy? At any rate, any incoming file has a name, which is encoded somehow. It may be encoded in utf-16le, for example. Now, either you store a filename that you can't read without using iconv or another tool of that kind, or you convert the name to your locale. If your locale happens to use utf-8, you still have to convert one byte sequence to another byte sequence. The conversion I proposed would be destructive, but would maintain Unicode equivalence, so aside from a subtle technicality (the choice of canonical form) the set of glyphs that makes up the filename would remain exactly the same. This is not even a policy, just consistent representation. -- Dmitrij D. Czarkoff
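A minimal illustration of the canonical-equivalence point Czarkoff is making, and of the round-trip question Joel keeps raising (standard unicodedata module; the example character is mine):

```python
import unicodedata

nfc = "\u00e9"      # 'e-acute' as one precomposed code point (NFC)
nfd = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT (NFD)

# Different byte sequences on disk...
assert nfc.encode("utf-8") != nfd.encode("utf-8")

# ...but canonically equivalent: the user sees the same glyph, and
# normalization maps each onto the other form deterministically.
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc

# The catch for backup/restore: once a name is stored in NFD, nothing
# records whether the sender originally used the precomposed bytes.
```

Czarkoff's point is that this conversion preserves the glyphs; Joel's point is that it does not preserve the bytes, and nothing flags that a conversion ever happened.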
Re: ffs and utf8
Joel Rees said: That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Given that I have to cope with Unicode file names every day, I just can't see a more pessimistic approach than allowing arbitrary Unicode codepoints with no sanitization whatsoever. Every now and then I have to use printf(1) and xclip(1x) just because there is no other way to address a file or identify all codepoints of its name. From here I don't see the ability to enforce policy on Unicode strings as being as useless as you put it. -- Dmitrij D. Czarkoff
Re: ffs and utf8
Thomas Bohl said: # ls | cat will display the characters right. Not entirely sure why, though. From the ls(1) manual: | -q Force printing of non-graphic characters in file names as the | character `?'; this is the default when output is to a terminal. -- Dmitrij D. Czarkoff
Re: ffs and utf8
On Sun, Nov 30, 2014 at 6:31 PM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Joel Rees said: That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Given that I have to cope with Unicode file names every day, Same here, FWIW, Japanese. (And then there are the times I have to work on file names encoded in shift-JIS. Fun stuff.) I just can't see a more pessimistic approach than allowing arbitrary Unicode codepoints with no sanitization whatsoever. Pessimistic? Optimistic? Asking for trouble, yes. I generally try to use Romaji (latinized phonetic Japanese, all ASCII, if I avoid the overbar approach to lengthened vowels) when I know a file is going to move to another machine. If file names are strictly phonetic, you can set up a round-trip mapping from Romaji to kana, but most of the time Japanese file names include Kanji, and there is no round-trip mapping that can be meaningfully read by a human. There are ASCII-encoded JIS codes which could be used to produce a round-trip mapping, but I'd need to run the output of ls through some sort of custom filter to make sense of the names. Might be a useful thing to build. Every now and then I have to use printf(1) and xclip(1x) just because there is no other way to address a file or identify all codepoints of its name. From here I don't see the ability to enforce policy on Unicode strings as being as useless as you put it. Not saying it's useless to have a policy. What I'm saying is that Unicode utf-8 has parsing problems independent of issues like characters that appear the same but have separate code points. utf-8 is pretty simple until you start mapping it to real characters.
Getting the mapping right is difficult, which is why you have your policy, I think. One of these days I want to build a ctype library that gives meaningful results for the Japanese subset of the CJK subset of Unicode. But that's only going to help with some of the problems. -- Joel Rees Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
On 2014-11-29, Ingo Schwarze schwa...@usta.de wrote: But Unicode must never be allowed near anything that might get executed as program code, including scripts in interpreted languages, including, but not limited to, the shell. In particular, that means trying to handle Unicode in filenames is a bad idea. Why filenames at all? Just use inode numbers. -- Christian naddy Weisgerber na...@mips.inka.de
ffs and utf8
i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. while working on it, i created some strange test cases (e.g. »´ÁÉǑÄ«) for filenames and i was pleasantly surprised that the files were created/read/renamed/deleted without problems. is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) -f -- mips = meaningless index of processor speed
Re: ffs and utf8
Hello, On 29 November 2014 at 14:02, frantisek holop min...@obiit.org wrote: i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. Are you aware of 'detox' package? -- Regards, Ville
Re: ffs and utf8
frantisek holop, 29 Nov 2014 13:02: while working on it, i created some strange test cases (e.g. »´ÁÉǑÄ«) for filenames and i was pleasantly surprised that the files were created/read/renamed/deleted without problems. i think i should clarify this a bit: they show up perfectly in midnight commander, not in the shell. $ touch »´ÁÉǑÄ« $ ls ?? -f -- to every rule there's an exception vice versa.
Re: ffs and utf8
Ville Valkonen, 29 Nov 2014 14:08: Are you aware of 'detox' package? $ touch »´ÁÉǑÄ« $ detox * $ ls A_A_A_A_C_A_A_ $ touch »´ÁÉǑÄ« $ my_silly_script $ ls aeoa perhaps with some massaging detox can be made to work like my script, i dont know. but that is actually beside the point. i wrote my own 128 lines python script to have fun, see how some stuff works, learn about utf-8, etc. as an added bonus, it is also a mass renamer that can remove and add strings from/to filenames. and when it grows up, it will also autotag music files, create art thumbnails and feed them to cmus! :) where is your detox now? :) -f -- dinner: dead animals and some stuff out of the ground.
Re: ffs and utf8
Shouldn't the aim in 2014 be to have everything working in utf-8?
Re: ffs and utf8
Paolo Aglialoro, 29 Nov 2014 13:56: Shouldn't in 2014 the aim having all working in utf-8? sure. but i like my filenames ascii and whitespaceless. shows my age. -f -- what a nice night for an evening. -- steven wright
Re: ffs and utf8
frantisek holop said: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) As I get it, ffs is entirely utf8 safe because it is not encoding-aware. With whatever locale, the commands $ touch `printf '\aabb\bc'` and $ touch `printf '\201\202\203'` both succeed. (Interestingly, ls | cat in the presence of a filename with an ASCII bell does not actually ring the bell, although backspace works as expected.) These octet arrays may happen to be valid utf-8 as well. -- Dmitrij D. Czarkoff
Re: ffs and utf8
Hi, On 29.11.2014 13:20, frantisek holop wrote: i think i should clarify this a bit: they show perfect in midnight commander, not in shell. $ touch »´ÁÉǑÄ« $ ls ?? -f I had a similar problem some time ago and have been told that the ls tool is not aware of UTF-8. See here for some details: http://marc.info/?l=openbsd-ports&m=135345716931800&w=2 regards Lars
Re: ffs and utf8
On Sat, Nov 29, 2014 at 13:02, frantisek holop wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? or is it some kind of happy accident that it works? :) FFS stores filenames as bytes.
Re: ffs and utf8
On 2014-11-29, frantisek holop min...@obiit.org wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem as IIRC Mac OS is? The former. Unix filesystems accept all bytes for filenames with the exception of 0x2f, which serves as the directory separator, and 0x00, which terminates the string. Any encoding that doesn't conflict with these two restrictions is valid. UTF-8 was invented on Plan 9, a Unix offspring, so no surprise there. -- Christian naddy Weisgerber na...@mips.inka.de
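The bytes-only rule naddy states can be demonstrated from Python by using byte paths (the particular byte values below are arbitrary test data, not from the thread):

```python
import os
import tempfile

d = tempfile.mkdtemp().encode()

# Any byte except 0x00 and 0x2f is legal in a Unix filename --
# including sequences that are not valid UTF-8, plus an ASCII BEL.
raw = b"\x81\x82\x83\x07"
os.close(os.open(os.path.join(d, raw), os.O_CREAT | os.O_WRONLY))

# The kernel stored the bytes verbatim; no encoding was involved.
assert raw in os.listdir(d)

# The two reserved bytes really are special: an embedded NUL is
# rejected before it ever reaches the filesystem.
try:
    os.open(os.path.join(d, b"a\x00b"), os.O_CREAT | os.O_WRONLY)
    raise AssertionError("NUL in a filename should not be accepted")
except ValueError:
    pass
```

This is why "ffs is utf8 safe" is really "ffs never looks": validity of the encoding is purely the application's problem.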
Re: ffs and utf8
On 2014-11-29, frantisek holop min...@obiit.org wrote: $ touch »´ÁÉǑÄ« $ ls ?? If you need a locale-aware ls(1), use the one from the colorls package. (Don't worry, colored output is entirely optional.) -- Christian naddy Weisgerber na...@mips.inka.de
Re: ffs and utf8
Hi, Paolo Aglialoro wrote on Sat, Nov 29, 2014 at 01:56:23PM +0100: Shouldn't in 2014 the aim having all working in utf-8? Most definitely not; that would directly run contrary to some of OpenBSD's most important project goals: correctness, simplicity, security. While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 In conclusion, Unicode can be used for any text where it is a safe assumption that the only thing ever to be done with it is display it to the user, so the only risk would be displaying slightly garbled text to the user. That's why mandoc(1) isn't very worried about handling Unicode characters in manual content. But Unicode must never be allowed near anything that might get executed as program code, including scripts in interpreted languages, including, but not limited to, the shell. In particular, that means trying to handle Unicode in filenames is a bad idea. System tools (like ls(1) and sh(1)) must never attempt to interpret input as Unicode for essentially the same reasons why they must never attempt to encode output in XML. Yours, Ingo
Re: ffs and utf8
On Nov 29 13:02:34, min...@obiit.org wrote: is it true to say then, that ffs is entirely utf8 safe, and/or that ffs is actually an utf-8 encoded filesystem The file names are just strings of bytes. There is nothing UTF8 about them. On Nov 29 14:23:35, czark...@gmail.com wrote: (Interestingly, ls | cat in presence of filename with ASCII bell does not actually ring the bell, although backspace works as expected.) This can't but remind one of Mr. Malcolm Peter Brian Telescope Adrian Umbrella Stand Jasper Wednesday (pops mouth twice) Stoatgobbler John Raw Vegetable (whinnying) Arthur Norman Michael (blows squeaker) Featherstone Smith (whistle) Northcott Edwards Harris (fires pistol, then 'whoop') Mason (chuff-chuff-chuff-chuff) Frampton Jones Fruitbat Gilbert (sings) 'We'll keep a welcome in the' (three shots) Williams If I Could Walk That Way Jenkin (squeaker) Tiger-drawers Pratt Thompson (sings) 'Raindrops Keep Falling On My Head' Darcy Carter (horn) Pussycat (sings) 'Don't Sleep In The Subway' Barton Mainwaring (hoot, 'whoop') Smith
Re: ffs and utf8
On Sat, Nov 29, 2014 at 02:08:32PM +0200, Ville Valkonen wrote: Hello, On 29 November 2014 at 14:02, frantisek holop min...@obiit.org wrote: i have written for myself a small python3 script that removes accented characters and all utf8 symbols from filenames, a kind of utf-8 to ascii sanitizer. Are you aware of 'detox' package? There's also converters/convmv
Re: ffs and utf8
Ingo Schwarze said: While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 Sorry, but this article is mostly based on a lack of understanding of Unicode. that would directly run contrary to some of OpenBSD's most important project goals: Correctness, simplicity, security. Yes, Unicode is very complex. Just complex enough that there is (to my knowledge) no single application that does it right in every aspect. That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. Unfortunately, there is no realistic hope that NFD will be enforced by every OS and filesystem out there any time soon, so at this stage file names with bytes outside the printable ASCII range will cause problems at some point. On my systems I limit filenames to the [0-9A-Za-z~._/-] range. -- Dmitrij D. Czarkoff
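Czarkoff's allow-list is easy to state as code. A sketch of such a check (his exact character class, but my invented examples; not his actual tooling), which incidentally also rejects homographs, since a Cyrillic lookalike letter falls outside the class:

```python
import re

# Czarkoff's stated policy: filenames limited to [0-9A-Za-z~._/-].
SAFE = re.compile(r"^[0-9A-Za-z~._/-]+$")

assert SAFE.match("reports/plan-2014_v2.txt")        # plain ASCII: ok
assert not SAFE.match("rep\u043erts.txt")            # Cyrillic 'о' homograph
assert not SAFE.match("\u30b8\u30e7\u30a8\u30eb")    # katakana: rejected too
```

The trade-off is visible in the last assertion: the policy that defeats homographs also forbids every legitimate non-ASCII name, which is exactly what the rest of the thread is arguing about.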
Re: ffs and utf8
On Sun, Nov 30, 2014 at 5:48 AM, Dmitrij D. Czarkoff czark...@gmail.com wrote: Ingo Schwarze said: While the article is old, the essence of what Schneier said here still stands, and it is not likely to fall in the future: https://www.schneier.com/crypto-gram-0007.html#9 Sorry, but this article is mostly based on a lack of understanding of Unicode. Sometimes I have found myself wondering whether Bruce Schneier's lack of erudition is studied. At any rate, I've found that, when he says "I see smoke", there is often fire somewhere in the vicinity. that would directly run contrary to some of OpenBSD's most important project goals: Correctness, simplicity, security. Yes, Unicode is very complex. Just complex enough that there is (to my knowledge) no single application that does it right in every aspect. Considering that making a universal character encoding scheme is, in and of itself, a self-contradictory project, they've done moderately well, I think. That said, the standard provides just enough facilities to make filesystem-related aspects of Unicode work nicely, particularly in the case of utf-8. E.g. the ability to enforce NFD for all operations on file names could actually make several things more secure by preventing homograph attacks. I think this assertion is a bit optimistic, and not just given your following caveat. Unfortunately, there is no realistic hope that NFD will be enforced by every OS and filesystem out there any time soon, so at this stage file names with bytes outside the printable ASCII range will cause problems at some point. On my systems I limit filenames to the [0-9A-Za-z~._/-] range. Warning! Rambling ahead: And now I find myself bemused again by my own regular tendency to be confused by the conflation of the file name database with more general purpose database indexes.
Fifteen years ago, I said to someone that the useful life of the current encoding scheme in Unicode was about twenty-five years, and that they/we should be looking for good ways to restructure it. I had trouble then figuring out a way to disentangle the various requirements, and I still don't see a clear way to do it. But I'm inclined to think the original idea of a 16-bit encoding, while not correctly seeing the reality of the actual characters in use, was almost seeing the requirements of the system correctly. I think we need an international encoding that uses a restricted subset of the actual characters in use, and a structure that allows for simpler parsing of the international encoding part. (And from here my thoughts get even less coherent. Sorry for the interruption.) -- Joel Rees Be careful when you look at conspiracy. Look first in your own heart, and ask yourself if you are not your own worst enemy. Arm yourself with knowledge of yourself, as well.
Re: ffs and utf8
On 29.11.2014 at 13:20, frantisek holop wrote: i think i should clarify this a bit: they show perfect in midnight commander, not in shell. $ touch »´ÁÉǑÄ« $ ls ?? # ls | cat will display the characters right. Not entirely sure why, though.