Re: [zfs-discuss] path-name encodings
Bart Smaalders [EMAIL PROTECTED] wrote: Marcus Sundman wrote: Bart Smaalders [EMAIL PROTECTED] wrote: UTF8 is the answer here. If you care about anything more than simple ascii and you work in more than a single locale/encoding, use UTF8. You may not understand the meaning of a filename, but at least you'll see the same characters as the person who wrote it. I think you are a bit confused. A) If you meant that _I_ should use UTF-8 then that alone won't help. Let's say the person who created the file used ISO-8859-1 and named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the filename my program will be faced with the problem of what to do with the second byte, 0xe4, which can't be decoded using UTF-8. (häst is 0x68c3a47374 in UTF-8, in case someone wonders.) What I mean is very simple: The OS has no way of merging your various encodings. If I create a directory, and have people from around the world create a file in that directory named after themselves in their own character sets, what should I see when I invoke: % ls -l | less in that directory? Either (1) programs can find out what the encoding is, or (2) programs must assume the encoding is what some environment variable (or somesuch) is set to. (1) The OS doesn't have to merge anything, just let the programs handle any conversions the programs see fit. (2) The OS must transcode the filenames. If a filename is incompatible with the target encoding then the offending characters must be escaped. If you wish to share filenames across locales, I suggest you and everyone else writing to that directory use an encoding that will work across all those locales. The encoding that works well for this on Unix systems is UTF8, since it leaves '/' and NULL alone. Again, that won't work. First of all there is no way to enforce programs to use UTF-8. I can't even force my own programs to do that. 
(E.g., unrar or unzip or tar or 7z (can't remember which one(s)) just dump the filename data to the fs in whatever encoding they were inside the archive, and I have at least one collaboration program that also does it similarly.) Now, if I force the fs to only accept filenames compatible with UTF-8 (i.e., utf8only) then I risk losing files. I'd rather have files with incomprehensible filenames than not have them at all. OTOH, if I allow filenames incompatible with UTF-8 then my programs can't necessarily access them if I use UTF-8. I could use some 8bits/char encoding (e.g., iso-8859-15), but I'd rather not, since the world is going the way of UTF-8 and so I'd just be dragging behind. And then I would also have problems with garbage-filenames when they use UTF-8 or some other encoding. Also, I'm quite sure I do have files with names with characters not in iso-8859-15. So, you see, there is no way for me to use filenames intelligibly unless their encodings are knowable. (In fact I'm quite surprised that zfs doesn't (and even can't) know the encoding(s) of filenames. Usually Sun seems to make relatively sane design decisions. This, however, is more what I'd expect from linux with their overpragmatic who cares if it's sane, as long as it kinda works-attitudes.) Regards, Marcus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
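The byte values quoted in this message can be checked with a short Python sketch. It only illustrates the mismatch Marcus describes; the helper name `try_decode` is made up for this example and is not part of any tool mentioned in the thread.

```python
# The same four-character name, encoded two ways.
name_latin1 = "h\u00e4st".encode("iso-8859-1")   # h, a-umlaut, s, t
name_utf8 = "h\u00e4st".encode("utf-8")

print(name_latin1.hex())  # 68e47374
print(name_utf8.hex())    # 68c3a47374


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# ISO-8859-1 bytes read as UTF-8: 0xe4 starts a multi-byte sequence that
# never completes, so a strict decoder has to give up.
print(try_decode(name_latin1, "utf-8"))

# UTF-8 bytes read as ISO-8859-1: every byte is "valid", so the result is
# silently wrong instead of failing.
print(try_decode(name_utf8, "iso-8859-1"))
```

The second case is arguably the nastier one: nothing fails, the name is just displayed as different characters.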
Re: [zfs-discuss] path-name encodings
Marcus Sundman [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Joerg Schilling) wrote: Marcus Sundman [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Joerg Schilling) wrote:
[...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
Unicode is not an encoding, but you probably mean the low 8 bits of UCS-2 or the first 256 codepoints in Unicode or somesuch.
Unicode _is_ an encoding that uses 21 (IIRC) bits.
AFAIK you are incorrect. Unicode is a standard that, among other things, defines a _number_ for each character. A number does not equal [...]
And I tend to call the character-to-number relation an encoding. As the number may be outside the range of classical characters, which on most systems live inside octets, another encoding is needed on top of the Unicode encoding. This second encoding is typically UTF-8 on UNIX.
Jörg
-- EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED](uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Re: [zfs-discuss] path-name encodings
Bart Smaalders [EMAIL PROTECTED] wrote:
The OS has no way of merging your various encodings. If I create a directory, and have people from around the world create a file in that directory named after themselves in their own character sets, what should I see when I invoke: % ls -l | less in that directory? If you wish to share filenames across locales, I suggest you and everyone else writing to that directory use an encoding that will work across all those locales. The encoding that works well for this on Unix systems is UTF8, since it leaves '/' and NULL alone.
The problem with this approach is that all users need to change their locale encoding. Some of them may not be able to do so because they need to log in to older systems that do not support UTF-8. We would have had fewer problems if UNICODE had been introduced 10 years earlier. Because encoding support for their countries was missing, people in Russia, China, ... created their own encoding schemes in the 1980s, and these are still in use.
Jörg
Re: [zfs-discuss] path-name encodings
Do you happen to know where programs in (Open)Solaris look when they want to know how to encode text to be used in a filename? Is it LC_CTYPE? In general, they don't. Command-line utilities just use the sequence of bytes entered by the user. GUI-based software does as well, but the encoding used for user input can sometimes be selected NFS doesn't provide a mechanism to send the encoding with the filename; I don't believe that CIFS does, either. Really?!? That's insane! How do programs know how to encode filenames to be sent over NFS or CIFS? For NFSv3, you guess. :-) It's just stream-of-bytes. For NFSv4, the encoding used to transmit data is supposed to be UTF-8, but this isn't enforced by most clients. What's more, since the encoding isn't stored, the reverse translation (UTF-8 to local encoding) would have to be done by the NFS client based on ... something. Usually this is just return the raw bytes and let the application deal with the mess. For CIFS, you can send either ASCII (which I believe really means uninterpreted bytes) or UTF-16. If you're working in UTF-16, and you're on Windows, there are two sets of APIs. The Unicode APIs will return the proper Unicode names. The non-Unicode (legacy) APIs will encode the names according to your system's current code page setting. -- Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] path-name encodings
Anton B. Rang [EMAIL PROTECTED] wrote:
Do you happen to know where programs in (Open)Solaris look when they want to know how to encode text to be used in a filename? Is it LC_CTYPE?
In general, they don't. Command-line utilities just use the sequence of bytes entered by the user.
Obviously that depends on the application. A command-line utility that interprets a normal XML file containing filenames knows the characters but not the bytes. The same goes for command-line utilities that receive the filenames as text (e.g., some file transfer utility or daemon).
GUI-based software does as well, but the encoding used for user input can sometimes be selected
Hmm... I'm usually programming at quite a high level, so I'm not very familiar with how stuff works under the hood. If I run xev on my linux box (I don't have X on any (Open)Solaris) and press the Ä-key on my keyboard it says keycode 48 and keysym 0xe4, and then XLookupString gives 2 bytes: (c3 a4) ä. Thus at least XLookupString seems to know that I'm using UTF-8. Where did it (or whoever converted 0xe4 to 0xc3a4) get the needed info?
- Marcus
Re: [zfs-discuss] path-name encodings
Marcus Sundman [EMAIL PROTECTED] writes: So, you see, there is no way for me to use filenames intelligibly unless their encodings are knowable. (In fact I'm quite surprised that zfs doesn't (and even can't) know the encoding(s) of filenames. Usually Sun seems to make relatively sane design decisions. This, however, is more what I'd expect from linux with their overpragmatic who cares if it's sane, as long as it kinda works-attitudes.) To be fair, ZFS is constrained by compatibility requirements with existing systems. For the longest time the only interpretation that Unix kernels put on the filenames passed by applications was to treat / and \000 specially. The interfaces provided to applications assume this is the entire extent of the process. Changing this incompatibly is not an option, and adding new interfaces to support this is meaningless unless there is a critical mass of applications that use them. It's not reasonable to talk about ZFS doing this, since it's just a part of the wider ecosystem. To solve this problem at the moment takes one of two approaches. 1. A userland convention is adopted to decide on what meaning the byte strings that the kernel provides have. 2. Some new interfaces are created to pass this information into the kernel and get it back. Leaving aside the merits of either approach, both of them require significant agreement from applications to use a certain approach before they reap any benefits. There's not much ZFS itself can do there. Boyd ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] path-name encodings
In general, they don't. Command-line utilities just use the sequence of bytes entered by the user. Obviously that depends on the application. A command-line utility that interprets an normal xml file containing filenames know the characters but not the bytes. The same goes for command-line utilities that receive the filenames as text (e.g., some file transfer utility or daemon). It's true that they know the characters, and not necessarily the bytes -- but all of the tools I'm aware of ignore the characters and simply treat these as bytes when it comes to making calls into the file system. If I run xev on my linux box (I don't have X on any (Open)Solaris) and press the Ä-key on my keyboard it says keycode 48 and keysym 0xe4, and then XLookupString gives 2 bytes: (c3 a4) ä. Thus at least XLookupString seems to know that I'm using UTF-8. Where did it (or whoever converted 0xe4 to 0xc3a4) get the needed info? Depending on what version of xev you've got, there's a good chance it made a call to XmbLookupString (the multibyte version of XLookupString). This uses the current locale for the encoding; the locale is stored in an environment variable which can be queried by the application. (But this has wandered afield of file systems -- though it's true that the file system could potentially look at environment variables to make encoding choices!) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
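As a rough illustration of "the locale is stored in an environment variable which can be queried by the application", here is a minimal Python sketch. The function names are invented for this example. The lookup order LC_ALL, then LC_CTYPE, then LANG is the usual POSIX precedence; what a given X library actually consults is not stated in the thread.

```python
import locale
import os


def effective_ctype_variable(env=os.environ):
    """Return (variable, value) for the setting that decides LC_CTYPE."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        if env.get(var):
            return var, env[var]
    return None, None  # nothing set: the "C"/POSIX locale applies


def current_codeset():
    """Ask the C library which character encoding the active locale implies."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
    except locale.Error:
        pass  # locale name not installed on this system; keep the default
    return locale.getpreferredencoding(False)


print(effective_ctype_variable())
print(current_codeset())
```

Note that this tells a program what encoding the current user prefers, not what encoding an existing filename was written in, which is the gap the thread keeps returning to.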
Re: [zfs-discuss] path-name encodings
Anton B. Rang [EMAIL PROTECTED] wrote: OK, thanks. I still haven't got any answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess what encoding the program that created the file used? You have to guess. Ouch! Guessing sucks. (By the way, that's why I switched to ZFS with its internal checksums, so that I wouldn't have to guess if my data was OK.) Thanks for the answer, though. Do you happen to know where programs in (Open)Solaris look when they want to know how to encode text to be used in a filename? Is it LC_CTYPE? NFS doesn't provide a mechanism to send the encoding with the filename; I don't believe that CIFS does, either. Really?!? That's insane! How do programs know how to encode filenames to be sent over NFS or CIFS? If you're writing the application, you could store the encoding as an extended attribute of the file. This would be useful, for instance, for an AFP server. OK. But then I'd have to hack a similar change into all other programs that I use, too. The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase. I don't see why the OS would care about that. Isn't that the job of the CIFS daemon? The CIFS daemon can do it, but it would require that the daemon cache the whole directory in memory (at least, to get reasonable efficiency). I guess that depends on what file access functions there are for the file system. If you leave it up to the CIFS daemon, you also wind up with problems if you have a single sharepoint shared between local users, NFS CIFS -- the NFS client can create two files named a and A, but the CIFS client can only see one of those. Not necessarily. 
There could be some (nonstandard) way of accessing such duplicates (e.g., by having the CIFS daemon append [dup-N] or somesuch to the name). And even if that problem did exist it might still be OK for CIFS access to have that limitation. Regards, Marcus
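The "[dup-N]" idea can be sketched in a few lines of Python. This is purely hypothetical: no CIFS server is claimed to work this way, and `case_insensitive_view` is a name invented here. It assumes names are already decoded text, which is exactly what the rest of the thread says cannot be taken for granted.

```python
from collections import defaultdict


def case_insensitive_view(names):
    """Map the names a case-insensitive client would see to the real names.

    Names that differ only by case are grouped; the first of each group
    keeps its name, the others get a " [dup-N]" suffix.
    """
    groups = defaultdict(list)
    for name in sorted(names):
        groups[name.casefold()].append(name)

    view = {}
    for members in groups.values():
        view[members[0]] = members[0]
        for n, extra in enumerate(members[1:], start=1):
            view[f"{extra} [dup-{n}]"] = extra
    return view


# An NFS client created both "a" and "A" in the same directory.
print(case_insensitive_view(["a", "A", "b"]))
```

A real implementation would also have to handle a file that is literally named "a [dup-1]", and would have to build the groups without holding the whole directory in memory, which is Anton's objection.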
Re: [zfs-discuss] path-name encodings
[EMAIL PROTECTED] (Joerg Schilling) wrote: [...] ISO-8859-1 (the low 8 bits of UNICODE) [...] Unicode is not an encoding, but you probably mean the low 8 bits of UCS-2 or the first 256 codepoints in Unicode or somesuch. Regards, Marcus
Re: [zfs-discuss] path-name encodings
Marcus Sundman [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Joerg Schilling) wrote:
[...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
Unicode is not an encoding, but you probably mean the low 8 bits of UCS-2 or the first 256 codepoints in Unicode or somesuch.
Unicode _is_ an encoding that uses 21 (IIRC) bits. UCS-2 is a way to _represent_ the low 16 bits of UNICODE in a way that allows some tricks to go beyond 16 bits. Microsoft, e.g., does not go beyond 16 bits. ISO-8859-1 is a representation of the low 8 bits of UNICODE (well, ISO-8859-1 is older than UNICODE ;-). ISO-8859-1 cannot encode more than the 8 least significant bits of Unicode.
Jörg
Re: [zfs-discuss] path-name encodings
Bart Smaalders [EMAIL PROTECTED] wrote:
OK, thanks. I still haven't got any answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess what encoding the program that created the file used?
How do you expect the filesystem to know this? open(2) takes 3 args; none of them have anything to do with the encoding.
A while ago, when discussing things with some filesystem guys, I proposed introducing a new syscall to inform the kernel about the locale encoding used by a process. If the kernel (or filesystem) then wants to store file names in a kernel-specific way, and if there is an in-kernel libiconv, the kernel could convert from/to the userland view. A problem that remains is a userland encoding that cannot represent all characters used in the kernel view.
There are two characters not allowed in filenames: NULL and '/'. Everything else is meaning imparted by the user, just like the contents of text documents.
Platforms that insist on UTF-8 encoding for filenames often disallow octet sequences that are not valid inside a UTF-8 character sequence.
The OS doesn't care; the user does. If a user creates a file named ?? in his home directory, but my encoding doesn't contain these characters, what should ls -l display? You also assume that knowing the encoding will transfer meaning... but a directory containing files named ??, ??? and ?? may as well be line noise for most of us. The OS doesn't care one whit about language or encodings (save the optional upper/lower case accommodation for CIFS). The OS simply stores files under names that don't contain either '/' or NULL. UTF8 is the answer here. If you care about anything more than simple ascii and you work in more than a single locale/encoding, use UTF8. You may not understand the meaning of a filename, but at least you'll see the same characters as the person who wrote it.
UTF-8 may be the answer for many but definitely not all problems. UTF-8 may cause fewer problems in 5 years (if more people use it by then) than it is known to cause today. Jörg
Re: [zfs-discuss] path-name encodings
[EMAIL PROTECTED] (Joerg Schilling) wrote: Marcus Sundman [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Joerg Schilling) wrote: [...] ISO-8859-1 (the low 8 bits of UNOICODE) [...] Unicode is not an encoding, but you probably mean the low 8 bits of UCS-2 or the first 256 codepoints in Unicode or somesuch. Unicode _is_ an encoding that uses 21 (IIRC) bits. AFAIK you are incorrect. Unicode is a standard that, among other things, defines a _number_ for each character. A number does not equal 21 bits, even if it so happens that the highest codepoint number in the current version is no more than 21 bits long. Unicode defines (at least) 3 encodings to represent those characters: UTF-8, UTF-16 and UTF-32. Well, it doesn't very much matter exactly how the terms are defined, as long as everybody knows what's what. So, I'm sorry for nitpicking. - Marcus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
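The distinction argued over here (a code point is a number; UTF-8, UTF-16 and UTF-32 are ways to serialize that number as bytes) can be seen directly in Python. This is only an illustration of the terminology; the helper name `encodings_of` is invented for the example.

```python
def encodings_of(ch):
    """Return the code point of one character and its bytes in several encodings."""
    result = {"code point": hex(ord(ch))}
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        result[enc] = ch.encode(enc).hex()
    return result


# U+00E4 fits in 8 bits, so ISO-8859-1 can also represent it, as the single byte e4.
print(encodings_of("\u00e4"))
print("\u00e4".encode("iso-8859-1").hex())

# A code point above 16 bits: UTF-16 needs a surrogate pair, UTF-8 needs 4 bytes.
print(encodings_of("\U0001F600"))
```

The same number comes out as one, two, or four bytes depending on the encoding, which is why "the low 8 bits of Unicode" only coincides with ISO-8859-1 at the code point level, not at the byte level of UTF-8.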
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote: Bart Smaalders [EMAIL PROTECTED] wrote: UTF8 is the answer here. If you care about anything more than simple ascii and you work in more than a single locale/encoding, use UTF8. You may not understand the meaning of a filename, but at least you'll see the same characters as the person who wrote it. I think you are a bit confused. A) If you meant that _I_ should use UTF-8 then that alone won't help. Let's say the person who created the file used ISO-8859-1 and named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the filename my program will be faced with the problem of what to do with the second byte, 0xe4, which can't be decoded using UTF-8. (häst is 0x68c3a47374 in UTF-8, in case someone wonders.) What I mean is very simple: The OS has no way of merging your various encodings. If I create a directory, and have people from around the world create a file in that directory named after themselves in their own character sets, what should I see when I invoke: % ls -l | less in that directory? If you wish to share filenames across locales, I suggest you and everyone else writing to that directory use an encoding that will work across all those locales. The encoding that works well for this on Unix systems is UTF8, since it leaves '/' and NULL alone. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts You will contribute more with mercurial than with thunderbird. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] path-name encodings
Bart Smaalders [EMAIL PROTECTED] wrote: I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames? Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in. OK, thanks. I still haven't got any answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess what encoding the program that created the file used? OK, if I use utf8only then I know that all filenames can be interpreted as UTF-8. However, that's completely unacceptable for me, since I'd much rather have an important file with an incomprehensible filename than not have that important file at all. Also, what about non-UTF-8 encodings? E.g., is it possible to know whether 0xe4 is ä (as in iso-8859-1) or ф (as in iso-8859-5)? The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase. I don't see why the OS would care about that. Isn't that the job of the CIFS daemon? As a matter of fact I don't see why the OS would need to know how to decode any filename-bytes to text. However, I firmly believe that user applications should have that opportunity. If the encoding of filenames is not known (explicitly or implicitly) then applications don't have that opportunity. 
- Marcus
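The 0xe4 question can be made concrete with a small Python sketch: the same byte decodes to different letters, and nothing in the byte itself says which reading was intended. The helper name `interpretations` is made up for this example.

```python
def interpretations(raw, encodings):
    """Decode the same bytes under several candidate encodings.

    Returns {encoding: text}, with None where the bytes are invalid.
    """
    out = {}
    for enc in encodings:
        try:
            out[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            out[enc] = None
    return out


candidates = ["iso-8859-1", "iso-8859-5", "iso-8859-15", "utf-8"]
for enc, text in interpretations(b"\xe4", candidates).items():
    print(enc, repr(text))
```

Only the UTF-8 reading can be ruled out mechanically. Choosing between the 8-bit encodings needs outside knowledge, such as who created the file or a statistical guess over a longer name.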
Re: [zfs-discuss] path-name encodings
OK, thanks. I still haven't got any answer to my original question, though. I.e., is there some way to know what text the filename is, or do I have to make a more or less wild guess what encoding the program that created the file used?
You have to guess. As far as I know, Apple's HFS (and HFS+) is the only file system which stores the encoding along with the filename. NFS doesn't provide a mechanism to send the encoding with the filename; I don't believe that CIFS does, either. If you're writing the application, you could store the encoding as an extended attribute of the file. This would be useful, for instance, for an AFP server.
The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase.
I don't see why the OS would care about that. Isn't that the job of the CIFS daemon?
The CIFS daemon can do it, but it would require that the daemon cache the whole directory in memory (at least, to get reasonable efficiency). This doesn't work so well for large directories. If you leave it up to the CIFS daemon, you also wind up with problems if you have a single sharepoint shared between local users, NFS and CIFS -- the NFS client can create two files named a and A, but the CIFS client can only see one of those.
As a matter of fact I don't see why the OS would need to know how to decode any filename-bytes to text. However, I firmly believe that user applications should have that opportunity. If the encoding of filenames is not known (explicitly or implicitly) then applications don't have that opportunity.
Yes -- that's why Apple includes an encoding byte in both HFS and HFS+.
(In HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still useful in choosing how to recompose the characters, and in providing hints for applications which prefer the names in some 8-bit encoding.) -- Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] path-name encodings
Hi Marcus, Marcus Sundman wrote: Are path-names text or raw data in zfs? I.e., is it possible to know what the name of a file/dir/whatever is, or do I have to make more or less wild guesses what encoding is used where? - Marcus I'm not sure what you are asking here. When a zfs file system is mounted, it looks like a normal unix file system, i.e., a tree of files where intermediate nodes are directories and leaf nodes may be directories or regular files. In other words, ls gives you the same kind of output you would expect on any unix file system. As to whether a file/directory name is text or binary, that depends on the name used when creating the file/directory. As far as the meta-data used to maintain the file system tree, most of this is compressed. But your question makes me wonder if you have tried zfs. If so, then I really am not sure what you are asking. If not, maybe you should try it out... max ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] path-name encodings
[EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Marcus Sundman wrote: Are path-names text or raw data in zfs? I.e., is it possible to know what the name of a file/dir/whatever is, or do I have to make more or less wild guesses what encoding is used where? I'm not sure what you are asking here. When a zfs file system is mounted, it looks like a normal unix file system, i.e., a tree of files where intermediate nodes are directories and leaf nodes may be directories or regular files. In other words, ls gives you the same kind of output you would expect on any unix file system. As to whether a file/directory name is text or binary, that depends on the name used when creating the file/directory. As far as the meta-data used to maintain the file system tree, most of this is compressed. But your question makes me wonder if you have tried zfs. If so, then I really am not sure what you are asking. If not, maybe you should try it out... I am running it (in nexenta). Anyway, my question was whether path-names (files, dirs, links, sockets, etc) are text or raw data. Fundamentals: raw data is a list of bits, usually in groups of 8 (i.e., bytes), and text is raw data + some way of knowing how to convert that data into characters, forming strings. Example: When you go to a web-page the webserver sends the bytes of the page along with a http-header named Content-Type, which tells your browser how to interpret those bytes. Example: Some versioning systems, such as svn, are hardcoded to encode pathnames as UTF-8. So, although the encoding-metadata isn't available along with the data it is in the specification. So, once more, is it possible to know the pathnames (as text) on zfs, or are pathnames just raw data and I (or my programs) have to make more or less wild guesses about what encoding the user who created the file/dir/etc. used for its name? At least on linux it's the latter. 
IMO it really sucks to not be able to know the names of files/dirs/etc., because it always leads to problems. E.g., most (but not all) programs assume filenames should be encoded according to the current locale (let's say utf-8), so when a filename with another encoding (let's say iso-8859-15) is encountered various Evil(tm) things happen, such as not displaying the file(s) at all (e.g., an image viewer I've used), or replacing filenames with ?, or replacing parts of filenames with ? and decoding the rest of the filename with an obviously incorrect encoding (e.g., ls). I've even seen programs crash when they can't decode a filename. - Marcus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
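One way a program can avoid the failures listed here (hiding the file, replacing characters irreversibly, or crashing) is to decode with an error handler that keeps undecodable bytes recoverable. The sketch below uses Python's `surrogateescape` and `backslashreplace` handlers; it shows one possible technique, not what ls or any viewer mentioned in the thread does, and `to_text`, `to_bytes` and `for_display` are names invented for the example.

```python
def to_text(raw_name):
    """Decode filename bytes as UTF-8 without ever failing.

    Bytes that are not valid UTF-8 are smuggled through as lone surrogates.
    """
    return raw_name.decode("utf-8", errors="surrogateescape")


def to_bytes(text_name):
    """Exact inverse of to_text: recovers the original bytes."""
    return text_name.encode("utf-8", errors="surrogateescape")


def for_display(text_name):
    """Render a name for the user, showing smuggled bytes as visible escapes."""
    return text_name.encode("utf-8", errors="backslashreplace").decode("utf-8")


raw = b"h\xe4st"               # ISO-8859-1 bytes seen in a UTF-8 locale
text = to_text(raw)
print(for_display(text))       # h\udce4st
print(to_bytes(text) == raw)   # True
```

The name is still not understood, but the file stays reachable because the original bytes survive the round trip. On POSIX systems Python's own `os.fsdecode` and `os.fsencode` follow the same approach.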
Re: [zfs-discuss] path-name encodings
See the description of the normalization and utf8only properties in the zfs(1) man page. I think this might help you.
normalization = none | formD | formKCf
Indicates whether the file system should perform a unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. File names are always stored unmodified, names are normalized as part of any comparison process. If this property is set to a legal value other than none, and the utf8only property was left unspecified, the utf8only property is automatically set to on. The default value of the normalization property is none. This property cannot be changed after the file system is created.
utf8only = on | off
Indicates whether the file system should reject file names that include characters that are not present in the UTF-8 character code set. If this property is explicitly set to off, the normalization property must either not be explicitly set or be set to none. The default value for the utf8only property is off. This property cannot be changed after the file system is created.
-- Darren J Moffat
Re: [zfs-discuss] path-name encodings
Darren J Moffat [EMAIL PROTECTED] wrote:
See the description of the normalization and utf8only properties in the zfs(1) man page. I think this might help you. normalization = none | formD | formKCf
That's apparently only for comparisons, so I don't see how it's relevant.
utf8only = on | off Indicates whether the file system should reject file names that include characters that are not present in the UTF-8 character code set. If this property is explicitly set to off, the normalization property must either not be explicitly set or be set to none. The default value for the utf8only property is off. This property cannot be changed after the file system is created.
I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?
- Marcus
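The thread does not say how a rejected name is reported to the caller, so the sketch below does not try to model that. It only shows the kind of check a program (for example an archive extractor) could run itself before creating a file on a utf8only file system. `is_valid_utf8` is a name invented for this example, and Python's strict decoder stands in for whatever validation ZFS performs, which may differ in detail.

```python
def is_valid_utf8(name):
    """Return True if the filename bytes form a well-formed UTF-8 sequence."""
    try:
        name.decode("utf-8")  # strict: raises on any malformed sequence
        return True
    except UnicodeDecodeError:
        return False


for candidate in (b"h\xc3\xa4st", b"h\xe4st", b"plain-ascii.txt"):
    verdict = "ok" if is_valid_utf8(candidate) else "would need renaming or transcoding"
    print(candidate, verdict)
```

A tool that finds an invalid name still has to decide what to do with it (transcode from a guessed encoding, escape the bytes, or skip the file), which is the trade-off Marcus is worried about.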
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote:
I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?
Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in. The trick is that in order to support such things as casesensitivity=false for CIFS, the OS needs to know what characters are uppercase vs lowercase, which means it needs to know about encodings, and reject codepoints which cannot be classified as uppercase vs lowercase. If you're not running a CIFS server, the defaults will allow you to create files w/ utf8 names very happily.
: [EMAIL PROTECTED]; cat test
Τη γλώσσα μου έδωσαν ελληνική
: [EMAIL PROTECTED]; cat `cat test`
this is a test w/ a utf8 filename
: [EMAIL PROTECTED]; ls -l
total 10
-rw-r--r-- 1 barts staff  37 Oct 22 15:45 Makefile
-rw-r--r-- 1 barts staff   0 Oct 22 15:46 bar
-rw-r--r-- 1 barts staff   0 Oct 22 15:46 foo
-rw-r--r-- 1 barts staff  55 Feb 27 19:45 test
-rw-r--r-- 1 barts staff 301 Feb 27 19:44 test~
-rw-r--r-- 1 barts staff  34 Feb 27 19:46 Τη γλώσσα μου έδωσαν ελληνική
: [EMAIL PROTECTED]; df -h .
Filesystem  size  used  avail  capacity  Mounted on
zfs/home    228G  136G    48G       74%  /export/home/cyber
: [EMAIL PROTECTED];
- Bart
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote: I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

Bart Smaalders wrote: Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in.

Does ZFS convert the strings to UTF-8 in this case, or will it just store the multibyte sequence unmodified?

Bye,
Roland

--
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote: I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

Bart Smaalders wrote: Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in.

Roland Mainz wrote: Does ZFS convert the strings to UTF-8 in this case, or will it just store the multibyte sequence unmodified?

ZFS doesn't muck with names it is sent when storing them on-disk. The on-disk name is exactly the sequence of bytes provided to the open(), creat(), etc. If normalization options are chosen, it may do some manipulation of the byte strings *when comparing* names, but the on-disk name should be untouched from what the user requested.

-tim
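The distinction Tim draws, storing the bytes untouched while normalizing only for comparison, can be sketched as follows. This is an illustration of the general technique, not ZFS code: the function names_match is hypothetical, and treating the formD property value as Unicode normalization form NFD is an assumption on my part.

```python
import unicodedata


def names_match(a: bytes, b: bytes, form: str = "NFD") -> bool:
    """Compare two stored names, normalizing only for the comparison."""
    if a == b:
        return True
    try:
        text_a = a.decode("utf-8")
        text_b = b.decode("utf-8")
    except UnicodeDecodeError:
        # Names that are not UTF-8 cannot be normalized; fall back to byte equality.
        return False
    return unicodedata.normalize(form, text_a) == unicodedata.normalize(form, text_b)


composed = "h\u00e4st".encode("utf-8")      # 'ä' as one precomposed code point
decomposed = "ha\u0308st".encode("utf-8")   # 'a' followed by a combining diaeresis

print(composed == decomposed)               # the stored bytes differ
print(names_match(composed, decomposed))    # but they compare equal after normalization
```

Neither input is modified, which mirrors the statement that the on-disk name stays exactly as the caller supplied it.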
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote: I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

Bart Smaalders wrote: Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in.

Roland Mainz wrote: Does ZFS convert the strings to UTF-8 in this case, or will it just store the multibyte sequence unmodified?

Tim Haley wrote: ZFS doesn't muck with names it is sent when storing them on-disk. The on-disk name is exactly the sequence of bytes provided to the open(), creat(), etc. If normalization options are chosen, it may do some manipulation of the byte strings *when comparing* names, but the on-disk name should be untouched from what the user requested.

Ok... that was the part which I was _praying_ for... :-)

... just some background (for those who may be puzzled by the statement above): The conversion to Unicode is not always lossless (Unicode is sometimes marketed as convert-any-encoding-to-unicode-without-losing-any-information). For example, if you have a mixed-language ISO-2022 character sequence, the conversion to Unicode will use the language information itself, and converting it back to an ISO-2022 sequence will result in a different multibyte sequence than the original input. The issue could be worked around by inserting the language tag characters to preserve this information, but almost no converter does that (and since these tags are outside the BMP you have to pray that everything in the toolchain works with Unicode characters beyond 65535) ... ;-(

Bye,
Roland
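Roland's point about ISO-2022 round trips (read together with his follow-up correction that the conversion loses the language information) can be reproduced with Python's standard iso2022_jp_2 codec. The sketch below assumes that codec's behaviour and uses one Han character that exists in both GB 2312 and JIS X 0208. The byte values are my own choice for illustration, and other converters may behave differently.

```python
CODEC = "iso2022_jp_2"

# ESC $ A designates GB 2312 (Chinese); "R;" is the 7-bit code for U+4E00 in that set.
# ESC ( B switches back to ASCII.
original = b"\x1b$AR;\x1b(B"

text = original.decode(CODEC)        # the designation (the "language") is dropped here
round_trip = text.encode(CODEC)      # the encoder is free to pick a different charset

print(repr(text))
print(round_trip == original)        # expected to differ if the encoder prefers JIS X 0208
print(round_trip.decode(CODEC) == text)

# The language tag characters mentioned above live outside the BMP,
# so UTF-16 needs a surrogate pair (4 bytes) for each of them.
language_tag = "\U000E0001"
print(ord(language_tag) > 0xFFFF, len(language_tag.encode("utf-16-be")))
```

The decoded text is identical either way, which is exactly why a file system that converted names to Unicode and back could hand a program a different byte sequence than the one it stored.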
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote: I'm unable to find more info about this. E.g., what does reject file names mean in practice? E.g., if a program tries to create a file using an utf8-incompatible filename, what happens? Does the fopen() fail? Would this normally be a problem? E.g., do tar and similar programs convert utf8-incompatible filenames to utf8 upon extraction if my locale (or wherever the fs encoding is taken from) is set to use utf-8? If they don't, then what happens with archives containing utf8-incompatible filenames?

Bart Smaalders wrote: Note that the normal ZFS behavior is exactly what you'd expect: you get the filenames you wanted; the same ones back you put in.

Roland Mainz wrote: Does ZFS convert the strings to UTF-8 in this case, or will it just store the multibyte sequence unmodified?

Tim Haley wrote: ZFS doesn't muck with names it is sent when storing them on-disk. The on-disk name is exactly the sequence of bytes provided to the open(), creat(), etc. If normalization options are chosen, it may do some manipulation of the byte strings *when comparing* names, but the on-disk name should be untouched from what the user requested.

Roland Mainz wrote: Ok... that was the part which I was _praying_ for... :-) ... just some background (for those who may be puzzled by the statement above): The conversion to Unicode is not always lossless (Unicode is sometimes marketed as convert-any-encoding-to-unicode-without-losing-any-information) ... for example if you have a mixed-language ISO-2022 character sequence the conversion to Unicode will use the language information itself

s/use/lose/ ... sorry...

Bye,
Roland
[zfs-discuss] path-name encodings
Are path-names text or raw data in zfs? I.e., is it possible to know what the name of a file/dir/whatever is, or do I have to make more or less wild guesses what encoding is used where?

- Marcus