Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Marcus Sundman
Bart Smaalders [EMAIL PROTECTED] wrote:
 Marcus Sundman wrote:
  Bart Smaalders [EMAIL PROTECTED] wrote:
  UTF8 is the answer here.  If you care about anything more than
  simple ascii and you work in more than a single locale/encoding,
  use UTF8. You may not understand the meaning of a filename, but at
  least you'll see the same characters as the person who wrote it.
  
  I think you are a bit confused.
  
  A) If you meant that _I_ should use UTF-8 then that alone won't
  help. Let's say the person who created the file used ISO-8859-1 and
  named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
  displaying the filename my program will be faced with the problem
  of what to do with the second byte, 0xe4, which can't be decoded
  using UTF-8. (häst is 0x68c3a47374 in UTF-8, in case someone
  wonders.)
 
 What I mean is very simple:
 
 The OS has no way of merging your various encodings.  If I create a
 directory, and have people from around the world create a file
 in that directory named after themselves in their own character sets,
 what should I see when I invoke:
 
 % ls -l | less
 
 in that directory?

Either (1) programs can find out what the encoding is, or (2) programs
must assume the encoding is what some environment variable (or
somesuch) is set to.

(1) The OS doesn't have to merge anything, just let the programs
handle any conversions the programs see fit.

(2) The OS must transcode the filenames. If a filename is incompatible
with the target encoding then the offending characters must be escaped.
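
A minimal sketch of option (2), written in Python rather than anything a kernel would run: decode the stored name from a known encoding, re-encode it for the target, and escape whatever the target cannot represent. The function name transcode_name and the choice of backslash escapes are assumptions made for illustration; nothing in this thread specifies an escaping scheme.

```python
def transcode_name(raw: bytes, stored_encoding: str, target_encoding: str) -> bytes:
    """Re-encode a filename, escaping characters the target cannot represent."""
    # Decoding needs the stored encoding; this is exactly the information
    # that a plain byte-string filename does not carry.
    text = raw.decode(stored_encoding)
    # 'backslashreplace' turns unrepresentable characters into \xNN or \uNNNN
    # escapes instead of raising UnicodeEncodeError.
    return text.encode(target_encoding, errors="backslashreplace")


utf8_name = "häst".encode("utf-8")
print(transcode_name(utf8_name, "utf-8", "iso-8859-1"))
print(transcode_name(utf8_name, "utf-8", "ascii"))
```

Note that the first step only works because the stored encoding is passed in explicitly, which is the point under discussion.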


 If you wish to share filenames across locales, I suggest you and
 everyone else writing to that directory use an encoding that will work
 across all those locales.  The encoding that works well for this on
 Unix systems is UTF8, since it leaves '/' and NULL alone.

Again, that won't work. First of all, there is no way to force
programs to use UTF-8. I can't even force my own programs to do that.
(E.g., unrar or unzip or tar or 7z (can't remember which one(s)) just
dump the filename data to the fs in whatever encoding they were inside
the archive, and I have at least one collaboration program that also
does it similarly.) Now, if I force the fs to only accept filenames
compatible with UTF-8 (i.e., utf8only) then I risk losing files. I'd
rather have files with incomprehensible filenames than not have them at
all. OTOH, if I allow filenames incompatible with UTF-8 then my
programs can't necessarily access them if I use UTF-8. I could use some
8bits/char encoding (e.g., iso-8859-15), but I'd rather not, since the
world is going the way of UTF-8 and so I'd just be dragging behind. And
then I would also have problems with garbage-filenames when they use
UTF-8 or some other encoding. Also, I'm quite sure I do have files with
names with characters not in iso-8859-15.

So, you see, there is no way for me to use filenames intelligibly unless
their encodings are knowable. (In fact I'm quite surprised that zfs
doesn't (and even can't) know the encoding(s) of filenames. Usually Sun
seems to make relatively sane design decisions. This, however, is more
what I'd expect from Linux, with its overpragmatic "who cares if it's
sane, as long as it kinda works" attitude.)


Regards,

Marcus
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Joerg Schilling
Marcus Sundman [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] (Joerg Schilling) wrote:
  Marcus Sundman [EMAIL PROTECTED] wrote:
   [EMAIL PROTECTED] (Joerg Schilling) wrote:
    [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
  
   Unicode is not an encoding, but you probably mean the low 8 bits of
   UCS-2 or the first 256 codepoints in Unicode or somesuch.
  
  Unicode _is_ an encoding that uses 21 (IIRC) bits.

 AFAIK you are incorrect. Unicode is a standard that, among other
 things, defines a _number_ for each character. A number does not equal

And I tend to call the relation character -> number an encoding.

As the number may be outside the range of classical characters, which
on most systems live inside octets, there is a need to use another encoding
on top of the Unicode encoding. This second encoding is typically UTF-8 on UNIX.
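
The two layers described above can be illustrated with a short Python sketch (an illustration added for clarity, not something from the original exchange): the first step maps a character to its number, the second maps that number to bytes.

```python
text = "ä"

# Layer 1: character -> number (the Unicode code point).
code_point = ord(text)

# Layer 2: number -> bytes (UTF-8, UTF-16, or a legacy 8-bit encoding).
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")
as_latin1 = text.encode("iso-8859-1")

print(hex(code_point), as_utf8.hex(), as_utf16.hex(), as_latin1.hex())
```

For this character the ISO-8859-1 byte happens to equal the code point, which is the sense in which ISO-8859-1 covers the first 256 code points.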

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Joerg Schilling
Bart Smaalders [EMAIL PROTECTED] wrote:

 The OS has no way of merging your various encodings.  If I create a
 directory, and have people from around the world create a file
 in that directory named after themselves in their own character sets,
 what should I see when I invoke:

 % ls -l | less

 in that directory?

 If you wish to share filenames across locales, I suggest you and
 everyone else writing to that directory use an encoding that will work
 across all those locales.  The encoding that works well for this on
 Unix systems is UTF8, since it leaves '/' and NULL alone.

The problem with this approach is that all users need to change their locale
encoding. Some of them may not be able to do so because they need to log in to
older systems that do not support UTF-8.

We would have had fewer problems if UNICODE had been introduced 10 years
earlier. Because of missing encoding support for their countries, people in
Russia, China, ... created their own encoding schemes in the 1980s that are
still in use.

Jörg



Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Anton B. Rang
 Do you happen to know where programs in (Open)Solaris look when they
 want to know how to encode text to be used in a filename? Is it
 LC_CTYPE?

In general, they don't.  Command-line utilities just use the sequence of
bytes entered by the user.  GUI-based software does as well, but the
encoding used for user input can sometimes be selected

  NFS doesn't provide a mechanism to send the encoding with the
  filename; I don't believe that CIFS does, either.
 
 Really?!? That's insane! How do programs know how to
 encode filenames to be sent over NFS or CIFS?

For NFSv3, you guess.  :-)  It's just stream-of-bytes.

For NFSv4, the encoding used to transmit data is supposed to be UTF-8,
but this isn't enforced by most clients.  What's more, since the encoding
isn't stored, the reverse translation (UTF-8 to local encoding) would have
to be done by the NFS client based on ... something.  Usually the client
just returns the raw bytes and lets the application deal with the mess.

For CIFS, you can send either ASCII (which I believe really means
uninterpreted bytes) or UTF-16.  If you're working in UTF-16, and you're on
Windows, there are two sets of APIs.  The Unicode APIs will return the
proper Unicode names.  The non-Unicode (legacy) APIs will encode the
names according to your system's current code page setting.

-- Anton
 
 
This message posted from opensolaris.org


Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Marcus Sundman
Anton B. Rang [EMAIL PROTECTED] wrote:
  Do you happen to know where programs in (Open)Solaris look when they
  want to know how to encode text to be used in a filename? Is it
  LC_CTYPE?
 
 In general, they don't.  Command-line utilities just use the sequence
 of bytes entered by the user.

Obviously that depends on the application. A command-line utility that
interprets a normal XML file containing filenames knows the characters
but not the bytes. The same goes for command-line utilities that
receive the filenames as text (e.g., some file transfer utility or
daemon).

 GUI-based software does as well, but the encoding used for user input
 can sometimes be selected

Hmm.. I'm usually programming at quite high a level, so I'm not very
familiar with how stuff works under the hood...
If I run xev on my linux box (I don't have X on any (Open)Solaris) and
press the Ä-key on my keyboard it says keycode 48 and keysym 0xe4,
and then XLookupString gives 2 bytes: (c3 a4) ä. Thus at least
XLookupString seems to know that I'm using UTF-8. Where did it (or
whoever converted 0xe4 to 0xc3a4) get the needed info?


- Marcus


Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Boyd Adamson
Marcus Sundman [EMAIL PROTECTED] writes:
 So, you see, there is no way for me to use filenames intelligibly unless
 their encodings are knowable. (In fact I'm quite surprised that zfs
 doesn't (and even can't) know the encoding(s) of filenames. Usually Sun
 seems to make relatively sane design decisions. This, however, is more
 what I'd expect from Linux, with its overpragmatic "who cares if it's
 sane, as long as it kinda works" attitude.)

To be fair, ZFS is constrained by compatibility requirements with
existing systems. For the longest time the only interpretation that Unix
kernels put on the filenames passed by applications was to treat / and
\000 specially. The interfaces provided to applications assume this is
the entire extent of the process. 

Changing this incompatibly is not an option, and adding new interfaces
to support this is meaningless unless there is a critical mass of
applications that use them. It's not reasonable to talk about ZFS
doing this, since it's just a part of the wider ecosystem.

To solve this problem at the moment takes one of two approaches.

1. A userland convention is adopted to decide on what meaning the byte
strings that the kernel provides have.

2. Some new interfaces are created to pass this information into the
kernel and get it back.

Leaving aside the merits of either approach, both of them require
significant agreement from applications to use a certain approach before
they reap any benefits. There's not much ZFS itself can do there.

Boyd


Re: [zfs-discuss] path-name encodings

2008-03-05 Thread Anton B. Rang
  In general, they don't.  Command-line utilities just use the sequence
  of bytes entered by the user.
 
 Obviously that depends on the application. A command-line utility that
 interprets a normal XML file containing filenames knows the characters
 but not the bytes. The same goes for command-line utilities that
 receive the filenames as text (e.g., some file transfer utility or daemon).

It's true that they know the characters, and not necessarily the bytes -- but
all of the tools I'm aware of ignore the characters and simply treat these
as bytes when it comes to making calls into the file system.

 If I run xev on my linux box (I don't have X on any (Open)Solaris) and
 press the Ä-key on my keyboard it says keycode 48 and keysym 0xe4,
 and then XLookupString gives 2 bytes: (c3 a4) ä. Thus at least
 XLookupString seems to know that I'm using UTF-8. Where did it (or
 whoever converted 0xe4 to 0xc3a4) get the needed info?

Depending on what version of xev you've got, there's a good chance it made a 
call to XmbLookupString (the multibyte version of XLookupString). This uses 
the current locale for the encoding; the locale is stored in an environment 
variable which can be queried by the application. (But this has wandered afield 
of file systems -- though it's true that the file system could potentially look 
at environment variables to make encoding choices!)
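
As a rough sketch of where that information comes from, the usual POSIX rule is that LC_ALL overrides LC_CTYPE, which overrides LANG. The helper names below are hypothetical; a real program would normally call setlocale() and then ask for the codeset (nl_langinfo(CODESET) in C, or locale.nl_langinfo in Python) rather than parse the variables itself.

```python
import os


def effective_ctype_locale(env=None) -> str:
    """Return the locale name that governs character encoding (LC_CTYPE)."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # Empty values count as unset.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"  # the default locale when nothing is set


def codeset_of(locale_name: str) -> str:
    """Extract the codeset from a name like 'sv_FI.UTF-8@euro' ('' if absent)."""
    after_dot = locale_name.partition(".")[2]
    return after_dot.partition("@")[0]


print(codeset_of(effective_ctype_locale()))
```

This only tells a program what encoding the current user prefers; it says nothing about the encoding used by whoever created a given file.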
 
 


Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Marcus Sundman
Anton B. Rang [EMAIL PROTECTED] wrote:
  OK, thanks. I still haven't got any answer to my original question,
  though. I.e., is there some way to know what text the
  filename is, or do I have to make a more or less wild guess what
  encoding the program that created the file used?
 
 You have to guess.

Ouch! Guessing sucks. (By the way, that's why I switched to ZFS with its
internal checksums, so that I wouldn't have to guess if my data was OK.)

Thanks for the answer, though.

Do you happen to know where programs in (Open)Solaris look when they
want to know how to encode text to be used in a filename? Is it
LC_CTYPE?

 NFS doesn't provide a mechanism to send the encoding with the
 filename; I don't believe that CIFS does, either.

Really?!? That's insane! How do programs know how to encode filenames
to be sent over NFS or CIFS?

 If you're writing the application, you could store the encoding as an
 extended attribute of the file. This would be useful, for instance,
 for an AFP server.

OK. But then I'd have to hack a similar change into all other programs
that I use, too.

   The trick is that in order to support such things as
   casesensitivity=false for CIFS, the OS needs to know what
   characters are uppercase vs lowercase, which means it needs to
   know about encodings, and reject codepoints which cannot be
   classified as uppercase vs lowercase.
  
  I don't see why the OS would care about that. Isn't that the job of
  the CIFS daemon?
 
 The CIFS daemon can do it, but it would require that the daemon cache
 the whole directory in memory (at least, to get reasonable
 efficiency).

I guess that depends on what file access functions there are for the
file system.

 If you leave it up to the CIFS daemon, you also wind up with problems
 if you have a single sharepoint shared between local users, NFS &
 CIFS -- the NFS client can create two files named "a" and "A", but
 the CIFS client can only see one of those.

Not necessarily. There could be some (nonstandard) way of accessing
such duplicates (e.g., by having the CIFS daemon append [dup-N] or
somesuch to the name). And even if that problem did exist it might still
be OK for CIFS access to have that limitation.


Regards,

Marcus


Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Marcus Sundman
[EMAIL PROTECTED] (Joerg Schilling) wrote:
 [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]

Unicode is not an encoding, but you probably mean the low 8 bits of
UCS-2 or the first 256 codepoints in Unicode or somesuch.


Regards,

Marcus


Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Joerg Schilling
Marcus Sundman [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] (Joerg Schilling) wrote:
  [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]

 Unicode is not an encoding, but you probably mean the low 8 bits of
 UCS-2 or the first 256 codepoints in Unicode or somesuch.

Unicode _is_ an encoding that uses 21 (IIRC) bits.

UCS-2 is a way to _represent_ the low 16 bits of UNICODE in a way that allows
the use of some tricks to go beyond 16 bits. Microsoft, e.g., does not go
beyond 16 bits.

ISO-8859-1 is a representation of the low 8 bits of UNICODE (well, ISO-8859-1
is older than UNICODE ;-). ISO-8859-1 cannot encode more than the
8 least significant bits of UNICODE.



Jörg



Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Joerg Schilling
Bart Smaalders [EMAIL PROTECTED] wrote:

  OK, thanks. I still haven't got any answer to my original question,
  though. I.e., is there some way to know what text the filename is, or
  do I have to make a more or less wild guess what encoding the program
  that created the file used?

 How do you expect the filesystem to know this?  Open(2) takes 3 args;
 none of them have anything to do with the encoding.

A while ago, when discussing things with some filesystem guys, I
proposed introducing a new syscall to inform the kernel about the locale
coding used by a process. If the kernel (or filesystem) then wants to store
file names in a kernel-specific way, and if there is an in-kernel libiconv,
the kernel could convert from/to the userland view. A problem that remains
is a userland coding that may not be able to represent all characters used
inside the kernel view.


 There are two characters not allowed in filenames: NULL and '/'.  Everything
 else is meaning imparted by the user, just like the contents of text
 documents.

Platforms that insist on UTF-8 coding for filenames often disallow octet
codings that are not valid inside a UTF-8 character sequence.


 The OS doesn't care; the user does.  If a user creates a file named
 ?? in his home directory, but my encoding doesn't contain these
 characters, what should ls -l display?  You also assume that knowing
 the encoding will transfer meaning... but a directory containing files
 named ??, ??? and ?? may as well be line noise for most of us.

 The OS doesn't care one whit about language or encodings (save
 the optional upper/lower case accommodation for CIFS).  The OS simply
 stores files under names that don't contain either '/' or NULL.

 UTF8 is the answer here.  If you care about anything more than simple
 ascii and you work in more than a single locale/encoding, use UTF8.
 You may not understand the meaning of a filename, but at least
 you'll see the same characters as the person who wrote it.

UTF-8 may be the answer for many but definitely not all problems.
UTF-8 may cause fewer problems in 5 years (if more people use it by then)
than the problems known with UTF-8 today.

Jörg



Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Marcus Sundman
[EMAIL PROTECTED] (Joerg Schilling) wrote:
 Marcus Sundman [EMAIL PROTECTED] wrote:
  [EMAIL PROTECTED] (Joerg Schilling) wrote:
   [...] ISO-8859-1 (the low 8 bits of UNICODE) [...]
 
  Unicode is not an encoding, but you probably mean the low 8 bits of
  UCS-2 or the first 256 codepoints in Unicode or somesuch.
 
 Unicode _is_ an encoding that uses 21 (IIRC) bits.

AFAIK you are incorrect. Unicode is a standard that, among other
things, defines a _number_ for each character. A number does not equal
21 bits, even if it so happens that the highest codepoint number in the
current version is no more than 21 bits long. Unicode defines (at
least) 3 encodings to represent those characters: UTF-8, UTF-16 and
UTF-32.

Well, it doesn't very much matter exactly how the terms are defined, as
long as everybody knows what's what. So, I'm sorry for nitpicking.


- Marcus


Re: [zfs-discuss] path-name encodings

2008-03-04 Thread Bart Smaalders
Marcus Sundman wrote:
 Bart Smaalders [EMAIL PROTECTED] wrote:
 UTF8 is the answer here.  If you care about anything more than simple
 ascii and you work in more than a single locale/encoding, use UTF8.
 You may not understand the meaning of a filename, but at least
 you'll see the same characters as the person who wrote it.
 
 I think you are a bit confused.
 
 A) If you meant that _I_ should use UTF-8 then that alone won't help.
 Let's say the person who created the file used ISO-8859-1 and named it
 'häst', i.e., 0x68e47374. If I then use UTF-8 when displaying the
 filename my program will be faced with the problem of what to do with
 the second byte, 0xe4, which can't be decoded using UTF-8. (häst is
 0x68c3a47374 in UTF-8, in case someone wonders.)

What I mean is very simple:

The OS has no way of merging your various encodings.  If I create a
directory, and have people from around the world create a file
in that directory named after themselves in their own character sets,
what should I see when I invoke:

% ls -l | less

in that directory?

If you wish to share filenames across locales, I suggest you and
everyone else writing to that directory use an encoding that will work
across all those locales.  The encoding that works well for this on
Unix systems is UTF8, since it leaves '/' and NULL alone.

- Bart




-- 
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.


Re: [zfs-discuss] path-name encodings

2008-02-28 Thread Marcus Sundman
Bart Smaalders [EMAIL PROTECTED] wrote:
  I'm unable to find more info about this. E.g., what does "reject
  file names" mean in practice? E.g., if a program tries to create a
  file using a utf8-incompatible filename, what happens? Does the
  fopen() fail? Would this normally be a problem? E.g., do tar and
  similar programs convert utf8-incompatible filenames to utf8 upon
  extraction if my locale (or wherever the fs encoding is taken from)
  is set to use utf-8? If they don't, then what happens with archives
  containing utf8-incompatible filenames?
 
 
 Note that the normal ZFS behavior is exactly what you'd expect: you
 get the filenames you wanted; the same ones back you put in.

OK, thanks. I still haven't got any answer to my original question,
though. I.e., is there some way to know what text the filename is, or
do I have to make a more or less wild guess what encoding the program
that created the file used?

OK, if I use utf8only then I know that all filenames can be interpreted
as UTF-8. However, that's completely unacceptable for me, since I'd
much rather have an important file with an incomprehensible filename
than not have that important file at all. Also, what about non-UTF-8
encodings? E.g., is it possible to know whether 0xe4 is ä (as in
iso-8859-1) or ф (as in iso-8859-5)?
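
The ambiguity is easy to reproduce. The following Python sketch (the helper name candidate_readings is made up for this illustration) decodes the same single byte under three candidate encodings; nothing in the byte itself says which reading was intended.

```python
def candidate_readings(raw: bytes, encodings):
    """Map each candidate encoding to the decoded text, or None if invalid."""
    readings = {}
    for encoding in encodings:
        try:
            readings[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # The byte sequence is not legal in this encoding at all.
            readings[encoding] = None
    return readings


readings = candidate_readings(b"\xe4", ["iso-8859-1", "iso-8859-5", "utf-8"])
for encoding, text in readings.items():
    print(encoding, repr(text))
```

Only the UTF-8 case can be ruled out mechanically; choosing between the two 8-bit readings needs outside knowledge.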

 The trick is that in order to support such things as
 casesensitivity=false for CIFS, the OS needs to know what characters
 are uppercase vs lowercase, which means it needs to know about
 encodings, and reject codepoints which cannot be classified as
 uppercase vs lowercase.

I don't see why the OS would care about that. Isn't that the job of the
CIFS daemon? As a matter of fact I don't see why the OS would need to
know how to decode any filename-bytes to text. However, I firmly
believe that user applications should have that opportunity. If the
encoding of filenames is not known (explicitly or implicitly) then
applications don't have that opportunity.


- Marcus


Re: [zfs-discuss] path-name encodings

2008-02-28 Thread Anton B. Rang
 OK, thanks. I still haven't got any answer to my original question,
 though. I.e., is there some way to know what text the
 filename is, or do I have to make a more or less wild guess what
 encoding the program that created the file used?

You have to guess.  As far as I know, Apple's HFS (and HFS+) is the only file 
system which stores the encoding along with the filename.  NFS doesn't provide 
a mechanism to send the encoding with the filename; I don't believe that CIFS 
does, either.

If you're writing the application, you could store the encoding as an extended 
attribute of the file. This would be useful, for instance, for an AFP server.

  The trick is that in order to support such things as
  casesensitivity=false for CIFS, the OS needs to know what characters
  are uppercase vs lowercase, which means it needs to know about
  encodings, and reject codepoints which cannot be classified as
  uppercase vs lowercase.
 
 I don't see why the OS would care about that. Isn't that the job of the
 CIFS daemon?

The CIFS daemon can do it, but it would require that the daemon cache the whole 
directory in memory (at least, to get reasonable efficiency). This doesn't work 
so well for large directories. If you leave it up to the CIFS daemon, you also 
wind up with problems if you have a single sharepoint shared between local 
users, NFS & CIFS -- the NFS client can create two files named "a" and "A", but
the CIFS client can only see one of those.

 As a matter of fact I don't see why the OS would need to
 know how to decode any filename-bytes to text.
 However, I firmly believe that user applications should have that
 opportunity. If the encoding of filenames is not known (explicitly or
 implicitly) then applications don't have that opportunity.

Yes -- that's why Apple includes an encoding byte in both HFS and HFS+.  (In 
HFS+, filenames are normalized to 16-bit Unicode, but the encoding is still 
useful in choosing how to recompose the characters, and in providing hints for 
applications which prefer the names in some 8-bit encoding.)

-- Anton
 
 


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread [EMAIL PROTECTED]
Hi Marcus,

Marcus Sundman wrote:
 Are path-names text or raw data in zfs? I.e., is it possible to know
 what the name of a file/dir/whatever is, or do I have to make more or
 less wild guesses what encoding is used where?

 - Marcus
   
I'm not sure what you are asking here.  When a zfs file system is
mounted, it looks like a normal unix file system, i.e., a tree of
files where intermediate nodes are directories and leaf nodes may be
directories or regular files.  In other words, ls gives you the same
kind of output you would expect on any unix file system.  As to
whether a file/directory name is text or binary, that depends
on the name used when creating the file/directory.  As far as the
meta-data used to maintain the file system tree, most of this is
compressed.  But your question makes me wonder if you have tried
zfs. If so, then I really am not sure what you are asking.  If not,
maybe you should try it out...

max



Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Marcus Sundman
[EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Marcus Sundman wrote:
  Are path-names text or raw data in zfs? I.e., is it possible to know
  what the name of a file/dir/whatever is, or do I have to make more
  or less wild guesses what encoding is used where?
 
 I'm not sure what you are asking here.  When a zfs file system is 
 mounted, it looks like a normal unix file system, i.e., a tree of
 files where intermediate nodes are directories and leaf nodes may be
 directories or regular files.  In other words, ls gives you the same 
 kind of output you would expect on any unix file system.  As to
 whether a file/directory name is text or binary, that depends
 on the name used when creating the file/directory.  As far as the 
 meta-data used to maintain the file system tree, most of this is
 compressed.  But your question makes me wonder if you have tried
 zfs. If so, then I really am not sure what you are asking.  If not,
 maybe you should try it out...

I am running it (in nexenta).
Anyway, my question was whether path-names (files, dirs, links, sockets,
etc) are text or raw data.
Fundamentals:
"Raw data" is a list of bits, usually in groups of 8 (i.e., bytes),
and "text" is raw data plus some way of knowing how to convert that
data into characters, forming strings.

Example: When you go to a web-page the webserver sends the bytes of the
page along with a http-header named Content-Type, which tells your
browser how to interpret those bytes.

Example: Some versioning systems, such as svn, are hardcoded to encode
pathnames as UTF-8. So, although the encoding-metadata isn't available
along with the data it is in the specification.

So, once more, is it possible to know the pathnames (as text) on zfs,
or are pathnames just raw data and I (or my programs) have to make more
or less wild guesses about what encoding the user who created the
file/dir/etc. used for its name?

At least on linux it's the latter. IMO it really sucks to not be able
to know the names of files/dirs/etc., because it always leads to
problems. E.g., most (but not all) programs assume filenames should be
encoded according to the current locale (let's say utf-8), so when a
filename with another encoding (let's say iso-8859-15) is encountered
various Evil(tm) things happen, such as not displaying the file(s) at
all (e.g., an image viewer I've used), or replacing filenames with ?,
or replacing parts of filenames with ? and decoding the rest of the
filename with an obviously incorrect encoding (e.g., ls). I've even
seen programs crash when they can't decode a filename.
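
For what it's worth, a program does not have to crash or hide such files. The Python sketch below shows one defensive approach, assuming the program expects UTF-8: undecodable bytes are carried through as escape code points, so the name can still be shown (with a placeholder) and the exact original bytes can be handed back to the file system. The function names are hypothetical; Python's own os.fsdecode and os.fsencode apply the same surrogateescape idea using the configured file system encoding.

```python
def name_for_display(raw: bytes):
    """Return (text, printable) for a filename that may not be valid UTF-8."""
    # 'surrogateescape' maps each undecodable byte to a lone surrogate
    # (U+DC80..U+DCFF), so decoding never fails and is reversible.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Lone surrogates cannot be printed; substitute '?' for display only.
    printable = text.encode("utf-8", errors="replace").decode("utf-8")
    return text, printable


def name_for_syscall(text: str) -> bytes:
    """Recover the exact original bytes from name_for_display's text."""
    return text.encode("utf-8", errors="surrogateescape")


text, printable = name_for_display(b"h\xe4st")   # 'häst' in ISO-8859-15
print(printable, name_for_syscall(text))
```

This keeps the file reachable, but it still does not recover the intended characters; that would require knowing the encoding.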


- Marcus


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Darren J Moffat
See the description of the normalization and utf8only properties in the 
zfs(1) man page.

I think this might help you.


  normalization =none | formD | formKCf

  Indicates whether  the  file  system  should  perform  a
  unicode  normalization  of  file names whenever two file
  names are compared, and  which  normalization  algorithm
  should be used. File names are always stored unmodified,
  names are normalized as part of any comparison  process.

  If  this  property  is  set  to a legal value other than
  none, and the utf8only property  was  left  unspeci-
  fied,  the  utf8only  property is automatically set to
  on.  The default value of the normalization property
  is  none.  This  property  cannot be changed after the
  file system is created.


  utf8only =on | off

  Indicates whether the file  system  should  reject  file
  names  that  include  characters that are not present in
  the UTF-8 character code set. If this property is expli-
  citly  set  to  off,  the  normalization property must
  either not be explicitly set or be set  to  none.  The
  default value for the utf8only property is off. This
  property cannot be changed  after  the  file  system  is
  created.
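
As an illustration of the kind of check utf8only implies (a Python sketch under the assumption that "not present in the UTF-8 character code set" means "not a valid UTF-8 byte sequence"; this is not ZFS code, and the man page excerpt does not say which error a rejected create returns):

```python
def is_valid_utf8(name: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


for candidate in (b"h\xc3\xa4st", b"h\xe4st"):
    print(candidate, is_valid_utf8(candidate))
```

Under that assumption the first name (UTF-8) would be accepted and the second (the same word in ISO-8859-1) would be rejected.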

--
Darren J Moffat


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Marcus Sundman
Darren J Moffat [EMAIL PROTECTED] wrote:
 See the description of the normalization and utf8only properties in
 the zfs(1) man page.
 
 I think this might help you.

   normalization =none | formD | formKCf

That's apparently only for comparisons, so I don't see how it's
relevant.
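
To make "only for comparisons" concrete, here is a small Python sketch of normalize-on-compare. It assumes the man page's formD corresponds to Unicode normalization form D (NFD in Python's unicodedata module); the function name same_name is made up for the example.

```python
import unicodedata


def same_name(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two names after normalizing both; neither input is changed."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "h\u00e4st"    # 'ä' as one code point
decomposed = "ha\u0308st"    # 'a' followed by COMBINING DIAERESIS

print(precomposed == decomposed, same_name(precomposed, decomposed))
```

The two spellings differ byte for byte and would be stored as given, yet they compare equal after normalization. It does not help with identifying the encoding of a name that is not Unicode in the first place.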

   utf8only = on | off
 
   Indicates whether the file system should reject file names that
   include characters that are not present in the UTF-8 character
   code set. If this property is explicitly set to "off", the
   normalization property must either not be explicitly set or be
   set to "none". The default value for the utf8only property is
   "off". This property cannot be changed after the file system is
   created.

I'm unable to find more info about this. What does "reject file names"
mean in practice? If a program tries to create a file using a
UTF-8-incompatible filename, what happens? Does the fopen() fail?
Would this normally be a problem? For example, do tar and similar
programs convert UTF-8-incompatible filenames to UTF-8 upon extraction
if my locale (or wherever the fs encoding is taken from) is set to use
UTF-8? If they don't, then what happens with archives containing
UTF-8-incompatible filenames?


- Marcus


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Bart Smaalders
Marcus Sundman wrote:

 I'm unable to find more info about this. E.g., what does reject file
 names mean in practice? E.g., if a program tries to create a file
 using an utf8-incompatible filename, what happens? Does the fopen()
 fail? Would this normally be a problem? E.g., do tar and similar
 programs convert utf8-incompatible filenames to utf8 upon extraction if
 my locale (or wherever the fs encoding is taken from) is set to use
 utf-8? If they don't, then what happens with archives containing
 utf8-incompatible filenames?


Note that the normal ZFS behavior is exactly what you'd expect: you
get the filenames you wanted; you get back the same ones you put in.
The trick is that in order to support such things as
casesensitivity=insensitive for CIFS, the OS needs to know which
characters are uppercase vs lowercase, which means it needs to know
about encodings, and reject codepoints which cannot be classified as
uppercase vs lowercase.

If you're not running a CIFS server, the defaults will allow you to create
files w/ utf8 names very happily.

: [EMAIL PROTECTED]; cat test
Τη γλώσσα μου έδωσαν ελληνική
: [EMAIL PROTECTED]; cat "`cat test`"
this is a test w/ a utf8 filename
: [EMAIL PROTECTED]; ls -l
total 10
-rw-r--r--   1 barts    staff         37 Oct 22 15:45 Makefile
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 bar
-rw-r--r--   1 barts    staff          0 Oct 22 15:46 foo
-rw-r--r--   1 barts    staff         55 Feb 27 19:45 test
-rw-r--r--   1 barts    staff        301 Feb 27 19:44 test~
-rw-r--r--   1 barts    staff         34 Feb 27 19:46 Τη γλώσσα μου έδωσαν ελληνική
: [EMAIL PROTECTED]; df -h .
Filesystem             size   used  avail capacity  Mounted on
zfs/home               228G   136G    48G    74%    /export/home/cyber
: [EMAIL PROTECTED];
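
The same round trip can be sketched from a program rather than the
shell. This is only an illustration under assumptions: it uses a
temporary directory on whatever file system the machine happens to
have (not necessarily ZFS), and it passes the name as raw bytes so
that no locale-dependent conversion takes place in the program itself.

```python
import os
import tempfile

# The file name as the UTF-8 byte sequence the program hands to the OS.
name = "Τη γλώσσα μου έδωσαν ελληνική".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    directory = os.fsencode(tmp)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(b"this is a test w/ a utf8 filename\n")
    # A bytes argument makes os.listdir() return the names as raw bytes.
    listed = os.listdir(directory)

# True when the file system handed back exactly the bytes it was given.
print(listed == [name])
```

On a file system that stores names unmodified the comparison should
hold; one that rewrites names on creation (some do) would show a
different byte sequence here.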


- Bart


-- 
Bart Smaalders  Solaris Kernel Performance
[EMAIL PROTECTED]   http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Roland Mainz
Bart Smaalders wrote:
 Marcus Sundman wrote:
  I'm unable to find more info about this. E.g., what does reject file
  names mean in practice? E.g., if a program tries to create a file
  using an utf8-incompatible filename, what happens? Does the fopen()
  fail? Would this normally be a problem? E.g., do tar and similar
  programs convert utf8-incompatible filenames to utf8 upon extraction if
  my locale (or wherever the fs encoding is taken from) is set to use
  utf-8? If they don't, then what happens with archives containing
  utf8-incompatible filenames?
 
 Note that the normal ZFS behavior is exactly what you'd expect: you
 get the filenames you wanted; the same ones back you put in.

Does ZFS convert the strings to UTF-8 in this case or will it just store
the multibyte sequence unmodified ?



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Tim Haley
Roland Mainz wrote:
 Bart Smaalders wrote:
 Marcus Sundman wrote:
 I'm unable to find more info about this. E.g., what does reject file
 names mean in practice? E.g., if a program tries to create a file
 using an utf8-incompatible filename, what happens? Does the fopen()
 fail? Would this normally be a problem? E.g., do tar and similar
 programs convert utf8-incompatible filenames to utf8 upon extraction if
 my locale (or wherever the fs encoding is taken from) is set to use
 utf-8? If they don't, then what happens with archives containing
 utf8-incompatible filenames?
 Note that the normal ZFS behavior is exactly what you'd expect: you
 get the filenames you wanted; the same ones back you put in.
 
 Does ZFS convert the strings to UTF-8 in this case or will it just store
 the multibyte sequence unmodified ?
 
ZFS doesn't muck with names it is sent when storing them on-disk.  The 
on-disk name is exactly the sequence of bytes provided to the open(), 
creat(), etc.  If normalization options are chosen, it may do some 
manipulation of the byte strings *when comparing* names, but the on-disk 
name should be untouched from what the user requested.
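
A minimal sketch of the "store unmodified, normalize only when
comparing" idea, for readers who want to see it concretely. It uses
Python's unicodedata module rather than anything from ZFS; the helper
same_name() is hypothetical, and mapping the formD property value to
Unicode NFD is an assumption made for this example.

```python
import unicodedata


def same_name(a: bytes, b: bytes, form: str = "NFD") -> bool:
    """Compare two UTF-8 names after normalizing both to `form`.

    The inputs are left untouched; only temporary copies are normalized.
    """
    ta = unicodedata.normalize(form, a.decode("utf-8"))
    tb = unicodedata.normalize(form, b.decode("utf-8"))
    return ta == tb


precomposed = "h\u00e4st".encode("utf-8")  # ä as a single code point
decomposed = "ha\u0308st".encode("utf-8")  # a followed by combining diaeresis

print(precomposed == decomposed)           # byte-wise comparison
print(same_name(precomposed, decomposed))  # comparison after normalization
```

The two byte strings differ as stored, yet compare equal once both are
normalized, which is the distinction between the on-disk name and the
comparison step described above.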

-tim

 
 
 Bye,
 Roland
 



Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Roland Mainz
Tim Haley wrote:
 Roland Mainz wrote:
  Bart Smaalders wrote:
  Marcus Sundman wrote:
  I'm unable to find more info about this. E.g., what does reject file
  names mean in practice? E.g., if a program tries to create a file
  using an utf8-incompatible filename, what happens? Does the fopen()
  fail? Would this normally be a problem? E.g., do tar and similar
  programs convert utf8-incompatible filenames to utf8 upon extraction if
  my locale (or wherever the fs encoding is taken from) is set to use
  utf-8? If they don't, then what happens with archives containing
  utf8-incompatible filenames?
  Note that the normal ZFS behavior is exactly what you'd expect: you
  get the filenames you wanted; the same ones back you put in.
 
  Does ZFS convert the strings to UTF-8 in this case or will it just store
  the multibyte sequence unmodified ?
 
 ZFS doesn't muck with names it is sent when storing them on-disk.  The
 on-disk name is exactly the sequence of bytes provided to the open(),
 creat(), etc.  If normalization options are chosen, it may do some
 manipulation of the byte strings *when comparing* names, but the on-disk
 name should be untouched from what the user requested.

Ok... that was the part which I was _praying_ for... :-)

... just some background (for those who may be puzzled by the statement
above): The conversion to Unicode is not always lossless (Unicode is
sometimes marketed as
convert-any-encoding-to-unicode-without-losing-any-information) ...
for example if you have a mixed-language ISO-2022 character sequence the
conversion to Unicode will use the language information itself and
converting it back to an ISO-2022 sequence will result in a different
multibyte sequence than the original input (the issue could be
worked around by inserting the language tag characters to preserve
this information, but almost no converter does that (and since
these tags are outside the BMP you have to pray that everything in the
toolchain works with Unicode characters beyond 65535) ... ;-( ).
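
A small sketch of such a round trip, under stated assumptions: it uses
CPython's iso2022_jp_2 codec as a stand-in for "an ISO-2022 converter",
and a single ideograph that exists in both GB 2312 and JIS X 0208. The
input designates the Chinese set; whether the output does is entirely
up to the encoder, since the Unicode string in between carries no
record of which set the character came from.

```python
# Build an ISO-2022 sequence that designates GB 2312 (ESC $ A) for 中.
# GB 2312 code points are sent as 7-bit bytes inside ISO-2022, so the
# high bit of each EUC-style byte is cleared.
gb_7bit = bytes(b & 0x7F for b in "中".encode("gb2312"))
original = b"\x1b$A" + gb_7bit + b"\x1b(B"

# Convert to Unicode and back again with the same codec.
text = original.decode("iso2022_jp_2")
roundtrip = text.encode("iso2022_jp_2")

print(original.hex())
print(roundtrip.hex())
print(roundtrip == original)
```

With this codec I would expect the re-encoded bytes to use a different
designation than the input, i.e. the text survives but the original
multibyte sequence does not.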



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)


Re: [zfs-discuss] path-name encodings

2008-02-27 Thread Roland Mainz
Roland Mainz wrote:
 Tim Haley wrote:
  Roland Mainz wrote:
   Bart Smaalders wrote:
   Marcus Sundman wrote:
   I'm unable to find more info about this. E.g., what does reject file
   names mean in practice? E.g., if a program tries to create a file
   using an utf8-incompatible filename, what happens? Does the fopen()
   fail? Would this normally be a problem? E.g., do tar and similar
   programs convert utf8-incompatible filenames to utf8 upon extraction if
   my locale (or wherever the fs encoding is taken from) is set to use
   utf-8? If they don't, then what happens with archives containing
   utf8-incompatible filenames?
   Note that the normal ZFS behavior is exactly what you'd expect: you
   get the filenames you wanted; the same ones back you put in.
  
   Does ZFS convert the strings to UTF-8 in this case or will it just store
   the multibyte sequence unmodified ?
  
  ZFS doesn't muck with names it is sent when storing them on-disk.  The
  on-disk name is exactly the sequence of bytes provided to the open(),
  creat(), etc.  If normalization options are chosen, it may do some
  manipulation of the byte strings *when comparing* names, but the on-disk
  name should be untouched from what the user requested.
 
 Ok... that was the part which I was _praying_ for... :-)
 
 ... just some background (for those who may be puzzled by the statement
 above): The conversion to Unicode is not always lossless (Unicode is
 sometimes marketed as
 convert-any-encoding-to-unicode-without-losing-any-information) ...
 for example if you have a mixed-language ISO-2022 character sequence the
 conversion to Unicode will use the language information itself 

s/use/lose/ ... sorry...



Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [EMAIL PROTECTED]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)


[zfs-discuss] path-name encodings

2008-02-26 Thread Marcus Sundman
Are path-names text or raw data in zfs? I.e., is it possible to know
what the name of a file/dir/whatever is, or do I have to make more or
less wild guesses what encoding is used where?

- Marcus