Re: Fw: Unicode filename problems

Philippe Verdy Tue, 03 Jun 2003 22:23:46 -0700

Note the following ambiguity in the ZIP file format specification:
[QUOTE]
file name: (Variable) 
The name of the file, with optional relative path. The path stored should not contain 
a drive or device letter, or a leading slash. All slashes should be forward slashes 
'/' as opposed to backwards slashes '\' for compatibility with Amiga and UNIX file 
systems, etc. If input came from standard input, there is no file name field.
[/QUOTE]
There's no clear indication of the encoding used for the filenames in ZIP files. So if 
the "Created by" field is 0, a reader assumes DOS semantics: there is no support for 
UTF-8, the exact codepage is still ambiguous, and the name will probably be displayed 
with the local codepage of the system on which the ZIP file is read.
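
A minimal sketch of this ambiguity, in Python. The helper name decode_candidates, the 
sample filename and the list of codepages are illustrative assumptions, not something 
taken from the ZIP specification; the point is only that the same stored bytes give 
different names depending on the codepage the reader happens to use.

```python
def decode_candidates(raw: bytes, codepages=("cp437", "cp850", "cp1252")) -> dict:
    """Decode the same filename bytes under several candidate codepages."""
    return {cp: raw.decode(cp, errors="replace") for cp in codepages}


# Bytes that a DOS-era tool using codepage 437 might store for this name.
stored = "été.txt".encode("cp437")

for codepage, name in decode_candidates(stored).items():
    # Only a reader that guesses the writer's codepage recovers the name.
    print(codepage, repr(name))
```

Nothing in the stored bytes tells the reader which of these results was intended.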

The other common problem is that many buggy ZIP tools also ignore the 
recommendations in the spec and encode an absolute filename (with a leading / 
or sometimes even a drive letter), or even keep backslashes as directory separators.
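
As a rough sketch, a reader could repair such names along the lines of the quoted 
recommendations. The function name portable_entry_name is hypothetical, and the sketch 
assumes that every backslash was meant as a directory separator (on Unix a backslash is 
a legal filename character, so this is a guess about the writer's intent).

```python
import re


def portable_entry_name(name: str) -> str:
    """Rewrite a ZIP entry name as a relative, forward-slash path."""
    name = name.replace("\\", "/")          # backslashes -> forward slashes
    name = re.sub(r"^[A-Za-z]:", "", name)  # drop a leading drive letter
    return name.lstrip("/")                 # drop leading slashes


for raw in ("C:\\docs\\a.txt", "/etc/passwd", "docs/a.txt"):
    print(raw, "->", portable_entry_name(raw))
```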

For this reason, the Java "JAR" tool and library reduced the supported "features" for 
the ZIP files they create to a common portable format, similar to relative 
URLs on the web (but without the URL encoding with %NN hexadecimal-encoded bytes).

This is consistent with the usage of ZIP files on Unix and Linux, where the encoding is 
also not strictly defined in the UFS-like filesystems, which just store byte sequences 
for filenames that will be rendered according to the locale preferences of the local 
user. On those Unix-like systems, the current locale plays an important role in 
interpreting filenames from the filesystem!

This is very unlike NTFS or FAT32 long names on Windows, and Apple HFS for Mac OS, 
which both explicitly use Unicode: normalized to NFC and serialized with the UTF-16LE 
encoding scheme on Windows, or normalized to "Apple-HFS-NFD" and serialized with the 
UTF-8 encoding scheme on Mac OS. There are some restrictions on Mac OS, as the 
"Apple-HFS-NFD" form is a partial decomposition form which was based on Unicode 2.1 and 
has not been, and will not be, extended to include a larger Unicode set. So for 
interoperability reasons, the newer Unicode characters will be left in their current 
normalized form when sent to the filesystem to create unique filenames.

I do think that the NFC form is the best way to handle internationalized filenames 
that will fit with most OSes (including HFS, because the specific Apple-HFS 
normalization format is internal to its storage, and applications can safely use only 
precomposed filenames or convert names back to NFC when reading an HFS catalog).
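
For illustration, converting a decomposed name back to NFC can be sketched with 
Python's standard unicodedata module. Note the assumption: the sample uses a plain 
decomposed sequence, not a name actually read from an HFS catalog, and the Apple-HFS 
variant is not identical to standard NFD.

```python
import unicodedata


def to_nfc(name: str) -> str:
    """Return the filename in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize("NFC", name)


decomposed = "e\u0301.txt"  # 'e' followed by COMBINING ACUTE ACCENT
composed = to_nfc(decomposed)

# The two strings display the same but compare differently until normalized.
print(len(decomposed), len(composed), composed == "\u00e9.txt")
```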

The least ambiguous format is then the one using an NTFS signature on Windows, and most 
Zip tools for Windows will now (in their current version) read such zip catalogs 
correctly even on Windows 95/98/98SE/ME.

The Joliet extension to ISO9660/HSFS is an additional catalog that provides 
NTFS/FAT32-like long filenames on top of the basic ISO9660 filename format. It does 
not exempt the application from creating "short names" using the portable ASCII-based 
filenames, in a way similar to what FAT32 does on Windows to complement the basic FAT 
format used on DOS/Windows 3.x with an ambiguous "OEM codepage" encoding.

When Windows reads a CDROM catalog, it will first display the Joliet catalog if 
present, else the basic ISO9660 catalog. The RockRidge extension is ignored.

When Unix/Linux reads a CDROM catalog, most often it will first display a RockRidge 
catalog if present (which allows mapping UFS semantics and attributes), ignoring the 
Joliet catalog, and then fall back to the basic ISO9660 catalog. On Linux, there are 
other methods to get less basic filenames, including a convention to store an 
additional catalog file (which is a simple Unix plain-text file mapping ISO9660 names 
to long Unix names, but with once again an ambiguous encoding), or additional 
filesystem drivers to also consider the now common Joliet extension, based on Unicode.

So if you use Linux to create CDROM images containing both a RockRidge and Joliet 
catalog extension, it is normal that you see the correct names on Linux with the same 
locale. But as you can see, the current locale is important for correct display of the 
CDROM catalog. Windows does not use this RockRidge catalog by default (unless you 
install a driver that supports it).

If the Windows output is "garbled", it just means that the Joliet extension created by 
your Linux tool is not correctly encoded. Either your Linux tool is buggy when 
it computes the Joliet extension, or your system lacks some support libraries, or 
the Joliet extension creation works only from a specific locale and is limited to 
the ISO-8859-1 set: it does not really support Unicode, but just adds a 
trailing 00 byte to each character to map ISO-8859-1 to UTF-16LE.
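
The last possibility can be sketched in Python. The function naive_pad is a 
hypothetical stand-in for such a shortcut, not code from any real CD-writing tool, and 
the byte order simply follows the description above. It shows why the shortcut looks 
right for ISO-8859-1 names and fails for anything else, such as the U+026B character 
from the original question.

```python
def naive_pad(name: str) -> bytes:
    """Encode as ISO-8859-1, then append a 00 byte after each character."""
    latin1 = name.encode("latin-1")  # raises UnicodeEncodeError outside Latin-1
    return b"".join(bytes([b, 0]) for b in latin1)


# Inside ISO-8859-1 the shortcut happens to match a real UTF-16LE encoding.
print(naive_pad("café") == "café".encode("utf-16-le"))

# Outside ISO-8859-1 it cannot work at all.
try:
    naive_pad("\u026b")  # LATIN SMALL LETTER L WITH MIDDLE TILDE
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```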

Normally, Joliet extensions can be read on any localization of Windows, independently 
of the current Windows locale, even in command-line mode, where the Unicode-based 
filename is mapped/converted to the current OEM set (which can be changed by the CHCP 
command-line tool).

To Edward: there's no user-settable encoding in a user environment on Windows. 
Filesystems on Windows are specified to use a global host setting, or an encoding fixed 
by the filesystem type. The OS will make the appropriate conversions when needed, by 
presenting to the application the Unicode filename label, which must be consistent at 
least with the ASCII-based encoding of the fallback short name, which is an equivalent 
name for accessing the same file. So on Windows, each file can have multiple filenames, 
and this must not create collisions.

Filesystem encodings (and in some cases also the URLs of some websites that do not 
respect the correct labeling for their page and form encodings) are really a nightmare. 
There is no easy solution other than to use filenames only as keys without user-readable 
semantics. So it is much better to create CDROMs that use only the portable ISO9660 
format, and use an additional mapping file to display user-readable labels, which will 
be stored in this mapping file as UTF-8 or one of the 3 UTF-16 encoding schemes. Your 
application can then implement a URL resolver to use these names if it uses a web-like 
navigational system.
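
A minimal sketch of such a mapping file, in Python. The layout is an assumption made 
only for illustration (one entry per line, ISO9660 name and label separated by a tab, 
stored as UTF-8), and the names load_labels and ENTRY001.HTM are hypothetical.

```python
import io


def load_labels(stream) -> dict:
    """Read 'ISO9660-name<TAB>label' lines into a dictionary."""
    labels = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, label = line.partition("\t")
        labels[key] = label
    return labels


# In practice the stream would come from open(path, encoding="utf-8").
sample = io.StringIO("# ISO9660 name\tlabel\nENTRY001.HTM\t\u026ba\n")
labels = load_labels(sample)
print(labels.get("ENTRY001.HTM", "ENTRY001.HTM"))
```

The files on the disc keep plain ASCII names, and only the mapping file has to carry 
the Unicode labels.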

-- Philippe.
----- Original Message ----- 
From: "Edward H Trager" <[EMAIL PROTECTED]>
> On Fri, 30 May 2003 [EMAIL PROTECTED] wrote:
> > I wonder if anyone here has ideas on these matters.
> > Peter
> >
> > ----- Forwarded by Peter Constable/IntlAdmin/WCT on 05/30/2003 10:56 PM
> > I have 3 LinguaLinks lexicons that I have converted into HTML pages - one
> > for each entry. The languages use non-ANSI characters, so I also did a
> > Unicode conversion at the same time.
> > [snip]
> >
> > Everything works very well except that I cannot burn the files onto a CD
> > because of the unicode values in the filenames. Roxio and Nero CD-burners
> > don't accept some of the higher values found in the file names (using
> > Joliet, ISO9660 and UDF). Anyone have any ideas how to deal with this?
> > For example, a filename with unicode value 026B, a tilde lower case L,
> > causes problems.
> 
> I did a test burning of over 40 UTF-8 file names in seven different
> scripts (Arabic, Simplified & Traditional Chinese, Greek, Japanese, Latin,
> and Thai) to a CD in ISO9660 format with both Rockridge (Unix) and Joliet
> (MS) extensions using Joerg Schilling's Open Source "mkisofs" and
> "cdrecord" version 2.0 tools
> (http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/mkisofs.html)
> on Linux (SuSE 7.3).
> 
> The resulting CD preserved the UTF-8 filenames perfectly: I could view the
> file names using both "ls" from mlterm (http://mlterm.sourceforge.net/)
> and from the Mozilla browser when run under a UTF-8 locale (en_US.UTF-8)
> on Linux.
> 
> The file names did not appear correct on Windows though, but I think this
> is only because I don't know how to set the locale properly on Windows
> 2000.
> 
> Note that I didn't do anything special when burning the CD: I just burned
> it using the same options (Rockridge and Joliet extensions) that I always
> use, and there was no need to zip or tar the files.  Email me if you need
> the details of how to do it.
