On Wed, 2009-02-18 11:40:43 +0100, Francois Deppierraz <[email protected]> 
wrote:
> 
> unknown encoding -> Unicode -> UTF-8 -> Unicode -> unknown encoding
> 
> I'm googling a bit to find out how other projects have implemented that.

I thought a bit longer about the topic and tried to recall the more
interesting cases where filename encoding has been an issue for me in
the past.

  -1-   Using DOS (FAT) or Windows, I ran into restrictions either in
        what could be represented in the FS (DOS: some chars reserved,
        names generally interpreted with a locally configured codepage)
        or in artificial limits imposed by the operating system for
        compatibility (Windows+NTFS (UTF-16): certain chars are
        disallowed to keep old DOS applications happy).

  -2-   Samba+NFS with mixed Windows and Linux clients.  Initially,
        the Samba server wasn't really configured wrt. filename
        encoding, so (Windows) clients saved files with CP850 (Western
        Europe, containing German umlauts) encoding. Later on, we
        switched the local store to UTF-8 throughout, which
        "invalidated" the existing filenames: they were broken in the
        sense of not being valid UTF-8.

  -3-   Shared NFS used from different machines/users using/preferring
        different encodings. This was eventually cleaned up by using
        UTF-8 throughout.


To sum up, every time the solution was to convert the filenames to
UTF-8 (or UTF-16 in the NTFS case) for storing. With this in mind,
I'd implement exactly this:

  * Store a file to Tahoe:
        * If an iconv call converting the filename from UTF-8 to UTF-8
          (without //TRANSLIT set) succeeds, I'd accept the filename
          and store it (internally) as UTF-8.
        * If that didn't work, refuse the filename and *force* the
          user to supply a from-charset name so that it can be
          converted to UTF-8. In any case, always allow the user to
          supply a from-charset name.
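
The store-side rule above can be sketched in Python roughly as follows.
This is only an illustration under assumptions: a strict UTF-8 decode
stands in for the iconv UTF-8 to UTF-8 round trip without //TRANSLIT,
and the function name to_internal_name and its from_charset parameter
are hypothetical, not part of Tahoe.

```python
from typing import Optional


def to_internal_name(raw: bytes, from_charset: Optional[str] = None) -> bytes:
    """Return the filename as UTF-8 bytes, or raise ValueError.

    Hypothetical helper: 'raw' is the filename as found on disk,
    'from_charset' is the optional user-supplied source charset.
    """
    if from_charset is not None:
        # The user told us the source encoding: convert it to UTF-8.
        try:
            return raw.decode(from_charset).encode("utf-8")
        except (UnicodeDecodeError, LookupError) as exc:
            raise ValueError(
                "cannot convert filename from %r" % from_charset) from exc
    # No charset given: accept the name only if it already is valid UTF-8.
    # A strict decode plays the role of iconv without //TRANSLIT here.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(
            "filename is not valid UTF-8; supply a from-charset") from exc
    return raw
```

With this sketch, a CP850-encoded name such as the ones from case -2-
is refused unless the caller passes from_charset="cp850", in which case
it is converted and stored as UTF-8.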

  * Restore a file from Tahoe:
        * Just try to use the UTF-8 encoded filename in the local
          filesystem. Fail loudly if we get an error upon open().
        * Always offer some switch to choose a to-charset name and
          pass the internal buffer through iconv.
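
The restore side could look roughly like this, again as a sketch under
assumptions: the names to_local_name and restore and the to_charset
parameter are hypothetical, and Python's codecs stand in for iconv.
Errors from the conversion or from open() are deliberately not caught,
so the operation fails loudly.

```python
import os
from typing import Optional


def to_local_name(internal: bytes, to_charset: Optional[str] = None) -> bytes:
    """Map the internal UTF-8 filename to the bytes used locally."""
    if to_charset is None:
        # Default: use the UTF-8 name as it is.
        return internal
    # Strict conversion: raises UnicodeEncodeError if a character
    # cannot be represented in the requested charset.
    return internal.decode("utf-8").encode(to_charset)


def restore(internal: bytes, data: bytes, directory: str,
            to_charset: Optional[str] = None) -> bytes:
    """Write 'data' into 'directory' under the converted filename."""
    name = to_local_name(internal, to_charset)
    path = os.path.join(os.fsencode(directory), name)
    # Any OSError from open() propagates to the caller.
    with open(path, "wb") as f:
        f.write(data)
    return path
```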


Besides clients using local file access, there's also the web
interface. But I guess this is quite a simple thing, because UTF-8
should basically always work, as long as '<', '>', '"' and '&' are
quoted.
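
For the web interface, the quoting could be done with Python's standard
html.escape, as in the sketch below. The function render_entry and the
<li> markup are hypothetical and only serve as an example; note that
html.escape with quote=True also escapes the single quote, which goes
slightly beyond the four characters listed above.

```python
import html


def render_entry(name_utf8: bytes) -> str:
    """Return an HTML list item for an internally stored filename."""
    # Internal names are UTF-8 by construction, so a strict decode is fine.
    name = name_utf8.decode("utf-8")
    # Quote '<', '>', '&' and '"' (plus "'") before embedding in HTML.
    return "<li>" + html.escape(name, quote=True) + "</li>"
```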

MfG, JBG

-- 
      Jan-Benedict Glaw      [email protected]              +49-172-7608481
Signature of:                 Friends are relatives you make for yourself.
the second  :

_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
