Title: RE: Roundtripping Solved

Mike Ayers wrote:
>         Things that are impossible that I've noticed so far:
>         -  A metainformation system without holes in it.

UNIX filesystems (old ones, at least) are an example of an information system that carries no metainformation about the encoding of filenames.

As for the holes, there are some gray areas in my solution, but they can be worked out.


>         -  Addressing files with intermixed locales reliably.
>  In a UTF-8 and ISO 8859-1 mixed environment, for instance,
> there is no way to know whether <c3> <a9> indicates "Ã©" or
> "é".  The Unix locale architecture does not permit mixed
> locales.  What you propose is a locale of "ISO 8859-1 or
> UTF-8, your guess is as good as mine".

On UNIX, addressing files has nothing to do with locales. Each file can be addressed reliably in any locale (*); it is only the interpretation of the name that is unreliable. And the UNIX locale architecture definitely DOES permit mixed locales. Hence the issue. The "ISO 8859-1 or UTF-8, your guess is as good as mine" situation is not something I am trying to introduce; it is already there. What I am trying to do is allow that confusion to endure a while longer. Which is not bad in itself. I think it can actually make the transition quicker, not slower.
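The ambiguity above is easy to demonstrate. This is a minimal sketch (not from the original post): the same two bytes decode cleanly under both encodings, so no algorithm can tell from the bytes alone which reading the file's creator intended.

```python
# The byte pair C3 A9 is valid in both encodings, with different meanings.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")         # one character: "é"
as_latin1 = raw.decode("iso-8859-1")  # two characters: "Ã©"

print(as_utf8, len(as_utf8))      # é 1
print(as_latin1, len(as_latin1))  # Ã© 2
```

Both decodes succeed without error, which is exactly why the interpretation, not the addressing, is the unreliable part.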

(*) MBCS encodings can have some issues, similar to those of UTF-8. But: A - a lot of it does work; B - what doesn't is a pain; C - those users typically mix only one MBCS with ASCII (so, no real mix at all). Europe, on the other hand, already mixes several Latin encodings. When those get mixed with UTF-8, problems will be more frequent than they are with MBCS.
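One common way to cope in a mixed Latin/UTF-8 environment is a heuristic rather than a guarantee. The sketch below is a hypothetical `guess_encoding` helper, not anything proposed in the post: bytes that validate as UTF-8 are very likely UTF-8, because legacy Latin text only rarely happens to form well-formed multi-byte sequences.

```python
def guess_encoding(name: bytes) -> str:
    """Heuristic classification of a filename's byte string."""
    try:
        name.decode("ascii")
        return "ascii"          # unambiguous under all the encodings involved
    except UnicodeDecodeError:
        pass
    try:
        name.decode("utf-8")
        return "utf-8"          # valid UTF-8 with non-ASCII: almost certainly UTF-8
    except UnicodeDecodeError:
        return "legacy (e.g. ISO 8859-1)"  # invalid UTF-8: must be a legacy encoding

print(guess_encoding(b"plain"))        # ascii
print(guess_encoding(b"caf\xc3\xa9"))  # utf-8
print(guess_encoding(b"caf\xe9"))      # legacy (e.g. ISO 8859-1)
```

Note the heuristic cannot say *which* legacy encoding was used, and the "valid UTF-8" branch is precisely the gray area the post admits: the bytes could still be a legacy pair like "Ã©".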


>         -  A scheme that translates all possible Unix
> filenames to unique and consistent Windows filenames.  Case
> issues alone kill this.

Well, Windows actually does have the ability to handle case-sensitive filenames. But yes, that ability is not widely used.

A reliable translation of UNIX filenames to Windows filenames is just one of the possible goals (or uses) of my approach. If a 100% reliable solution cannot be found, that does not mean we shouldn't be looking for the next best approach.

My specific requirement was to store UNIX filenames in a Windows database and display them properly on Windows. Case issues, '*' in filenames and the like pose no problem for that part of the requirements. I have seen filenames consisting solely of a newline, and I can deal with them.
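For the store-and-display use case, a lossless round trip of arbitrary filename bytes is what matters, not a "pretty" translation. As a sketch of one way to get that (an assumption on my part, not the post's mechanism), Python's `surrogateescape` error handler maps undecodable bytes to reserved code points and back, so even a name that is just a newline or invalid UTF-8 survives the trip:

```python
# Lossless round trip of arbitrary filename bytes via surrogateescape.
def to_text(name: bytes) -> str:
    # Undecodable bytes become lone surrogates instead of raising.
    return name.decode("utf-8", errors="surrogateescape")

def to_bytes(text: str) -> bytes:
    # The surrogates are turned back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")

for raw in (b"\n", b"report*.txt", b"caf\xe9"):
    assert to_bytes(to_text(raw)) == raw
```

The surrogate-carrying strings may display as replacement characters, but nothing is lost in storage, which is the part the requirement actually needs.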

But let's do talk about translating UNIX filenames to Windows filenames. Users who need the interoperability have learned not to use tricky filenames, and not to use filenames that differ only in case (which is a bad idea in itself; our brains don't process it well). So they adapted and have no problems now. But they have been doing this with legacy encodings. Often more than one, especially when they have lots of files and use a language where only a few letters are non-ASCII, and they were always able to figure out which file is which. The mismatch only ever affected display, never access. A switch to UTF-8 will bring up lots of issues for them. You may think they will welcome the day and say "finally, I can solve this mess". I think they will say "oh darn, it all worked before, is this really necessary".

Getting rid of legacy encodings is a goal, but not for many users. For most of them, filenames are just a tool; their business comes first. Some can't afford to dedicate a day to converting all their filenames.


Lars
