Aristotle A. wrote:
> The standard Shift_JIS encoding cannot be mapped to Unicode
> without ambiguities.

This is starting to get way off topic... If anyone is bothered by this
discussion, just say so and I will continue off-list.

> Worse, there is a whole range of variants
> of this charset that all go by the name of Shift_JIS, yet map
> codepoints that are unused in the official Shift_JIS standard
> to different characters. If you have something labelled
> Shift_JIS, you cannot safely convert it to Unicode without
> risking data loss.

Yes... Shift_JIS is a mess as long as people use proprietary extensions
that Unicode doesn't recognize. But the problem lies no more with
Unicode than with Shift_JIS itself.
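
To make the ambiguity concrete, here is a minimal sketch using the
codecs bundled with Python (Python is only my choice for illustration,
and decode_all is a made-up helper name). It decodes the same octets
under three codecs that are all casually called "Shift_JIS". The first
sample is a code point that JIS X 0208 does define but that vendors map
differently; the second sits in a row that JIS X 0208 leaves unassigned.

```python
# Octet sequences that the different "Shift_JIS" flavours disagree on.
WAVE_DASH = b"\x81\x60"   # defined in JIS X 0208, mapped differently by cp932
NEC_ROW13 = b"\x87\x40"   # row 13: unassigned in JIS X 0208, used by extensions

def decode_all(raw, codecs=("shift_jis", "cp932", "shift_jis_2004")):
    """Decode raw with each codec; map codec name to text, or None on failure."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

for sample in (WAVE_DASH, NEC_ROW13):
    for codec, text in decode_all(sample).items():
        if text is None:
            shown = "undecodable"
        else:
            shown = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(f"{sample.hex()} as {codec}: {shown}")
```

If the label only says "Shift_JIS", a converter has to guess which of
these tables the author meant.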

But the industry standard JIS X 0213 aims to fix this problem and
converge Shift_JIS to something sane. Unicode has provided round-trip
conversion for JIS X 0213 since version 3.0 or so... Yes, I just
confirmed. Here [1][2] are the mapping tables between JIS X 0213:2004
and Unicode. Also, I found [3], which discusses many of these problems
with Japanese encodings and confirms that Unicode 3.1 provides lossless
conversion for JIS X 0213.

[1] http://x0213.org/codetable/sjis-0213-2004-std.txt
[2] http://x0213.org/codetable/jisx0213-2004-std.txt
[3] http://www.jbrowse.com/text/unij.html
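
As a rough illustration of what round-trip conversion means here, the
sketch below uses Python's "shift_jis_2004" codec, which as far as I
know follows the Shift_JIS form of JIS X 0213:2004. The helper name
round_trips and the sample string are my own; this is a quick check,
not a validation of the mapping tables above.

```python
def round_trips(text, codec):
    """Return True if text survives Unicode -> codec -> Unicode unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # Covers both encode and decode failures.
        return False

# The circled digit is part of JIS X 0213 but not of JIS X 0208.
sample = "日本語①"

print("shift_jis_2004:", round_trips(sample, "shift_jis_2004"))
print("shift_jis:     ", round_trips(sample, "shift_jis"))
```

The plain "shift_jis" codec fails only because of the circled digit,
which is exactly the kind of character people used to get from vendor
extensions.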

The problem is that many people still use JIS X 0208 with proprietary
extensions... It is more or less the same problem as many people still
using ISO-8859 when we already have UTF-8...

> The problem is that on Unix, at least, the filesystem actually
> gives you nothing but octet sequences. The only invalid filenames
> are ones containing slashes or nulls. Everything else is fair
> game. Filenames can be any random garbage whatsoever.

Yes... This is something that Unix could improve a lot.
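
For what it is worth, a tool that wants to treat names as text without
losing such octets can smuggle the undecodable bytes through. The sketch
below uses Python's "surrogateescape" error handler (PEP 383), which is
what Python itself applies to filenames on Unix; the helper names
to_text and to_bytes are hypothetical, and nothing here touches a real
filesystem.

```python
# "日本.txt" encoded as Shift_JIS; these octets are not valid UTF-8.
RAW_NAME = b"\x93\xfa\x96\x7b.txt"

def to_text(raw):
    """Decode as UTF-8, carrying undecodable bytes along as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")

def to_bytes(text):
    """Inverse of to_text: recover the exact original octets."""
    return text.encode("utf-8", errors="surrogateescape")

name = to_text(RAW_NAME)
print(ascii(name))                    # escaped form of the smuggled name
print(to_bytes(name) == RAW_NAME)     # the octets survive the round trip
print(ascii(RAW_NAME.decode("shift_jis")))  # the name the user actually meant
```

The round trip is lossless, but the intermediate string is not
displayable text: the program still has no idea that the user meant
Shift_JIS.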

> Take a look at the contortions that the GNOME people had to go
> through for the file selector dialog and similar things where
> gtk+ and friends touch upon the filesystem.

Yes, I know the KDE/Qt side of this story. KDE went rather overboard on
this point... the code that handles URL-escaping and local filename
encoding is the same. As a consequence, KDE allows you to put "/" in
filenames. It transparently converts it back and forth to "%2f" on the
filesystem for you.
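
The idea can be sketched in a few lines. This is not KDE's code, just a
minimal illustration with hypothetical helper names; it escapes "%" as
well so that a name which literally contains "%2f" still maps back to
itself.

```python
def escape_name(name):
    """Make a display name storable: escape '%' first, then '/'."""
    return name.replace("%", "%25").replace("/", "%2f")

def unescape_name(stored):
    """Inverse of escape_name: restore '/' first, then '%'."""
    return stored.replace("%2f", "/").replace("%25", "%")

for display in ("either/or.txt", "100%.txt", "literal %2f.txt"):
    stored = escape_name(display)
    assert unescape_name(stored) == display
    print(f"{display!r} is stored as {stored!r}")
```

The price is that the name shown to the user and the name on disk are no
longer the same string, so every tool has to agree on the escaping.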

> Then from a SCM design point of view the question turns into
> this: do we want to the repository to be unable to store some
> actual files that some users may conceivably have good reason to
> have? (See Shift_JIS mess above.)

I think that this is somewhat expected, specifically for Shift_JIS, due
to the problems above. In fact, the user is already forced to stick to
some subset of Shift_JIS, because you need screen fonts that display
the characters. Using characters from proprietary extensions will
always leave you open to ambiguities: you change the font and your
filenames change. You must either restrict yourself in what characters
you use, or upgrade to JIS X 0213 and enjoy full compatibility with Unicode.

> The problem space is unfortunately much bigger and messier than
> what you have portrayed in your mail. IMO the fact that git punts
> is regrettable, but also pretty much inevitable, so it is not
> something that I hold against git. There is simply no truly sane
> way of untangling this incredibly ugly yarnball.

Well, Windows does it (NTFS stores filenames as UTF-16, and Windows of
course supports Shift_JIS), Mac OS X does it, Subversion does it, Bazaar
does it...

It is possible and feasible to support the Shift_JIS encoding... you may
not be able to store all possible filenames, only those that are
standard and convertible to Unicode, but it is possible nevertheless.
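
A minimal sketch of such a policy, assuming the tool is told which
encoding the user's filenames are in: accept a name only if it decodes
cleanly and encodes back to the very same octets. The function name
importable_name is hypothetical, and I am not claiming that this is how
Subversion or Bazaar implement it.

```python
def importable_name(raw, encoding="shift_jis"):
    """Return the Unicode form of raw, or None if it cannot be stored losslessly."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return None          # not valid in the declared encoding
    try:
        if text.encode(encoding) != raw:
            return None      # decodes, but does not map back to the same octets
    except UnicodeEncodeError:
        return None
    return text

# The second sample is a lone lead byte, which no Shift_JIS variant accepts.
for raw in (b"\x93\xfa\x96\x7b.txt", b"\x93"):
    text = importable_name(raw)
    verdict = "rejected" if text is None else "accepted as " + ascii(text)
    print(raw.hex(), verdict)
```

Names that fail the check are the ones the user would have to rename, or
store after declaring a different encoding.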

-- 
Juliano F. Ravasi ·· http://juliano.info/
5105 46CC B2B7 F0CD 5F47 E740 72CA 54F4 DF37 9E96

"A candle loses nothing by lighting another candle." -- Erin Majors

* NOTE: Don't try to reach me through this address, use "contact@" instead.
_______________________________________________
vcs-home mailing list
vcs-home@lists.madduck.net
http://lists.madduck.net/listinfo/vcs-home
