Aristotle A. wrote:
> The standard Shift_JIS encoding cannot be mapped to Unicode
> without ambiguities.

This is starting to get way off topic... If anyone is bothered by this
discussion, just say so and I will continue off-list.

> Worse, there is a whole range of variants
> of this charset that all go by the name of Shift_JIS, yet map
> codepoints that are unused in the official Shift_JIS standard
> to different characters. If you have something labelled
> Shift_JIS, you cannot safely convert it to Unicode without
> risking data loss.

Yes... Shift_JIS is a mess as long as people use proprietary extensions
that Unicode doesn't recognize. But the problem lies no more with Unicode
than with Shift_JIS itself.
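
To make the ambiguity concrete, here is a small sketch using Python's
standard codecs (Python is only my choice for illustration, nothing in
this thread depends on it). It decodes the same octets once with the
JIS X 0208 based "shift_jis" codec and once with Microsoft's "cp932"
variant, which is very often labelled Shift_JIS as well:

```python
# The same octets, decoded under two charsets that both get called "Shift_JIS".
raw = b"\x81\x60"

as_standard = raw.decode("shift_jis")  # Python's JIS X 0208 based codec
as_cp932 = raw.decode("cp932")         # Microsoft's variant

for label, text in (("shift_jis", as_standard), ("cp932", as_cp932)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])

# A vendor extension: these octets are assigned in CP932,
# but the row they fall in is unassigned in JIS X 0208.
extension = b"\x87\x40"
print("cp932:", extension.decode("cp932"))
try:
    print("shift_jis:", extension.decode("shift_jis"))
except UnicodeDecodeError as err:
    print("not decodable as standard Shift_JIS:", err.reason)
```

The first pair decodes to two different code points (WAVE DASH versus
FULLWIDTH TILDE), so a label of "Shift_JIS" alone does not tell you
which mapping was intended.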

But the industry standard JIS X 0213 aims to fix this problem and bring
Shift_JIS back to something sane. Unicode has provided round-trip
conversion for JIS X 0213 since version 3.0 or so... Yes, I just
confirmed. Here [1][2] are the mapping tables between JIS X 0213:2004
and Unicode. Also, I found [3], which discusses many of these problems
with Japanese encodings and confirms that Unicode 3.1 provides lossless
conversion for JIS X 0213.
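
As a rough check of what round-trip conversion means in practice, the
sketch below uses Python's "shift_jis_2004" codec, which targets
JIS X 0213:2004. The helper name survives() is made up for this example:

```python
def survives(text: str, codec: str) -> bool:
    """Return True if text encodes to `codec` and decodes back unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # The character has no assignment in this charset.
        return False

# CIRCLED DIGIT ONE is a common vendor extension to JIS X 0208
# that JIS X 0213 adopted as a standard character.
for codec in ("shift_jis", "shift_jis_2004"):
    print(codec, survives("\u2460", codec))
```

This only tests individual strings. It is not a proof that every
JIS X 0213 character round-trips; for that you need the mapping tables.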


The problem is that many people still use JIS X 0208 with proprietary
extensions... It is more or less the same problem as people still using
ISO-8859 when we already have UTF-8...

> The problem is that on Unix, at least, the filesystem actually
> gives you nothing but octet sequences. The only invalid filenames
> are ones containing slashes or nulls. Everything else is fair
> game. Filenames can be any random garbage whatsoever.

Yes... This is something that Unix could improve a lot.
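
For what it is worth, this is how a tool can at least carry such names
around without losing octets. The sketch uses Python's "surrogateescape"
error handler (the one PEP 383 introduced for exactly this case); the
sample filename is my own invention:

```python
# Octets that form a valid Shift_JIS name but are not valid UTF-8.
raw_name = b"\x93\xfa\x96\x7b.txt"

try:
    raw_name.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)

# surrogateescape maps each undecodable byte to a lone surrogate
# (U+DC80..U+DCFF), so the original octets can be recovered exactly.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
restored = as_text.encode("utf-8", errors="surrogateescape")
print("octets preserved:", restored == raw_name)

# Only someone who knows the intended charset can recover the real name.
print(raw_name.decode("shift_jis"))
```

Note that this preserves the bytes, not the meaning: the string in
between is not displayable text until you know which charset was meant.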

> Take a look at the contortions that the GNOME people had to go
> through for the file selector dialog and similar things where
> gtk+ and friends touch upon the filesystem.

Yes, I know the KDE/Qt side of this opera. KDE went rather overboard on
this point... the code that handles URL escaping and local filename
encoding is the same. As a consequence, KDE allows you to put "/" in
filenames. It transparently converts it back and forth to "%2f" for you
on the filesystem.
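
The idea can be sketched in a few lines. This is not KDE's actual code,
only an illustration under my own assumptions: the function names are
hypothetical, and I escape just "/" and the escape character "%" itself
so that the mapping stays reversible:

```python
def to_disk_name(display_name: str) -> str:
    """Escape "%" first, then "/", so the result contains no slash."""
    return display_name.replace("%", "%25").replace("/", "%2f")

def to_display_name(disk_name: str) -> str:
    """Undo to_disk_name(): restore "/" first, then the literal "%"."""
    return disk_name.replace("%2f", "/").replace("%25", "%")

name = "reports 2008/2009"
on_disk = to_disk_name(name)
print(on_disk)
print(to_display_name(on_disk) == name)
```

Escaping "%" as well is what keeps a name that already contains "%2f"
from being mistaken for an escaped slash on the way back.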

> Then from a SCM design point of view the question turns into
> this: do we want to the repository to be unable to store some
> actual files that some users may conceivably have good reason to
> have? (See Shift_JIS mess above.)

I think that this is somewhat expected, specifically for Shift_JIS, due
to the problems above. In fact, users are already forced to restrict
themselves to some subset of Shift_JIS, because they need screen fonts
that display the characters. Using characters that are in proprietary
extensions will always leave you open to ambiguities: you change the
font and your filenames change. You must either restrict yourself in
what characters you use, or upgrade to JIS X 0213 and enjoy full
compatibility with Unicode.

> The problem space is unfortunately much bigger and messier than
> what you have portrayed in your mail. IMO the fact that git punts
> is regrettable, but also pretty much inevitable, so it is not
> something that I hold against git. There is simply no truly sane
> way of untangling this incredibly ugly yarnball.

Well, Windows does it (NTFS is UTF-16 encoded, and Windows of course
supports Shift_JIS), Mac OS X does it, Subversion does it, Bazaar does it...

It is possible and feasible to support Shift_JIS encoding... you may not
be able to store all possible filenames, only those that are standard
and convertible to Unicode, but it is possible nevertheless.
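
A minimal sketch of that policy, assuming the tool is told which
charset the working copy uses. The function name and the default codec
are my own choices for illustration, not how Subversion or Bazaar
actually do it:

```python
def importable_name(raw: bytes, codec: str = "shift_jis_2004") -> str:
    """Convert filename octets to Unicode, refusing lossy conversions.

    Raises UnicodeDecodeError for octets the charset does not assign,
    and ValueError if the text would not convert back to the same octets.
    """
    text = raw.decode(codec)  # strict: no replacement characters
    if text.encode(codec) != raw:
        raise ValueError(f"{raw!r} does not round-trip through {codec}")
    return text

for candidate in (b"\x93\xfa\x96\x7b.txt", b"\x93"):
    try:
        print("accepted:", importable_name(candidate))
    except (UnicodeError, ValueError) as err:
        print("rejected:", candidate, "-", err)
```

Names that are rejected here are exactly the ones the repository would
be unable to store; everything else can be kept as Unicode and written
back in whatever charset the checkout uses.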

Juliano F. Ravasi
5105 46CC B2B7 F0CD 5F47 E740 72CA 54F4 DF37 9E96

"A candle loses nothing by lighting another candle." -- Erin Majors

* NOTE: Don't try to reach me through this address, use "contact@" instead.
vcs-home mailing list
