On Wed, Dec 21, 2011 at 01:51:56PM +1100, Martin Pool wrote:
> We have a question in <https://bugs.launchpad.net/bugs/794353> and
> <http://bugs.python.org/issue13643> about what encoding bzr and Python
> ought to assume for file names if there is no locale configured.
>
> As a specific example, if you run a Python program from cron, it has
> no locale by default. It tries to decode filenames as ascii. If it
> encounters a non-ascii filename, it will likely crash. People hit
> this kind of thing a lot with bzr; we have put in a workaround but it
> seems it would be better to fix it in Python.
>
> My impression is the vast majority of filesystems use utf-8 names, and
> that other Ubuntu software (Nautilus? U1?) assumes this will generally
> be true. Does Ubuntu have any policy that filenames ought to be in
> UTF-8?

No, because it would in practice be impossible to enforce such a policy.

Python's notion of a "file system encoding" is fundamentally
wrong-headed on Unix. Far from using UTF-8 names, Unix file systems are
(perhaps unfortunately) encoding-agnostic. Unix file names are byte
sequences with the only forbidden octets being NUL and '/'; there's
nothing else you can assume.

In practice file names will typically be in the locale encoding of the
process that created them; Ubuntu has defaulted to UTF-8 for all new
installations since 5.04, but real-world exceptions include people's
music collections and source trees that either predate the widespread
shift of Unix users to UTF-8 or that started life on some other
operating system. It is perfectly possible and indeed realistic for the
same file system to contain files in a variety of encodings.

UTF-8 is relatively easy to distinguish heuristically from other
encodings if you have enough text to work with, and in such cases I
think it's reasonable to try UTF-8 first and then fall back to something
else (for example, man-db does this for the contents of manual pages).
It is not clear that that is viable for file names, because the amount
of text involved is small and so ambiguities are more likely, but it
might be worth trying. However, my feeling is that this is the sort of
decision you have to make application-by-application rather than at the
language level, as the consequences of a mistake will be different.

-- 
Colin Watson [[email protected]]

-- 
ubuntu-devel mailing list
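
The "try UTF-8 first, then fall back" heuristic mentioned above could look roughly like this in Python. This is only a minimal sketch, not what bzr or man-db actually do. The helper names `display_name` and `list_names` are hypothetical, and Latin-1 as the fallback is an assumption made purely for illustration; a real application would have to choose its own fallback and decide what a wrong guess costs it.

```python
import os


def display_name(raw: bytes, fallback: str = "latin-1") -> str:
    """Return a best-effort text form of a file name given as raw bytes."""
    try:
        # Strict UTF-8 decoding rejects most byte strings that were
        # written in a legacy 8-bit encoding.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: Latin-1 is just an illustrative fallback. It maps
        # every byte to some character, so this never raises, but the
        # result may not be what the file's creator intended.
        return raw.decode(fallback, errors="replace")


def list_names(directory: bytes = b"."):
    """Pair each raw file name with a display string.

    Passing bytes to os.listdir() makes it return names as bytes, so the
    original octets are kept for later open()/rename() calls.
    """
    return [(raw, display_name(raw)) for raw in os.listdir(directory)]
```

Keeping the raw bytes alongside the decoded string means a wrong guess only affects what is displayed, not which file gets opened.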
