2012/7/17 Julian Bradfield <[email protected]>:
> On 2012-07-16, Philippe Verdy <[email protected]> wrote:
>> I am also convinced that even Shell interpreters on Linux/Unix should
>> recognize and accept the leading BOM before the hash/bang starting
>> line (which is commonly used for filetype identification and runtime
> The kernel doesn't know or care about character sets. It has a little
> knowledge of ASCII (or possibly EBCDIC) hardwired, but otherwise it deals
> with 8-bit bytes. It has no concept of "text file".
Yes, I know. But most tools and scripts should know what type of file they are operating on. Unfortunately the tools are just as agnostic, and rely on things that do not survive transport protocols, such as filename conventions. Content signatures are a well-established practice; even the hash-bang is just one of these many signatures, and I don't see why the tools that inspect these signatures to determine their behavior cannot support more of them. The UTF-8 BOM is generic enough, and is now used in so many contexts or inserted on the fly, that I don't see the rationale for not accepting it, when content carrying it now certainly outweighs in volume the content tagged internally with a hash-bang for Linux/Unix shells.

> A file to be interpreted by a hashbang could in principle contain
> arbitrary binary stuff, be that text in multiple encodings or just
> binary data. That stuff belongs to the input to the interpreter, not
> to the hashbang line: that line contains a filename which is not
> interpreted in any extended charset.

And why not? You could still use UTF-8 encoded text in the command line given in this hash-bang line, to supply text parameters or other information, leaving the rest uninterpreted by the shell and left to the tool that is run with the supplied command line. If the rest of the file is a text script, it can continue to be interpreted using the same detected UTF-8 encoding, independently of the user's locale or console settings. Of course this also requires collaboration from the tool executed from the supplied command line, but I see no reason why these scripts (and the underlying filesystems, when running in the newly supplied locale) cannot run with UTF-8 internally and natively (notably shell interpreters).
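To make the idea of stacked content signatures concrete, here is a minimal Python sketch of what a BOM-tolerant tool could do with the start of a script. It is only an illustration under stated assumptions: the function name `parse_script_header` is hypothetical, and this does not describe how any existing kernel or shell behaves (as noted above, the kernel just looks at raw bytes). The sketch accepts an optional UTF-8 BOM (bytes EF BB BF), then looks for the `#!` signature and decodes the command line as UTF-8.

```python
import codecs


def parse_script_header(data: bytes):
    """Return (has_bom, command) for the start of a script file.

    Hypothetical helper: `command` is the text after "#!" on the first
    line, or None when no hash-bang follows the optional BOM.
    """
    # First signature: the UTF-8 BOM, bytes EF BB BF.
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data

    # Second signature: the hash-bang, checked after skipping the BOM.
    if not body.startswith(b"#!"):
        return has_bom, None

    # Keep only the first line; the rest belongs to the interpreter.
    first_line = body[2:].split(b"\n", 1)[0]

    # Assumption of this sketch: the command line is UTF-8 text.
    # Invalid bytes are replaced rather than raising an error.
    command = first_line.decode("utf-8", errors="replace").strip()
    return has_bom, command


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"
    print(parse_script_header(sample))
```

A tool using such a check could then keep reading the rest of the file as UTF-8 whenever `has_bom` is true, regardless of the user's locale, which is the behavior argued for above.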

