Note that this message contains some embedded utf-8 zero-width
non-breaking spaces.  If you can't see them....  good.  :)  You should
be able to figure out where they are from context, though.

On Tue, Oct 20, 2009 at 1:26 PM, Tony Mechelynck wrote:
>
> On 20/10/09 05:17, pansz wrote:
>> If you use Linux and insist BOM in utf-8, you'll eventually hit the wall.
>
> I have already noticed personally (and the Unicode Consortium's FAQ also
> mentions) that use of a BOM conflicts with the #! shebang at the start
> of shell scripts. You call that "hitting the wall"? I just call that a
> warning that bash doesn't know about Unicode.

That issue is entirely separate from whether or not bash groks Unicode
- the #! magic is handled in the kernel.  If the first two bytes of a
file passed to exec() are 0x23 0x21, then the remainder of the line is
used as an interpreter.  0xEF 0xBB 0xBF 0x23 0x21 doesn't cut it.
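
For anyone who wants to look at the bytes directly, here is a rough
sketch of that two-byte check done from the shell.  The helper names
leading_hex and shebang_status are made up for this example, and the
real test of course lives in the kernel, not in od.

```shell
# Sketch only: classify a file by its leading bytes.
# leading_hex and shebang_status are hypothetical helper names.

# Print the first N bytes of FILE as lowercase hex with no separators.
leading_hex() {
    od -An -tx1 -N"$2" "$1" | tr -d ' \t\n'
}

shebang_status() {
    case "$(leading_hex "$1" 5)" in
        2321*)       echo "shebang" ;;             # starts with 0x23 0x21
        efbbbf2321*) echo "bom-before-shebang" ;;  # the BOM hides the magic
        *)           echo "no-shebang" ;;
    esac
}

demo_dir=$(mktemp -d)
printf '#!/bin/echo\n' > "$demo_dir/with_magic"
printf '\357\273\277#!/bin/echo\n' > "$demo_dir/with_bom"

shebang_status "$demo_dir/with_magic"   # prints: shebang
shebang_status "$demo_dir/with_bom"     # prints: bom-before-shebang
```

printf with octal escapes is used instead of bash's $'\xEF...' quoting
only so that the sketch also runs under a plain POSIX sh.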

Consider:

bash-3.2$ echo $'#!/bin/echo' > print_my_name
bash-3.2$ chmod +x print_my_name
bash-3.2$ ./print_my_name
./print_my_name

# kernel exec'ed [ "/bin/echo", "./print_my_name" ]

bash-3.2$ echo $'\xEF\xBB\xBF#!/bin/echo' >print_my_name
bash-3.2$ ./print_my_name
./print_my_name: line 1: #!/bin/echo: No such file or directory

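# Magic number for #! not found, so the kernel refused to exec the file
# (ENOEXEC).  The calling shell fell back to running it as a shell
# script, much like [ "/bin/sh", "./print_my_name" ].
# sh choked, trying to run "<BOM>#!/bin/echo" as a command.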
#
# Note that even if sh hadn't choked on the BOM, this would have done
# the wrong thing (done nothing, instead of printing ./print_my_name).
# The problem happened before sh was called.
#
# Also note that sh would have been doing the right thing, if you
# happened to have an executable at the relative path "<BOM>#!/bin/echo".

bash-3.2$ echo $'whoami' >print_my_name
bash-3.2$ ./print_my_name
mwoznisk

# The above just to prove that the shell is being invoked.

> My shell scripts are all
> in 7-bit ASCII, so where's the problem?

The locale is a per-user setting, but the #! magic is handled in the
kernel.  There's no reasonable way to say at that level that if the user
is using a UTF-8 locale, a BOM should be recognized and ignored, and
that otherwise things should work the way they always have for the user.
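
Since the kernel won't skip the BOM for you, the practical fix is to
remove it from anything that needs a #! line.  A minimal sketch,
assuming a made-up helper called strip_bom (it is not a standard
tool):

```shell
# Sketch only: strip_bom is a hypothetical helper, not a standard tool.
# Copy FILE to stdout, dropping a leading UTF-8 BOM (EF BB BF) if present.
strip_bom() {
    if [ "$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')" = "efbbbf" ]; then
        tail -c +4 "$1"   # start at byte 4, i.e. skip the 3 BOM bytes
    else
        cat "$1"          # no BOM, pass the file through unchanged
    fi
}

demo_dir=$(mktemp -d)
printf '\357\273\277#!/bin/echo\n' > "$demo_dir/print_my_name"
strip_bom "$demo_dir/print_my_name" > "$demo_dir/print_my_name.fixed"

# The fixed copy now begins with the two magic bytes 0x23 0x21.
od -An -tx1 -N2 "$demo_dir/print_my_name.fixed"
```

If the file is being edited in vim anyway, ":set nobomb" followed by
":w" should have the same effect without any shell gymnastics.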

> For C source, I have no
> firsthand experience of whether gcc accepts a starting BOM or not, but I

bash-3.2$ echo 'int main() { printf("Matt\n"); }' > print_my_name.c
bash-3.2$ gcc print_my_name.c -o print_my_name
print_my_name.c: In function ‘main’:
print_my_name.c:1: warning: incompatible implicit declaration of built-in function ‘printf’
bash-3.2$ ./print_my_name
Matt

# Works without BOM...

bash-3.2$ echo $'\xEF\xBB\xBF''int main() { printf("Matt\n"); }' > print_my_name.c
bash-3.2$ gcc print_my_name.c -o print_my_name
print_my_name.c:1: error: stray ‘\357’ in program
print_my_name.c:1: error: stray ‘\273’ in program
print_my_name.c:1: error: stray ‘\277’ in program
print_my_name.c: In function ‘main’:
print_my_name.c:1: warning: incompatible implicit declaration of built-in function ‘printf’

# Fails with BOM

> can always use "\u1234" in the middle of an ASCII string: again, no
> problem for me. I shall accept that I am helped by the fact that I don't
> write Chinese text into program sources or shell scripts; but I do
> occasionally use Chinese text in HTML, and there the presence of a BOM
> before the <!DOCTYPE and <html> lines has never caused me any trouble. I
> also occasionally use UTF-8 for *.txt files, and there I have actually
> found the BOM to be a help in making my browser and printer react the
> way I want them to.

It's definitely useful in HTML and the like, though I doubt it affects
your printer in any way...

> For concatenation of UTF-8 files (which I rarely use if ever) a U+FEFF
> codepoint somewhere in the middle MUST be interpreted as a zero-width
> no-break space, which is deprecated but legal and should not be a
> problem. If the presence of a zero-width no-break space at the start of
> a line other than the first creates problems, then I bet there are worse
> problems than that with either the file, the software handling it, or
> both. And if it is _not_ at the start of a line, then the preceding file
> was missing an end-of-line on its last line, which would have been a
> problem even without a U+FEFF after it.

I'll just point out here that vim does display zero-width non-breaking
spaces as <feff>.  Most text editors display it, I'm sure, which could
be, if nothing else, annoying.
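
If you'd rather find those stray U+FEFFs before your editor shows them
to you, something along these lines should work.  It is only a sketch:
count_feff_lines is a name invented for this example, and it simply
counts lines containing the bytes EF BB BF, so the legitimate BOM on
line 1 is counted too.

```shell
# Sketch only: count_feff_lines is a hypothetical helper name.
# Count the lines of FILE that contain the byte sequence EF BB BF.
count_feff_lines() {
    bom=$(printf '\357\273\277')
    LC_ALL=C grep -c "$bom" "$1"
}

demo_dir=$(mktemp -d)
printf '\357\273\277first file\n'  > "$demo_dir/a.txt"
printf '\357\273\277second file\n' > "$demo_dir/b.txt"

# Concatenating two BOM-prefixed files leaves a U+FEFF at the start of
# line 2 of the result.
cat "$demo_dir/a.txt" "$demo_dir/b.txt" > "$demo_dir/joined.txt"

count_feff_lines "$demo_dir/joined.txt"   # prints: 2
```

Anything above 1, or any hit that isn't on the first line, means a BOM
ended up somewhere other than the start of the file.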

> If most of your UTF-8 files are shell scripts or maybe C/C++ sources
> with Chinese literals and/or Chinese comments in them, then your
> requirements are other than mine, and quite possibly your solutions will
> be different too. You are entitled to your choices, but of course you
> should be conscious of what they imply, the way I try to remain
> conscious of what my choices imply.

Using BOMs in Linux isn't any different from using BOMs on any other
system.  They're a good thing for some types of files, and a bad thing
for others, and of course you should decide on a per-file basis whether
or not it's a reasonable thing to include for that file.

~Matt

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
