would produce
internal ZWNBSPs is not part of any of our processing as far as I know.
-----Original Message-----
From: David Starner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 07, 2002 12:14 PM
To: Markus Scherer
Cc: unicode
Subject: Re: Names for UTF-8 with and without BOM - pragmatic
O
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with
> signatures (BOMs) very simply, gracefully, and successfully. It has applied
> what I called the "pragmatic" approach here for about 10 years. It just
> works.
> Initial for each piece, as each is assumed to be a complete
> text file before concatenation. Nothing
> prevents copy/cp/cat and other commands from recognizing
> Unicode signatures, for as long as they
> don't claim to preserve initial U+FEFF.
Yes there is, in a formal sense, for cat and c
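For concreteness, the behaviour Markus describes above (treating an initial U+FEFF in each piece as a signature rather than as content) could be sketched roughly as follows in Python. The name concat_utf8 and the keep_signature flag are made up for illustration; this is not how any real cat or copy behaves.

```python
import codecs
from typing import Iterable

def concat_utf8(pieces: Iterable[bytes], keep_signature: bool = False) -> bytes:
    """Join UTF-8 pieces, dropping the initial signature (EF BB BF) of each piece.

    Hypothetical helper: each piece is assumed to be a complete text file,
    so only a *leading* U+FEFF is treated as a signature.
    """
    bodies = []
    for piece in pieces:
        if piece.startswith(codecs.BOM_UTF8):
            piece = piece[len(codecs.BOM_UTF8):]
        bodies.append(piece)
    joined = b"".join(bodies)
    # Optionally put a single signature back at the very start of the result.
    return codecs.BOM_UTF8 + joined if keep_signature else joined
```

A tool written this way could not also claim to preserve initial U+FEFF, which is the condition Markus attaches.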
Lars Kristan wrote:
Markus Scherer wrote:
If software claims that it does not modify the contents of a
document *except* for initial U+FEFF
then it can do with initial U+FEFF what it wants. If the
whole discussion hinges on what is allowed
if software claims to not modify text then one need
> True, UTF-16 files do need a signature.
Eh, no! "UTF-16BE" and "UTF-16LE" files (or whatever kind of text
data element) do not have any signature/BOM. Not even files (somehow)
labelled "UTF-16" need have a signature/BOM, without a BOM they are
then the same as if it was labelled "UTF-16BE".
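A small Python sketch of the rule stated above, assuming the data arrives labelled plain "UTF-16": honour a BOM if one is present, otherwise read it as big-endian. The function name is hypothetical. The byte order is handled by hand because, as far as I know, Python's own "utf-16" codec falls back to the platform's native byte order when there is no BOM.

```python
import codecs

def decode_utf16_labelled(data: bytes) -> str:
    """Decode data labelled "UTF-16": use the BOM if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No signature: same as if the data had been labelled "UTF-16BE".
    return data.decode("utf-16-be")
```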
Lars Kristan wrote:
> > .txtUTF-8 require We want plain text files to
> > have BOM to distinguish
> > from legacy codepage files
>
> Hmm, what does "plain" mean?! Perhaps files with a BOM
> should be called "text" files (or .txt files;) as
> opp
Markus Scherer wrote:
> If software claims that it does not modify the contents of a
> document *except* for initial U+FEFF
> then it can do with initial U+FEFF what it wants. If the
> whole discussion hinges on what is allowed
> if software claims to not modify text then one need
> not claim
Mark Davis wrote:
Little probability that right double quote would appear at the start of a
document either. Doesn't mean that you are free to delete it (*and* say that
you are not modifying the contents).
This points to a pragmatic way to deal with this issue:
If software claims that it does n
"Unicode Mailing List"
<[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM
> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM
> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
> are being wasted. :-)
>
> > Little probabilit
Mark Davis wrote:
> Little probability that right double quote would appear at the start
> of a document either. Doesn't mean that you are free to delete it
> (*and* say that you are not modifying the contents).
True, but right double quote:
(a) has a visible glyph with a well-defined human-rea
From: "Mark Davis" <[EMAIL PROTECTED]>
Ironic that for the purpose of dealing with THREE bytes that so many bytes
are being wasted. :-)
> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you
"Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent"
<[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 04:18
Subject: Re: Names for UTF-8 with and without BOM
> From: "Mark D
"Murray Sargent"
<[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM
> Mark Davis wrote:
>
> > That is not sufficient. The first three bytes could represent a real
>
[EMAIL PROTECTED] scripsit:
> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.
There is another argument, viz. ISO/IEC 10646, which plainly proclaims
that the 8-BOM is a vali
From: <[EMAIL PROTECTED]>
> In particular, I'm thinking of a situation about a year and a half ago
> (IIRC) in which Michael (and I and others) were strongly opposed to a
> suggestion that the Unicode Consortium should document a certain variation
> (perversion, some would say) of one of the Unico
On 11/02/2002 12:15:54 PM "Michael \(michka\) Kaplan" wrote:
>> .xml UTF-8N Some XML processors may not cope with BOM
>
>Maybe they need to upgrade? Since people often edit the files in notepad,
>many files are going to have it. A parser that cannot accept this reality is
>not going to make it ve
On 11/02/2002 11:59:24 AM "Joseph Boyle" wrote:
>The first time I thought of UTF-8Y it sounded too flippant, but actually it
>is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue
>of being short.
UTF-8Y (and UTF-8J) is not at all intuitive. "UTF-8-yuk"? The better
counte
John Cowan wrote:
>
> Tex Texin scripsit:
>
> > Interestingly, although I didn't study it in detail, looking at rfc 2376
> > for prioritization over charset conflicts, it seems to recommend
> > stripping the BOM when converting from utf-16 to other charsets (and
> > without considering that ucs
Doug,
Doug Ewell wrote:
>
> Tex Texin wrote:
>
> > However, I didn't realize that parsers were to allow for the
> > possibility of different signatures.
> > So a parser has to worry about scsu signatures, etc
>
> A parser only *has* to read UTF-8 without signature and UTF-16 with
> signatu
Tex Texin scripsit:
> Interestingly, although I didn't study it in detail, looking at rfc 2376
> for prioritization over charset conflicts, it seems to recommend
> stripping the BOM when converting from utf-16 to other charsets (and
> without considering that ucs-4 would like to keep it). (section
John Cowan wrote:
>
> Tex Texin scripsit:
>
> > So when the parser gets JOECODE, I can understand ignoring the signature
> > and autodetection, but exactly how does it find the first "<"?
>
> Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
> be UTF-32 big-endian, but we
Tex Texin wrote:
> However, I didn't realize that parsers were to allow for the
> possibility of different signatures.
> So a parser has to worry about scsu signatures, etc
A parser only *has* to read UTF-8 without signature and UTF-16 with
signature. It *may* read other encodings of its ow
unrealistic one.
MichKa
----- Original Message -----
From: "Tex Texin" <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 11:0
Tex Texin scripsit:
> So when the parser gets JOECODE, I can understand ignoring the signature
> and autodetection, but exactly how does it find the first "<"?
Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
be UTF-32 big-endian, but we'll suppose the parser can't handle
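The first-bytes reasoning above can be sketched in Python along these lines. This is only an illustration of signature sniffing, not the XML specification's full autodetection procedure; the function name and the returned labels are invented for the example.

```python
import codecs

def sniff_signature(head: bytes) -> str:
    """Classify the first few bytes of a would-be XML document."""
    # UTF-32 is checked first: FF FE 00 00 also begins with the UTF-16LE signature.
    if head.startswith(codecs.BOM_UTF32_BE) or head.startswith(codecs.BOM_UTF32_LE):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 with signature"
    if head.startswith(codecs.BOM_UTF16_BE) or head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"
    if head.startswith(b"\x00"):
        # A leading 00 byte rules out UTF-8 and signed UTF-16.
        return "not utf-8 or utf-16"
    if head.startswith(b"<"):
        return "ascii-compatible; read the encoding declaration"
    return "unknown"
```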
John,
I understand the flexibility of XML to use different encodings.
However, I didn't realize that parsers were to allow for the possibility
of different signatures.
So a parser has to worry about scsu signatures, etc
Whereas XML is so fussy about which characters it accepts, I am
surprised
Hi John,
I meant the character "<".
As for notepad, what I should have either stated more completely or bit
my tongue, is that where there is a standard in place (and where it is
unambiguous) the mistakes of particular products shouldn't hold sway,
unless they are tantamount to a de facto standard
Tex Texin scripsit:
> However, that leaves open the question whether only the Unicode
> transform signatures are acceptable or other signatures are also
> allowed. So if a vendor defines a code page, and defines a signature
> (perhaps mapping BOM/ZWNSP specifically to some code point or byte
> str
Tex Texin scripsit:
> I didn't think the XML standard allowed for utf-8 files to have a BOM.
This capability was never actually excluded, and was added by erratum
(and force-majeure, when it became clear that BOMful UTF-8 was going to
start becoming common). XML files are intended to be plain te
Thanks Doug. I had looked at the standard not at the appendix.
I think that (non-normative) appendix is unfortunate. It seems to imply
(to my mind) that if other character sets define BOMs that it is ok to
use them as XML signatures.
My reasoning is that the standard itself only says that UTF-16
Tex Texin wrote:
> I didn't think the XML standard allowed for utf-8 files to have a BOM.
> The standard is quite clear about requiring 0xFEFF for utf-16.
> I would have thought a proper parser would reject a non-utf-16 file
> beginning with something other than "<".
The standard explicitly allo
Mark Davis wrote:
> That is not sufficient. The first three bytes could represent a real
> content character, ZWNBSP or they could be a BOM. The label doesn't
> tell you.
I have never understood under what circumstances a ZWNBSP would ever
appear as the first character of a file. It wouldn't ma
"Michael (michka) Kaplan" wrote:
> > .xml UTF-8N Some XML processors may not cope with BOM
>
> Maybe they need to upgrade? Since people often edit the files in notepad,
> many files are going to have it. A parser that cannot accept this reality is
> not going to make it very long.
I didn't think
From: "Joseph Boyle" <[EMAIL PROTECTED]>
> These are listed as examples to demonstrate the idea of a configuration file
> listing encoding constraints. The fact that each constraint is arguable is a
> good reason to make the constraints configurable, and therefore to have
> names to distinguish BO
-
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 10:16 AM
To: Joseph Boyle; Mark Davis; Murray Sargent
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM
From: "Joseph Boyle" <[EMAIL PROTECTED]>
> Type Enco
From: "Joseph Boyle" <[EMAIL PROTECTED]>
> Type  Encoding  Comment
> .txt  UTF-8BOM  We want plain text files to have BOM to distinguish
>                 from legacy codepage files
Not really required, but optional -- the performance hit of making sure it's
valid UTF-8 is pretty minor. But people do open some *huge*
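The alternative to relying on a BOM, i.e. checking that the bytes are valid UTF-8, might look like this in Python. A minimal sketch; the helper name is hypothetical, and it reads the whole buffer, which is where the cost on very large files comes from.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole buffer decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII passes this check as well, so it cannot by itself separate UTF-8 from an ASCII-only legacy codepage file.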
Overington@;ngo.globalnet.co.uk]
Sent: Friday, November 01, 2002 10:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM
As you have UTF-8N where the N stands for the word "no" one could possibly
have UTF-8Y where the Y stands for the word "yes"
-
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 4:18 AM
To: Mark Davis; Murray Sargent; Joseph Boyle
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM
From: "Mark Davis" <[EMAIL PROTECTED]>
> That is
From: "Mark Davis" <[EMAIL PROTECTED]>
> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.
There are several problems with this supposition -- most notably the fact
that there are cases that specifi
"Murray Sargent" <[EMAIL PROTECTED]>
To: "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM
> Joseph Boyle says: "It would be useful to have official names to
>
As you have UTF-8N where the N stands for the word "no" one could possibly
have UTF-8Y where the Y stands for the word "yes".
Thus one could have the name of the format answering, or not answering, the
following question.
Is there a BOM encoded?
However, using the letter Y has three disadvantage
Joseph Boyle says: "It would be useful to have official names to
distinguish UTF-8 with and without BOM."
To see if a UTF-8 file has no BOM, you can just look at the first three
bytes. Is this a problem? Typically when you care about a file's
encoding form, you plan to read the file.
Thanks
Murra
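The three-byte check described above is short in Python. A minimal sketch; has_utf8_bom is a made-up name, while codecs.BOM_UTF8 is the standard-library constant for the bytes EF BB BF.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)
```

As Mark Davis objects elsewhere in the thread, the bytes alone do not say whether they are a signature or a content ZWNBSP; the check only reports that they are there.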
> Perhaps it
> is time to think of three other words starting with B, O, M that make a
> better explanation.)
Bollixed Operational Muddle ;-)
--Ken
It would be useful to have official names to distinguish UTF-8 with and
without BOM. (or, with, without, and agnostic) Here are a couple of examples
I'm currently involved with:
* I'm writing an encoding checker to validate a long list of text file
formats we use internally. HTML and XML only coun