RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-20 Thread Joseph Boyle
would produce internal ZWNBSPs is not part of any of our processing as far as I know. -Original Message- From: David Starner [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 07, 2002 12:14 PM To: Markus Scherer Cc: unicode Subject: Re: Names for UTF-8 with and without BOM - pragmatic O

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread David Starner
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote: > The fact is that Windows uses UTF-8 and UTF-16 plain text files with > signatures (BOMs) very simply, gracefully, and successfully. It has applied > what I called the "pragmatic" approach here for about 10 years. It just > works.

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread Kent Karlsson
> Initial for each piece, as each is assumed to be a complete > text file before concatenation. Nothing > prevents copy/cp/cat and other commands from recognizing > Unicode signatures, for as long as they > don't claim to preserve initial U+FEFF. Yes there is, in a formal sense, for cat and c

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Markus Scherer
Lars Kristan wrote: Markus Scherer wrote: If software claims that it does not modify the contents of a document *except* for initial U+FEFF then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed if software claims to not modify text then one need

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Kent Karlsson
> True, UTF-16 files do need a signature. Eh, no! "UTF-16BE" and "UTF-16LE" files (or whatever kind of text data element) do not have any signature/BOM. Not even files (somehow) labelled "UTF-16" need have a signature/BOM, without a BOM they are then the same as if it was labelled "UTF-16BE".
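
The rule Kent states (data labelled "UTF-16" with no BOM is read as big-endian) can be sketched as below. This is only an illustration: the function name decode_utf16_labelled is hypothetical, and it is written out by hand because Python's built-in "utf-16" codec falls back to the platform's native byte order when no BOM is present, not to big-endian.

```python
import codecs


def decode_utf16_labelled(data: bytes) -> str:
    """Decode data labelled plain "UTF-16" (hypothetical helper).

    A leading BOM selects the byte order and is dropped from the result;
    with no BOM the data is treated exactly like "UTF-16BE".
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return data[2:].decode("utf-16-le")
    # No signature: same as if it had been labelled UTF-16BE.
    return data.decode("utf-16-be")
```

Data labelled "UTF-16BE" or "UTF-16LE" would skip the BOM test entirely, since those labels carry no signature.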

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Marco Cimarosti
Lars Kristan wrote: > > .txtUTF-8 require We want plain text files to > > have BOM to distinguish > > from legacy codepage files > > H, what does "plain" mean?! Perhaps files with a BOM > should be called "text" files (or .txt files;) as > opp

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Lars Kristan
Markus Scherer wrote: > If software claims that it does not modify the contents of a > document *except* for initial U+FEFF > then it can do with initial U+FEFF what it wants. If the > whole discussion hinges on what is allowed > if software claims to not modify text then one need > not claim

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-05 Thread Markus Scherer
Mark Davis wrote: Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents). This points to a pragmatic way to deal with this issue: If software claims that it does n

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
"Unicode Mailing List" <[EMAIL PROTECTED]> Sent: Sunday, November 03, 2002 13:02 Subject: Re: Names for UTF-8 with and without BOM > From: "Mark Davis" <[EMAIL PROTECTED]> > > Ironic that for the purpose of dealing with THREE bytes that so many bytes >

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
Sent: Sunday, November 03, 2002 13:02 Subject: Re: Names for UTF-8 with and without BOM > From: "Mark Davis" <[EMAIL PROTECTED]> > > Ironic that for the purpose of dealing with THREE bytes that so many bytes > are being wasted. :-) > > > Little probabilit

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Doug Ewell
Mark Davis wrote: > Little probability that right double quote would appear at the start > of a document either. Doesn't mean that you are free to delete it > (*and* say that you are not modifying the contents). True, but right double quote: (a) has a visible glyph with a well-defined human-rea

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael (michka) Kaplan
From: "Mark Davis" <[EMAIL PROTECTED]> Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-) > Little probability that right double quote would appear at the start of a > document either. Doesn't mean that you are free to delete it (*and* say that > you

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
"Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Saturday, November 02, 2002 04:18 Subject: Re: Names for UTF-8 with and without BOM > From: "Mark D

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
"Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]> Sent: Saturday, November 02, 2002 13:27 Subject: Re: Names for UTF-8 with and without BOM > Mark Davis wrote: > > > That is not sufficient. The first three bytes could represent a real >

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread John Cowan
[EMAIL PROTECTED] scripsit: > I find it interesting, then, to see Michael saying that, since Notepad > sticks a BOM-cum-signature at the start of its UTF-8, the rest of the > world should support it. There is another argument, viz. ISO/IEC 10646, which plainly proclaims that the 8-BOM is a vali

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael (michka) Kaplan
From: <[EMAIL PROTECTED]> > In particular, I'm thinking of a situation about a year and a half ago > (IIRC) in which Michael (and I and others) were strongly opposed to a > suggestion that the Unicode Consortium should document a certain variation > (perversion, some would say) of one of the Unico

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 12:15:54 PM "Michael (michka) Kaplan" wrote: >> .xml UTF-8N Some XML processors may not cope with BOM > >Maybe they need to upgrade? Since people often edit the files in notepad, >many files are going to have it. A parser that cannot accept this reality is >not going to make it ve

RE: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 11:59:24 AM "Joseph Boyle" wrote: >The first time I thought of UTF-8Y it sounded too flippant, but actually it >is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue >of being short. UTF-8Y (and UTF-8J) is not at all intuitive. "UTF-8-yuk"? The better counte

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John Cowan wrote: > > Tex Texin scripsit: > > > Interestingly, although I didn't study it in detail, looking at rfc 2376 > > for prioritization over charset conflicts, it seems to recommend > > stripping the BOM when converting from utf-16 to other charsets (and > > without considering that ucs

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Doug, Doug Ewell wrote: > > Tex Texin wrote: > > > However, I didn't realize that parsers were to allow for the > > possibility of different signatures. > > So a parser has to worry about scsu signatures, etc > > A parser only *has* to read UTF-8 without signature and UTF-16 with > signatu

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: > Interestingly, although I didn't study it in detail, looking at rfc 2376 > for prioritization over charset conflicts, it seems to recommend > stripping the BOM when converting from utf-16 to other charsets (and > without considering that ucs-4 would like to keep it). (section

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John Cowan wrote: > > Tex Texin scripsit: > > > So when the parser gets JOECODE, I can understand ignoring the signature > > and autodetection, but exactly how does it find the first "<"? > > Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might > be UTF-32 big-endian, but we

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Tex Texin wrote: > However, I didn't realize that parsers were to allow for the > possibility of different signatures. > So a parser has to worry about scsu signatures, etc A parser only *has* to read UTF-8 without signature and UTF-16 with signature. It *may* read other encodings of its ow

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
unrealistic one. MichKa - Original Message - From: "Tex Texin" <[EMAIL PROTECTED]> To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]> Cc: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Saturday, November 02, 2002 11:0

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: > So when the parser gets JOECODE, I can understand ignoring the signature > and autodetection, but exactly how does it find the first "<"? Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might be UTF-32 big-endian, but we'll suppose the parser can't handle
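
The first-bytes reasoning John starts here can be sketched roughly as follows. It is an illustration in the spirit of the non-normative autodetection appendix of XML 1.0, not a complete detector: the function name sniff_xml_encoding and the returned labels are made up for this example, and an ASCII-compatible result still requires reading the encoding declaration to learn the actual charset.

```python
import codecs


def sniff_xml_encoding(head: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity."""
    # Signatures first; 4-byte BOMs before 2-byte ones, because the
    # UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
    if head.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be (signature)"
    if head.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le (signature)"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 (signature)"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be (signature)"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le (signature)"
    # No signature: look at how "<?" is laid out in the first four bytes.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be (no signature)"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le (no signature)"
    if head.startswith(b"<?xm"):
        return "ascii-compatible, read the encoding declaration"
    return "unknown, assume utf-8 unless told otherwise"
```

For a vendor encoding such as the hypothetical JOECODE, the ASCII-compatible branch is the only one that can fire, which is why the declaration has to be readable before the real decoder is chosen.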

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John, I understand the flexibility of XML to use different encodings. However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc Whereas XML is so fussy about which characters it accepts, I am surprised

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Hi John, I meant the character "<". As for notepad, what I should have either stated more completely or bit my tongue, is that where there is a standard in place (and where it is unambiguous) the mistakes of particular products shouldn't hold sway, unless they are tantamount to a de facto standard

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: > However, that leaves open the question whether only the Unicode > transform signatures are acceptable or other signatures are also > allowed. So if a vendor defines a code page, and defines a signature > (perhaps mapping BOM/ZWNSP specifically to some code point or byte > str

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: > I didn't think the XML standard allowed for utf-8 files to have a BOM. This capability was never actually excluded, and was added by erratum (and force-majeure, when it became clear that BOMful UTF-8 was going to start becoming common). XML files are intended to be plain te

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Thanks Doug. I had looked at the standard not at the appendix. I think that (non-normative) appendix is unfortunate. It seems to imply (to my mind) that if other character sets define BOMs that it is ok to use them as XML signatures. My reasoning is that the standard itself only says that UTF-16

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Tex Texin wrote: > I didn't think the XML standard allowed for utf-8 files to have a BOM. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". The standard explicitly allo

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Mark Davis wrote: > That is not sufficient. The first three bytes could represent a real > content character, ZWNBSP or they could be a BOM. The label doesn't > tell you. I have never understood under what circumstances a ZWNBSP would ever appear as the first character of a file. It wouldn't ma

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
"Michael (michka) Kaplan" wrote: > > .xml UTF-8N Some XML processors may not cope with BOM > > Maybe they need to upgrade? Since people often edit the files in notepad, > many files are going to have it. A parser that cannot accept this reality is > not going to make it very long. I didn't think

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: "Joseph Boyle" <[EMAIL PROTECTED]> > These are listed as examples to demonstrate the idea of a configuration file > listing encoding constraints. The fact that each constraint is arguable is a > good reason to make the constraints configurable, and therefore to have > names to distinguish BO

RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
- From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com] Sent: Saturday, November 02, 2002 10:16 AM To: Joseph Boyle; Mark Davis; Murray Sargent Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM From: "Joseph Boyle" <[EMAIL PROTECTED]> > Type Enco

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: "Joseph Boyle" <[EMAIL PROTECTED]> > Type Encoding Comment > .txt UTF-8BOM We want plain text files to have BOM to distinguish > from legacy codepage files Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge*

RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
Overington@;ngo.globalnet.co.uk] Sent: Friday, November 01, 2002 10:37 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM As you have UTF-8N where the N stands for the word "no" one could possibly have UTF-8Y where the Y stands for the word "yes"

RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
- From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com] Sent: Saturday, November 02, 2002 4:18 AM To: Mark Davis; Murray Sargent; Joseph Boyle Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM From: "Mark Davis" <[EMAIL PROTECTED]> > That is

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: "Mark Davis" <[EMAIL PROTECTED]> > That is not sufficient. The first three bytes could represent a real content > character, ZWNBSP or they could be a BOM. The label doesn't tell you. There are several problems with this supposition -- most notably the fact that there are cases that specifi

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Mark Davis
"Murray Sargent" <[EMAIL PROTECTED]> To: "Joseph Boyle" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Friday, November 01, 2002 12:42 Subject: RE: Names for UTF-8 with and without BOM > Joseph Boyle says: "It would be useful to have official names to >

Re: Names for UTF-8 with and without BOM

2002-11-01 Thread William Overington
As you have UTF-8N where the N stands for the word "no" one could possibly have UTF-8Y where the Y stands for the word "yes". Thus one could have the name of the format answering, or not answering, the following question. Is there a BOM encoded? However, using the letter Y has three disadvantage

RE: Names for UTF-8 with and without BOM

2002-11-01 Thread Murray Sargent
Joseph Boyle says: "It would be useful to have official names to distinguish UTF-8 with and without BOM." To see if a UTF-8 file has no BOM, you can just look at the first three bytes. Is this a problem? Typically when you care about a file's encoding form, you plan to read the file. Thanks Murra
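
A minimal sketch of the three-byte check Murray describes, assuming the start of the file has already been read into memory. The helper name has_utf8_signature is hypothetical. As Mark Davis objects elsewhere in the thread, a match only shows that the bytes EF BB BF are present; it does not say whether they were meant as a signature or as a content ZWNBSP.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if data begins with EF BB BF, the UTF-8 form of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, if one is present."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A checker driven by a per-file-type table, as Joseph proposes, could call the first function and compare the result with whatever that file type is configured to require.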

Re: Names for UTF-8 with and without BOM

2002-11-01 Thread Kenneth Whistler
> Perhaps it > is time to think of three other words starting with B, O, M that make a > better explanation.) Bollixed Operational Muddle ;-) --Ken

Names for UTF-8 with and without BOM

2002-11-01 Thread Joseph Boyle
It would be useful to have official names to distinguish UTF-8 with and without BOM. (or, with, without, and agnostic) Here are a couple of examples I'm currently involved with: * I'm writing an encoding checker to validate a long list of text file formats we use internally. HTML and XML only coun