Re: UTF-8N?

2000-06-28 Thread Doug Ewell
Asmus Freytag <[EMAIL PROTECTED]> wrote: >> Yes. The Unicode Standard will deprecate the use of U+FFEF (Note: not >> U+FFFE) as a zero-width non-breaking space (despite its formal name). >> >> And U+FFEF should *only* be used as a byte order mark and/or >> signature. (That is already ambiguous an

Re: UTF-8N?

2000-06-26 Thread Doug Ewell
Asmus Freytag <[EMAIL PROTECTED]> wrote: >> Yes. The Unicode Standard will deprecate the use of U+FFEF (Note: not >> U+FFFE) as a zero-width non-breaking space (despite its formal name). >> >> And U+FFEF should *only* be used as a byte order mark and/or >> signature. (That is already ambiguous an

Re: UTF-8N?

2000-06-26 Thread Asmus Freytag
At 05:29 AM 6/23/00 -0800, [EMAIL PROTECTED] wrote: > >Yes. The Unicode Standard will deprecate the use of U+FFEF (Note: not >U+FFFE) > >as a zero-width non-breaking space (despite its formal name). > > > >And U+FFEF should *only* be used as a byte order mark and/or signature. >(That > >is already

RE: UTF-8N?

2000-06-24 Thread John Cowan
On Fri, 23 Jun 2000, Preethi Balaji wrote: > BOM Byte Order Mark, the Unicode character U+FEFF. Because U+FFFE is permanently unassigned, U+FEFF can be used at the beginning of a Unicode file to mark it as big-endian or little-endian. If you read U+FFFE instead, you need to byte swap. > Is: P
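
Editorial sketch (not part of the original message): the byte-order check in the quoted definition could look roughly like this in Python. The function name `detect_utf16_byte_order` is hypothetical and chosen only for illustration. It inspects the first two bytes only and makes no attempt to guess when no U+FEFF is present.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading U+FEFF.

    Returns "big-endian", "little-endian", or "unknown" when the data
    does not start with a byte order mark.
    """
    prefix = data[:2]
    if prefix == b"\xfe\xff":
        # Bytes arrive as FE FF: U+FEFF read directly, no swap needed
        # for a big-endian reader.
        return "big-endian"
    if prefix == b"\xff\xfe":
        # A big-endian reader would see the noncharacter U+FFFE here,
        # which is the signal to byte swap.
        return "little-endian"
    return "unknown"
```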

RE: UTF-8N?

2000-06-24 Thread Jonathan Rosenne
> To: Unicode List > Cc: Unicode List > Subject: Re: UTF-8N? > > > At 10:54 PM 06/22/2000 -0800, Doug Ewell wrote: > >Now that Unicode plans to deprecate the use of U+FEFF as ZWNBSP, > >programs that *expect* UTF-8 instead of SBCS will be able to throw away > >

RE: UTF-8N?

2000-06-23 Thread Preethi Balaji
resending the message -Original Message- From: Preethi Balaji Sent: Friday, June 23, 2000 2:14 PM To: 'Kenneth Whistler'; Unicode List Cc: [EMAIL PROTECTED] Subject: RE: UTF-8N? Sorry to intrude, I am a new member to the group and was curious to know few terms being

Re: UTF-8N?

2000-06-23 Thread Kenneth Whistler
John Cowan wrote: > I think the implication is that the OS provides an interface to read > characters out of a text file, in which case BOM-eating BOMophagy, aka FEFFagy ;-) > (and masking the > difference between various text encodings) is very sensible. Historic > OSes have not had such an

Re: UTF-8N?

2000-06-23 Thread John Cowan
"Robert A. Rosenberg" wrote: > It would be very UNCool unless the application can tell the operating > system that it wants this done for it. Otherwise it will have no way of > KNOWING that the edited stream that the operating system is passing it IS > UTF-8 (and was so identified by the deleted

Re: UTF-8N?

2000-06-23 Thread Robert A. Rosenberg
At 10:54 PM 06/22/2000 -0800, Doug Ewell wrote: >Now that Unicode plans to deprecate the use of U+FEFF as ZWNBSP, >programs that *expect* UTF-8 instead of SBCS will be able to throw away >an initial U+FEFF with even greater confidence. It may even be possible >for operating system developers to b

Re: UTF-8N?

2000-06-23 Thread Markus Scherer
would it still be possible to disunify bom/signature (which would have to remain at feff) from zwnbsp? this seems to be the natural solution to this since then a signature character could always be ignored or stripped. markus

Re: UTF-8N?

2000-06-23 Thread Peter_Constable
Ken: >Yes. The Unicode Standard will deprecate the use of U+FFEF (Note: not U+FFFE) >as a zero-width non-breaking space (despite its formal name). > >And U+FFEF should *only* be used as a byte order mark and/or signature. (That >is already ambiguous and trouble enough -- without tossing in the o

Re: UTF-8N?

2000-06-23 Thread Peter_Constable
On 06/22/2000 10:54:35 PM <[EMAIL PROTECTED]> wrote: >Now that Unicode plans to deprecate the use of U+FEFF as ZWNBSP, programs that >*expect* UTF-8 instead of SBCS will be able to throw away an initial U+FEFF >with even greater confidence. It may even be possible for operating system >devel

Re: UTF-8N?

2000-06-23 Thread Doug Ewell
Kenneth Whistler <[EMAIL PROTECTED]> wrote: >> It all stems from the fact that U+FEFF is not only what is used for >> the BOM, but also a valid Unicode/ISO 10646 codepoint. The issue >> would be solved by deprecating the use of U+FEFF as a Unicode >> character (for example by defining a new code

Re: UTF-8N?

2000-06-22 Thread Kenneth Whistler
John Cowan wrote: > Kenneth Whistler wrote: > > > Now we are pushing through the long, bureaucratic process of getting > > this accepted into 10646-1, so it we maintain synchronicity with a > > joint publication of it as a *standard* character. > > So a fair statement of what you hope to achiev

Re: UTF-8N?

2000-06-22 Thread John Cowan
Kenneth Whistler wrote: > Now we are pushing through the long, bureaucratic process of getting > this accepted into 10646-1, so it we maintain synchronicity with a > joint publication of it as a *standard* character. So a fair statement of what you hope to achieve is: U+2060 will be the zero-wid

Re: UTF-8N?

2000-06-22 Thread Kenneth Whistler
Chris Fynn wrote: > [EMAIL PROTECTED] wrote: > > > ... I think the suggestion that BOM and ZWNBSP be > > de-unified, which I have heard before, may make the best sense. > > *If* that's the solution, it should be done yesterday. The longer it takes the > more implementations (and data) there wil

Re: UTF-8N?

2000-06-22 Thread Kenneth Whistler
Juliusz wrote: > The problem is not one of broken software. The problem is that, as > John Cowan explained in detail, with the addition of the BOM, UTF-8 > and UTF-16 become ambiguous. This is putting the cart before the horse. The U+FEFF BOM existed in Unicode 1.0, and was carried into ISO/IE

Re: UTF-8N?

2000-06-22 Thread John Cowan
"Ayers, Mike" wrote: > Am I reading this wrong? Here's what I get: > > I hand you a UTF-16 document. This document is: > > FE FF 00 48 00 65 00 6C 00 6C 00 6F > > ..so it says "Hello". Then I say, "Oh, by the way, that's > big-endian." *POOF* The content of the doc
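
Editorial sketch (not part of the original message): the effect described in the quoted example can be reproduced with Python's standard codecs, assuming they are an acceptable stand-in for the UTF-16 and UTF-16BE encoding schemes under discussion. The variable names are illustrative only.

```python
# The twelve bytes from the quoted example.
data = bytes.fromhex("FEFF00480065006C006C006F")

# Labelled "UTF-16": the leading FE FF is consumed as a byte order mark.
as_utf16 = data.decode("utf-16")

# Labelled "UTF-16BE": byte order is already known, so the leading
# FE FF is decoded as the character U+FEFF and stays in the text.
as_utf16be = data.decode("utf-16-be")
```

Under these assumptions `as_utf16` holds five characters and `as_utf16be` holds six, the first being U+FEFF.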

Re: UTF-8N?

2000-06-22 Thread John Cowan
Antoine Leca wrote: > Now I ask a slightly different question then. What is the name of the > encoding where the byte order is known (for example, any application > on an Intel machine that receives its data from the system, as opposed > to from the network or similar hazardous source), and where a

RE: UTF-8N?

2000-06-22 Thread Ayers, Mike
> > On 06/22/2000 02:24:49 AM <[EMAIL PROTECTED]> wrote: > > >It was my understanding that U+FEFF when received as first character > should be > >seen as BOM and not as a character, and handled accordingly. > > When the encoding scheme is known to be UTF-16BE or UTF-16LE, > it *must not* > be

Re: UTF-8N?

2000-06-22 Thread Antoine Leca
[EMAIL PROTECTED] wrote: > > On 06/22/2000 02:24:49 AM <[EMAIL PROTECTED]> wrote: > > >It was my understanding that U+FEFF when received as first character > should be > >seen as BOM and not as a character, and handled accordingly. > > When the encoding scheme is known to be UTF-16BE or UTF-16L

Re: UTF-8N?

2000-06-22 Thread Christopher John Fynn
[EMAIL PROTECTED] wrote: > ... I think the suggestion that BOM and ZWNBSP be > de-unified, which I have heard before, may make the best sense. *If* that's the solution, it should be done yesterday. The longer it takes the more implementations (and data) there will be that needs to be changed.

Re: UTF-8N?

2000-06-22 Thread Peter_Constable
On 06/21/2000 06:33:57 PM <[EMAIL PROTECTED]> wrote: >> The standard doesn't ever discuss the BOM in the context of UTF-8, > >See section 13.6 (page 324). Sure enough. Well, there you go: the confusion is officially sanctioned! Peter Constable

Re: UTF-8N?

2000-06-22 Thread Peter_Constable
On 06/22/2000 02:24:49 AM <[EMAIL PROTECTED]> wrote: >It was my understanding that U+FEFF when received as first character should be >seen as BOM and not as a character, and handled accordingly. When the encoding scheme is known to be UTF-16BE or UTF-16LE, it *must not* be interpreted as a BO

Re: UTF-8N?

2000-06-22 Thread Peter_Constable
On 06/21/2000 03:09:43 PM <[EMAIL PROTECTED]> wrote: >Appropriate or not, users (you know, those people who don't read the >documentation that the programmers don't write) will use text editors to split >files. They will then concatenate the files using a non-Unicode aware tool. >And they wi

Re: UTF-8N?

2000-06-22 Thread Antoine Leca
0xFF 0x00 0x20 ... > UTF-16LE: 0xFF 0xFE 0x20 0x00 ... > UTF-8N: 0xEF 0xBB 0xBF 0x20 ... > UTF-8B: 0xEF 0xBB 0xBF 0xEF 0xBB 0xBF 0x20 ... There is something I should have missed. It was my understanding that U+FEFF when received as first character should be seen as BOM and not as a charac

RE: UTF-8N?

2000-06-21 Thread Ayers, Mike
> (Who should I contact to register ``UCS-4PDP11'', the mixed-endian > form of UCS-4?) We want UTF-16PDP!!! Reminds me of a story - father, son, mule - told by some guy named Aesop... :-p /|/|ike

Re: UTF-8N?

2000-06-21 Thread Juliusz Chroboczek
(I've allowed myself to quote from a number of distinct posts.) DE> On the contrary, I thought Peter's point was that the OS (or the DE> split/ merge programs) should *not* make any special assumptions DE> about text files. Sorry if I wasn't clear. I was taking for granted that OSes will not re

Re: UTF-8N?

2000-06-21 Thread John Cowan
[EMAIL PROTECTED] wrote: > The BOM is explicitly not to be interpreted as part of > the text stream. D35 (U3, p47) states (at least for UTF-16): > > "The byte order mark is not considered part of the content of the text." Absolutely. What that means is that if there is a BOM, it is not transla

Re: UTF-8N?

2000-06-21 Thread Peter_Constable
and the same objections raised above still apply. >Without distinct labels UTF-8N and UTF-8B (or whatever), we cannot tell if the >byte sequence 0xEF 0xBB 0xBF 0x20 should be decoded as U+0020 or U+FEFF >U+0020. This is exactly analogous to the statement that without distinct >labels U
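
Editorial sketch (not part of the original message): the ambiguity in the quoted paragraph can be shown with two Python codecs. Treating `utf-8` as a stand-in for "UTF-8N" and `utf-8-sig` as a stand-in for "UTF-8B" is an assumption made for illustration; neither name from the thread is a Python codec name.

```python
# The byte sequence from the quoted paragraph.
raw = bytes([0xEF, 0xBB, 0xBF, 0x20])

# Reading without signature handling keeps U+FEFF as a character.
without_signature_handling = raw.decode("utf-8")

# Reading with signature handling strips one leading U+FEFF.
with_signature_handling = raw.decode("utf-8-sig")
```

The same four bytes therefore yield either U+FEFF U+0020 or just U+0020, depending only on which label the reader applies.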

Re: UTF-8N?

2000-06-21 Thread John Cowan
: 0x00 0x20 ... UTF-16LE: 0x20 0x00 ... UTF-8N: 0x20 ... UTF-8B: 0xEF 0xBB 0xBF 0x20 ... Now suppose we have a character sequence beginning with U+FEFF U+0020. This would be encoded as follows: US-ASCII: (not possible) UTF-16: 0xFE 0xFF 0xFE 0xFF 0x00 0x20 ... UTF-16: 0xFF 0xFE 0xFF 0xF

Re: UTF-8N?

2000-06-21 Thread Peter_Constable
Eh??? John, either I'm really missing your intent, or you're saying something that I know you don't mean. U+0020 in UTF-8 is always 0x20, whether or not the file begins with a BOM. While I haven't met you in person, I've learned enough about you by email that I'm pretty sure you know this already

Re: UTF-8N?

2000-06-21 Thread John Cowan
[EMAIL PROTECTED] wrote: > UTF-8 files both with and without a BOM serialize the character > representations into bytes (octets) in exactly the same way. That's the > basis for distinguishing between encoding schemes, and since there isn't a > difference, there is only one encoding scheme involve

Re: UTF-8N?

2000-06-21 Thread Peter_Constable
On 06/20/2000 08:20:53 PM <[EMAIL PROTECTED]> wrote: [snip] >It may be useful shorthand to define the term "UTF-8N" to refer to UTF-8 text >that does not begin with a BOM, and reserve the term "UTF-8" for text that >*does* begin with a BOM, "UTF-8"

Re: UTF-8N?

2000-06-20 Thread Doug Ewell
bsence of initial U+FEFF. That is, if my program insists that a certain file begin with the characters "#!" (U+0023 U+0021), and I want this file to be encoded in UTF-8, then I would want to modify the program so that the file could optionally begin with U+FEFF U+0023 U+0021 instead. It
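
Editorial sketch (not part of the original message): a check that accepts "#!" with or without a preceding UTF-8 signature might look like this. The function name `starts_with_shebang` is hypothetical, and the sketch assumes the file is read as raw bytes.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def starts_with_shebang(data: bytes) -> bool:
    """Return True if data begins with "#!", optionally after U+FEFF."""
    if data.startswith(UTF8_SIGNATURE):
        # Skip exactly one leading signature; a second U+FEFF would be
        # treated as content and make the check fail.
        data = data[len(UTF8_SIGNATURE):]
    return data.startswith(b"#!")
```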

Re: UTF-8N?

2000-06-20 Thread Peter_Constable
JC> At one point, I thought that with Unicode there would be only one JC>cross-platform encoding... Right now, it looks like there will be JC>at least 8 Unicode encodings, JC>Substantively, the only real encodings are UTF-8 and UTF-16. We are being a little sloppy in terminology, which isn't a

RE: UTF-8N?

2000-06-20 Thread Ayers, Mike
> From: Juliusz Chroboczek [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, June 20, 2000 12:02 PM > > Of course, no mismatch happens if the OS keeps track of file types. > Splitting in the octet manner a text/plain file leads to two > octet-stream files, and the OS should ensure that you cannot merge

Re: UTF-8N?

2000-06-20 Thread John Cowan
. > > Later, I thought that there would be two Unicode encodings, the ones > that are now called UTF-16BE and UTF-8N. I was prepared to live with > that. > > Right now, it looks like there will be at least 8 Unicode encodings, > at least 4 of which will be in common use (big-

Re: UTF-8N?

2000-06-20 Thread Juliusz Chroboczek
plain text file from a Mac and a plain text file from a Windows machine would be the same thing (up to some uninteresting variations in line ending). Later, I thought that there would be two Unicode encodings, the ones that are now called UTF-16BE and UTF-8N. I was prepared to live with that. Ri

Re: UTF-8N?

2000-06-20 Thread Peter_Constable
MD>In XML, this situation does not arise, since it specifies the exact usage of BOM, but it can arise in other circumstances. Another recent thread suggests that the situation with BOM and XML is, in fact, *not* clear. >AL> I understand there is no way to know whether you SHALL/SHOULD/MAY A

Re: UTF-8N?

2000-06-20 Thread John Cowan
Juliusz Chroboczek wrote: > Later on, you merge the two files, and compute the checksum of the > concatenated file. If the program used for splitting inserted a BOM, > but the program used for merging didn't remove it, the checksum > comparison is going to fail. Even worse: If the split point
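
Editorial sketch (not part of the original message): the checksum failure in the quoted scenario can be reproduced in a few lines. The splitter that stamps a signature on the second piece is hypothetical, as are the helper names `sha256_hex`, `naive_merge` and `signature_aware_merge`. SHA-256 is used only as a convenient checksum.

```python
import hashlib

UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def sha256_hex(data: bytes) -> str:
    """Checksum used to compare the original and the merged file."""
    return hashlib.sha256(data).hexdigest()


def naive_merge(parts):
    """Concatenate the pieces byte for byte, as a non-Unicode tool would."""
    return b"".join(parts)


def signature_aware_merge(parts):
    """Concatenate, dropping a leading signature from every later piece."""
    merged = [parts[0]]
    for part in parts[1:]:
        if part.startswith(UTF8_SIGNATURE):
            part = part[len(UTF8_SIGNATURE):]
        merged.append(part)
    return b"".join(merged)


original = "plain text, split in two".encode("utf-8")
first, second = original[:10], original[10:]

# Hypothetical splitter: it adds a signature to the second piece.
pieces = [first, UTF8_SIGNATURE + second]
```

With these pieces, `naive_merge` produces a file whose checksum differs from the original, while `signature_aware_merge` restores it. The second function would also delete a U+FEFF that was genuinely part of the text at the split point, which is the ambiguity the thread is about.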

Re: UTF-8N?

2000-06-20 Thread Juliusz Chroboczek
AL> I understand there is no way to know whether you SHALL/SHOULD/MAY AL> delete it or not, but I fail to see the danger: BOM (well, ZWNBSP) AL> cannot carry any useful meaning when it appears at the beginning AL> of a text, can it? So what can be the problem? You have a large plain-text Unicode

RE: UTF-8N?

2000-06-20 Thread Michael Kaplan (Trigeminal Inc.)
:[EMAIL PROTECTED]] > Sent: Tuesday, June 20, 2000 8:56 AM > To: Unicode List > Cc: Unicode List > Subject: Re: UTF-8N? > > Mark Davis wrote: > > > > The reason I make that notational distinction in the text is that there > is a danger > >

Re: UTF-8N?

2000-06-20 Thread Antoine Leca
Mark Davis wrote: > > The reason I make that notational distinction in the text is that there is a danger > with UTF-8 currently: BOM can be used with it, and some people do. Since, unlike > the case of UTF-16 / UTF-16BE / UTF-16LE, there is no way to distinguish between > implementations that al

Re: UTF-8N?

2000-06-20 Thread Mark Davis
I want to make sure that people are not misled by that paper. There is a note below that section that: "Note: The italicized names are not yet registered, but are useful for reference." and "UTF-8N" is italicized. It is not a registered name, and should not be used outsid

UTF-8N?

2000-06-19 Thread Masahiko Maedera
I found UTF-8N at the following URL. www-4.ibm.com/software/developer/library/utfencodingforms/index.html I understand the meaning and the format of UTF-8N, but I am not sure how it will be treated in the future. Does anyone have a plan to register a new charset UTF-8N, or any other inform