RE: MS/Unix BOM FAQ again (small fix)

2002-04-13 Thread George W Gerrity
At 23:27 -0700 2002-04-11, Doug Ewell wrote: George W Gerrity [EMAIL PROTECTED] wrote: To expand on this, imagine there is a text file in some encoding on some medium created by a little-endian machine (say a DEC Vax or a Macintosh 68000), and it is to be accessed on a big-endian machine

[OT] RE: MS/Unix BOM FAQ again (small fix)

2002-04-13 Thread Lars Kristan
George W Gerrity wrote: As for bytes - well, which is the most significant byte?! Your statement is wrong there. Transmission is done by memory address, not by significance. Low to high, of course. #10. Not quite correct, but see below. The fact that processor architectures store data

RE: [OT] RE: MS/Unix BOM FAQ again (small fix)

2002-04-13 Thread Lars Kristan
OK, I admit. I have lied. A little bit. I have not executed the test on a BE platform, I have simply typed what I knew the output would be. But I got away with it, heh heh. But now I did it for real, on both architecture types, because I got curious about the floats... So: #include <fcntl.h>

RE: MS/Unix BOM FAQ again (small fix)

2002-04-12 Thread Lars Kristan
George W Gerrity wrote: _All_ of these accessing methods are either bit-serial or byte-serial, transmitting the most significant bit of the most significant byte first, and the little/big-endian storage in the RAM receiving buffers is done correctly by the target machine. As for bits,

Re: MS/Unix BOM FAQ again (small fix)

2002-04-12 Thread Markus Scherer
George W Gerrity wrote: To expand on this, imagine there is a text file in some encoding on some medium created by a little-endian machine (say a DEC Vax or a Macintosh 68000), and it is to be accessed on a big-endian machine (any Intel 8080 -- Pentium architecture). Unless the two CPUs

Re: MS/Unix BOM FAQ again (small fix)

2002-04-12 Thread Andy Heninger
Just to set the historical record straight: Little-endian machines include DEC VAX (and PDP-11 before it) and Intel x86 (and 8080 before it). Big-endian machines include Macintosh, both 68000 and PowerPC. -- Andy Heninger [EMAIL PROTECTED] To expand on this, imagine there is

Endian Checker [Was: Re: MS/Unix BOM FAQ again (small fix)]

2002-04-12 Thread Dan Kogai
On Friday, April 12, 2002, at 10:38 , George W Gerrity wrote: To expand on this, imagine there is a text file in some encoding on some medium created by a little-endian machine (say a DEC Vax or a Macintosh 68000), and it is to be accessed on a big-endian machine (any Intel 8080 -- Pentium
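The checker posted in this message is not included in the excerpt above. As a rough stand-in, a byte-order check can be sketched in Python as follows; the function name `native_byte_order` is hypothetical and the approach is an assumption, not the script from the thread.

```python
import struct
import sys


def native_byte_order() -> str:
    """Return 'little' or 'big' by looking at how a 16-bit integer is laid out."""
    # "=H" packs an unsigned 16-bit value in the machine's native byte order.
    first_byte = struct.pack("=H", 0x0102)[0]
    # On a little-endian machine the low-order byte (0x02) is stored first.
    return "little" if first_byte == 0x02 else "big"


if __name__ == "__main__":
    # sys.byteorder is the interpreter's own answer, printed as a cross-check.
    print(native_byte_order(), sys.byteorder)
```

By Andy Heninger's list, this should report little on VAX and x86 and big on 68000 and PowerPC.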

Re: MS/Unix BOM FAQ again (small fix)

2002-04-11 Thread Doug Ewell
Mark Davis [EMAIL PROTECTED] wrote: - when one of the BOM-allowing UTFs starts with a BOM, you know the encoding*, and you strip off the BOM when you get the content. *assuming that no UTF-16 file has U+0000 as the first character. In the real world, this is a pretty good assumption --

Re: MS/Unix BOM FAQ again (small fix)

2002-04-11 Thread Otto Stolz
Doug Ewell wrote: As Shlomi points out, Microsoft products do not treat UTF-7 specially, except that IE recognizes the UTF-7 BOM and sets its encoding accordingly (but this is true for any UTF-7 sequence, not just the BOM; try loading a text file containing only the 11 ASCII characters

Re: MS/Unix BOM FAQ again (small fix)

2002-04-11 Thread Mark Davis
Mark Davis [EMAIL PROTECTED] wrote: - when one of the BOM-allowing UTFs starts with a BOM, you know the encoding*, and you strip off the BOM when you get the content. *assuming that no UTF-16 file has U+0000 as the first character. In the real world

RE: MS/Unix BOM FAQ again (small fix)

2002-04-11 Thread jarkko . hietaniemi
Mark Davis [EMAIL PROTECTED] wrote: - when one of the BOM-allowing UTFs starts with a BOM, you know the encoding*, and you strip off the BOM when you get the content. *assuming that no UTF-16 file has U+0000 as the first character. In the real world, this is a pretty good assumption

Re: MS/Unix BOM FAQ again (small fix)

2002-04-11 Thread George W Gerrity
This thread seems just about ended, and I don't want to be the person to revive it, but there have been numerous related topics in the past six months, and nothing in them answers the question that has been nagging me. The question is: Considering the difficulty of actually getting access to

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for use the UTF-16 byte

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
Rick Cameron wrote: So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE is unambiguous. If you

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread jarkko . hietaniemi
If you look for any Unicode signature, then you look for FF FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE). FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as text, but there's no knowing
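The checking order described here (longer signature first) and the ambiguity it leaves can be sketched in Python. This is only an illustration: `sniff_bom` and `_SIGNATURES` are hypothetical names, and the UTF-8 and UTF-32BE signatures are added for completeness.

```python
from typing import Optional, Tuple

# Longer signatures come first, so FF FE 00 00 is tried before FF FE.
_SIGNATURES = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int, bool]:
    """Return (guessed encoding, signature length, ambiguous flag)."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            # FF FE 00 00 may also be a UTF-16LE BOM followed by U+0000,
            # so the guess is flagged rather than silently trusted.
            ambiguous = signature == b"\xff\xfe\x00\x00"
            return name, len(signature), ambiguous
    return None, 0, False
```

The ambiguous flag leaves the final decision to the caller, which may have other information (a charset declaration, for instance) to settle the question.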

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
If you look for any Unicode signature, then you look for FF FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE). FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought

MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Shlomi Tal
A small fix for the FAQ; specifically, a fix for the typo/braino of construing 0x071F as little-endian 1F 70 instead of (the now fixed) 1F 07. Thanks to Wladislaw Vaintroub for pointing it out for me. --- BEGIN --- Microsoft Unicode Text File Byte Order Mark (BOM) FAQ by Shlomi Tal ([EMAIL
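The corrected byte order is easy to confirm. A minimal check in Python, using the standard `struct` module (the variable names are illustrative only):

```python
import struct

value = 0x071F
little = struct.pack("<H", value)  # least significant byte first
big = struct.pack(">H", value)     # most significant byte first

print(" ".join(f"{b:02X}" for b in little))  # 1F 07
print(" ".join(f"{b:02X}" for b in big))     # 07 1F
```

Byte swapping reverses whole bytes, not hex digits, which is why 1F 70 was wrong.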

Re: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Mark Davis
σαυτόν — Θαλῆς [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr] http://www.macchiato.com - Original Message - From: Shlomi Tal [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, April 09, 2002 10:43 Subject: MS/Unix BOM FAQ again (small fix) A small fix for the FAQ

RE: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Richard, Francois M
beginning with bytes FE FF:
- UTF-16 = big endian, omitted from contents
- UTF-16BE = ZWNBSP
- UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE = malformed, file corrupted
Isn't FF FE 00 00 a valid sequence for UTF-32LE?

Re: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Andy Heninger
It looks to me like Shlomi's chart and Mark's chart for interpreting the BOMs are describing slightly different situations. Mark's table assumes that you have the BOM and some other additional indication of the data's encoding - a charset= declaration, or an xml encoding declaration, or

RE: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Yves Arrouye
This is incorrect. Here is a summary of the meaning of those bytes at the start of text files with different Unicode encoding forms.
beginning with bytes FE FF:
- UTF-16 = big endian, omitted from contents
beginning with bytes FF FE:
- UTF-16 = little endian, omitted from contents
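The distinction in the summary (the BOM is omitted under a plain UTF-16 label, but is ordinary content under UTF-16BE) can be seen with Python's built-in codecs. They are used here only as a convenient stand-in; this is not a claim about how ICU or the Microsoft tools discussed in the thread behave.

```python
text = "A"
be_with_bom = b"\xfe\xff" + text.encode("utf-16-be")  # FE FF 00 41
le_with_bom = b"\xff\xfe" + text.encode("utf-16-le")  # FF FE 41 00

# "utf-16": the signature selects the byte order and is omitted from the contents.
from_be = be_with_bom.decode("utf-16")
from_le = le_with_bom.decode("utf-16")

# "utf-16-be": the byte order is already fixed by the label,
# so FE FF is kept as the character U+FEFF (ZWNBSP).
kept = be_with_bom.decode("utf-16-be")

print(repr(from_be), repr(from_le), repr(kept))
```

Both plain UTF-16 decodes yield just the letter, while the UTF-16BE decode yields U+FEFF followed by the letter.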

Re: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Mark Davis
It looks to me like Shlomi's chart and Mark's chart for interpreting the BOMs are describing slightly different situations. Mark's table assumes that you have the BOM and some other additional indication of the data's encoding

Re: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Doug Ewell
Shlomi Tal [EMAIL PROTECTED] wrote: Microsoft Unicode Text File Byte Order Mark (BOM) FAQ ... There is another, very common Unicode encoding scheme called UTF-8, which maps the Unicode repertoire into sequences of bytes. Since the order of bytes (as opposed to words of more than one byte)

Re: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Mark Davis
- Original Message - From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Tuesday, April 09, 2002 19:23 Subject: Re: MS/Unix BOM FAQ again (small fix) I agree, there are different ways to look at it. But the statement