At 23:27 -0700 2002-04-11, Doug Ewell wrote:
George W Gerrity [EMAIL PROTECTED] wrote:
To expand on this, imagine there is a text file in some encoding on
some medium created by a little-endian machine (say a DEC Vax or a
Macintosh 68000), and it is to be accessed on a big-endian machine
George W Gerrity wrote:
As for bytes - well, which is the most significant byte?! Your
statement is wrong there. Transmission is done by memory address,
not by significance. Low to high, of course.
#10. Not quite correct, but see below.
The fact that processor architectures store data
OK, I admit. I have lied. A little bit. I have not executed the test on a BE
platform, I have simply typed what I knew the output would be. But I got
away with it, heh heh.
But now I did it for real, on both architecture types, because I got curious
about the floats...
So:
#include <fcntl.h>
George W Gerrity wrote:
_All_ of these accessing methods are
either bit-serial or byte-serial, transmitting the most significant
bit of the most significant byte first, and the little/big-endian
storage in the RAM receiving buffers is done correctly by the target
machine.
As for bits,
George W Gerrity wrote:
To expand on this, imagine there is a text file in some encoding on some
medium created by a little-endian machine (say a DEC Vax or a Macintosh
68000), and it is to be accessed on a big-endian machine (any Intel 8080
-- Pentium architecture). Unless the two CPUs
Just to set the historical record straight,
Little Endian Machines include
DEC VAX (and PDP-11 before it)
Intel x86 (and 8080 before it)
Big Endian Machines include
Macintosh, both 68000 and PowerPC
-- Andy Heninger
[EMAIL PROTECTED]
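A minimal Python sketch, not taken from the thread, of what the two byte orders listed above mean for a single 16-bit value. The value 0xFEFF is chosen here only because it is the BOM code point; `struct` and `sys.byteorder` are Python standard-library names.

```python
import struct
import sys

value = 0xFEFF  # the BOM code point, used here only as a convenient 16-bit value

# "<" forces little-endian layout (VAX, PDP-11, x86): low-order byte first.
little = struct.pack("<H", value)

# ">" forces big-endian layout (68000, PowerPC): high-order byte first.
big = struct.pack(">H", value)

print(little.hex(" "))  # ff fe
print(big.hex(" "))     # fe ff
print(sys.byteorder)    # "little" or "big", depending on the host
```

The two results are exactly the byte pairs FF FE and FE FF that the BOM discussion in this thread revolves around.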
To expand on this, imagine there is
On Friday, April 12, 2002, at 10:38 , George W Gerrity wrote:
To expand on this, imagine there is a text file in some encoding on
some medium created by a little-endian machine (say a DEC Vax or a
Macintosh 68000), and it is to be accessed on a big-endian machine (any
Intel 8080 -- Pentium
Mark Davis [EMAIL PROTECTED] wrote:
- when one of the BOM-allowing UTFs starts with a BOM, you know the
encoding*, and you strip off the BOM when you get the content.
*assuming that no UTF-16 file has U+0000 as the first character.
In the real world, this is a pretty good assumption --
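The rule quoted above (a BOM-allowing UTF consumes the BOM, so it is not part of the content) can be illustrated with Python's built-in codecs. This is only a sketch of one implementation's behaviour, not something posted in the thread; the sample bytes are made up for the illustration.

```python
# UTF-16LE bytes for "hi", preceded by the BOM FF FE.
data = b"\xff\xfeh\x00i\x00"

# "utf-16" allows a BOM: it selects the byte order and is not part of the text.
text_bom_aware = data.decode("utf-16")

# "utf-16-le" fixes the byte order up front, so U+FEFF stays in the content.
text_fixed_order = data.decode("utf-16-le")

# "utf-8-sig" is Python's name for UTF-8 with an optional leading signature.
text_utf8 = b"\xef\xbb\xbfhi".decode("utf-8-sig")

print(repr(text_bom_aware))    # 'hi'
print(repr(text_fixed_order))  # '\ufeffhi'
print(repr(text_utf8))         # 'hi'
```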
Doug Ewell wrote:
As Shlomi points out, Microsoft products do not treat UTF-7
specially, except that IE recognizes the UTF-7 BOM and sets its encoding
accordingly (but this is true for any UTF-7 sequence, not just the BOM;
try loading a text file containing only the 11 ASCII characters
/Unix BOM FAQ again (small fix)
Mark Davis [EMAIL PROTECTED] wrote:
- when one of the BOM-allowing UTFs starts with a BOM, you know the
encoding*, and you strip off the BOM when you get the content.
*assuming that no UTF-16 file has U+0000 as the first character.
In the real world
Mark Davis [EMAIL PROTECTED] wrote:
- when one of the BOM-allowing UTFs starts with a BOM, you know the
encoding*, and you strip off the BOM when you get the content.
*assuming that no UTF-16 file has U+0000 as the first character.
In the real world, this is a pretty good assumption
This thread seems just about ended, and I don't want to be the person
to revive it, but there have been numerous related topics in the past
six months, and nothing in them answers the question that has been
nagging me.
The question is
Considering the difficulty of actually getting access to
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this
seems to be something that the _application_ has to decide, not the _converter_ that
the application instantiates.
This converter name is (currently) only a convenience alias for use the UTF-16 byte
Rick Cameron wrote:
So the original statement was correct. If the file starts with FF FE, it
must be a little-endian encoding; but you can't tell whether it's UTF-16 or
UTF-32.
If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE
is unambiguous.
If you
PROTECTED]; Kenneth Whistler
[EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, April 10, 2002 09:45
Subject: RE: MS/Unix BOM FAQ again (small fix)
So the original statement was correct. If the file starts with FF FE, it
must be a little-endian encoding; but you can't tell whether it's UTF-16
If you look for any Unicode signature, then you look for FF
FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE).
FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM
followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as text,
but there's no knowing
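The check order and the ambiguity discussed in this exchange can be sketched in Python as follows. `sniff_signature` and the `SIGNATURES` table are hypothetical names introduced for this illustration; only the `codecs.BOM_*` constants come from the standard library. The sketch reports a guess and flags the one case the reply objects to, rather than claiming to resolve it.

```python
import codecs

# Longer signatures first, so FF FE 00 00 is not cut short as FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def sniff_signature(data):
    """Hypothetical helper: return (guess, ambiguous) for the leading bytes."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            # FF FE 00 00 also reads as a UTF-16LE BOM followed by U+0000.
            return name, bom == codecs.BOM_UTF32_LE
    return None, False


print(sniff_signature(b"\xff\xfe\x00\x00A\x00\x00\x00"))  # ('UTF-32LE', True)
print(sniff_signature(b"\xff\xfeA\x00"))                  # ('UTF-16LE', False)
```

A caller that has outside knowledge of the encoding (a charset declaration, for instance) would not need the guess at all; the flag only matters when the signature is the sole evidence.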
: MS/Unix BOM FAQ again (small fix)
If you look for any Unicode signature, then you look for FF
FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE).
FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM
followed by a UTF-16 U+0000. Yes, the NULL is usually not thought
A small fix for the FAQ; specifically, a fix for the typo/braino of
construing 0x071F as little-endian 1F 70 instead of (the now fixed) 1F 07.
Thanks to Wladislaw Vaintroub for pointing it out for me.
--- BEGIN ---
Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
by Shlomi Tal ([EMAIL
σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com
- Original Message -
From: Shlomi Tal [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, April 09, 2002 10:43
Subject: MS/Unix BOM FAQ again (small fix)
A small fix for the FAQ
beginning with bytes FE FF:
- UTF-16 = big endian, omitted from contents
- UTF-16BE = ZWNBSP
- UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE = malformed, file
corrupted
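The list above amounts to a lookup keyed on the declared encoding. A rough Python sketch follows; `LEADING_FE_FF` and `interpret_leading_fe_ff` are hypothetical names for this illustration, and the final check uses Python's own UTF-8 decoder only as one example of a decoder rejecting these bytes.

```python
# Hypothetical lookup mirroring the list above: what a leading FE FF means
# under each declared encoding.
LEADING_FE_FF = {
    "UTF-16": "big endian, omitted from contents",
    "UTF-16BE": "ZWNBSP",
    "UTF-16LE": "malformed",
    "UTF-8": "malformed",
    "UTF-32": "malformed",
    "UTF-32BE": "malformed",
    "UTF-32LE": "malformed",
}


def interpret_leading_fe_ff(declared):
    """Return the list's verdict for a file that starts with FE FF."""
    return LEADING_FE_FF.get(declared.upper(), "unknown encoding")


# Cross-check one "malformed" row: FE can never appear in well-formed UTF-8.
try:
    b"\xfe\xff\x00A".decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

print(interpret_leading_fe_ff("utf-16be"))  # ZWNBSP
print(utf8_rejected)                        # True
```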
Isn't FF FE 00 00 a valid sequence for UTF-32LE?
It looks to me like Shlomi's chart and Mark's chart for interpreting
the BOMs are describing slightly different situations.
Mark's table assumes that you have the BOM and some other additional
indication of the data's encoding - a charset= declaration, or an
xml encoding declaration, or
This is incorrect. Here is a summary of the meaning of those bytes at
the start of text files with different Unicode encoding forms.
beginning with bytes FE FF:
- UTF-16 = big endian, omitted from contents
beginning with bytes FF FE:
- UTF-16 = little endian, omitted from contents
, 2002 15:43
Subject: Re: MS/Unix BOM FAQ again (small fix)
It looks to me like Shlomi's chart and Mark's chart for interpreting
the BOMs are describing slightly different situations.
Mark's table assumes that you have the BOM and some other additional
indication of the data's encoding
Shlomi Tal [EMAIL PROTECTED] wrote:
Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
...
There is another, very common Unicode encoding scheme called UTF-8,
which maps the Unicode repertoire into sequences of bytes. Since
the order of bytes (as opposed to words of more than one byte)
- Original Message -
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, April 09, 2002 19:23
Subject: Re: MS/Unix BOM FAQ again (small fix)
I agree, there are different ways to look at it. But the statement