Re: Byte Order Marks
Yves,

we are thinking about a general API for encoding detection that could initially just check for BOM/Unicode signatures. I believe we have a feature request for this already. Mark and I just brainstormed about what we might want such an API to look like.

The reason for doing what ICU is doing currently is simple pragmatism. None of our converters auto-detects anything, and they write only what you tell them to write. When you deal with serialized data structures and fields in files or databases, that is exactly what you want.

With signature-carrying files and transmission protocols, more work is necessary. It seems to me that a converter API, which can take as little as one byte at a time and has no way to pass additional information ("I know the language of the text..."), is not the best place to implement this.

On output, writing a BOM/signature is easy: if you know you need one, then just call the converter once with U+FEFF. Still, for this one feature, I could imagine having an API "emit a Unicode signature if you are a converter for a Unicode encoding".

markus
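The "call the converter once with U+FEFF" idea can be illustrated outside ICU. The following is a minimal Python sketch, not ICU's API: the function name `encode_with_signature` and the `RAW_UNICODE_ENCODINGS` set are hypothetical, and it assumes the chosen codec is a "raw" one that writes only what it is given.

```python
# Codecs that never add a signature on their own (assumed set for this sketch).
RAW_UNICODE_ENCODINGS = {"utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"}


def encode_with_signature(text: str, encoding: str) -> bytes:
    """Encode text with a leading U+FEFF so the output starts with the signature."""
    name = encoding.lower()
    if name not in RAW_UNICODE_ENCODINGS:
        # Python's plain "utf-16"/"utf-32" codecs already prepend a BOM,
        # and non-Unicode charsets cannot represent U+FEFF as a signature.
        raise ValueError(f"not a raw Unicode encoding: {encoding!r}")
    return ("\ufeff" + text).encode(name)
```

For example, `encode_with_signature("A", "utf-16-le")` yields the bytes FF FE 41 00. The explicit allow-list mirrors the "if you are a converter for a Unicode encoding" condition in the message above.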
RE: Byte Order Marks
> On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
> > On the other hand, if you get a file from your platform and
> > it is in 16-bit Unicode, then you would appreciate the
> > convenience of the auto-endian alias.
>
> But nothing should be spitting out platform-endian UTF-16! In the
> case that there's a lot of unmarked big-endian UTF-16 around (as I
> understand the ISO-10646 standard recommends), then that assumption
> that everything emits unmarked platform-dependent UTF-16 will be
> wrong.

And for reference, on Windows, Unicode files are recognized because they have a BOM. Write plain UTF-16LE w/o a BOM, and your file won't be recognized properly.

Manipulating these files w/ ICU today is a bit painful, since one needs to strip the BOM on input (if I understand Markus correctly) and write a BOM on output. So these files cannot be manipulated using applications like uconv, which blindly use the raw converters.

YA
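The strip-on-input, write-on-output round trip described above might look like the following Python sketch. It is only an illustration under stated assumptions: the function name `rewrite_utf16le_file_bytes` and the `transform` callback are hypothetical, the data is assumed to be UTF-16LE, and none of this is ICU or uconv behavior.

```python
import codecs
from typing import Callable


def rewrite_utf16le_file_bytes(data: bytes, transform: Callable[[str], str]) -> bytes:
    """Strip a leading UTF-16LE BOM, transform the text, and write the BOM back."""
    bom = codecs.BOM_UTF16_LE  # b"\xff\xfe"
    if data.startswith(bom):
        data = data[len(bom):]          # strip the signature on input
    text = data.decode("utf-16-le")     # raw converter: no BOM handling
    # Re-emit the signature so BOM-dependent tools still recognize the file.
    return bom + transform(text).encode("utf-16-le")
```

A caller would pass the file contents and any text-to-text function, for instance `rewrite_utf16le_file_bytes(raw, str.upper)`.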
RE: Byte Order Marks
> > Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
> > UTF16_BigEndian?
>
> ICU does not do Unicode-signature or other encoding detection
> as part of a converter. When you get text from some protocol,
> you need to instantiate a converter according to what you
> know about the encoding.

So I can't pass it some text with a BOM and say "utf-16" and let it run through that. I guess that explains why I also didn't find converters that write a BOM at the start of the conversion. Is that something that would be added to ICU in the future? It would be very nice to have a converter that would pick up the BOM (and write it back).

And yes, most of the time, when you stay on a given platform, it is very convenient to use the platform's endianness. My question was rather "why isn't UTF-16 the one that detects the BOM and defaults to an externalized form, BE, and then people on a given platform would just use UTF-16PE (of which UTF-16 is an alias in ICU)?". That would facilitate interchange of information.

YA
Re: Byte Order Marks
On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
> On the other hand, if you get a file from your platform and it is in 16-bit Unicode,
> then you would appreciate the convenience of the auto-endian alias.

But nothing should be spitting out platform-endian UTF-16! In the case that there's a lot of unmarked big-endian UTF-16 around (as I understand the ISO-10646 standard recommends), then that assumption that everything emits unmarked platform-dependent UTF-16 will be wrong.

(It's never right to have a program emit platform-dependent-endian UTF-16 except in the case of system-local cache files. That breaks interoperation between copies of your program on different systems.)

--
David Starner
Re: Byte Order Marks
Yves Arrouye wrote:
> > If you don't have any clue about the byte order, but you know it is
> > UTF-16, then assume BE.
> Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
> UTF16_BigEndian?

ICU does not do Unicode-signature or other encoding detection as part of a converter. When you get text from some protocol, you need to instantiate a converter according to what you know about the encoding.

Note that guessing big-endian is only the last, desperate part of detecting the encoding. It is not the first choice. If the text is properly tagged (including maybe a signature), then you will never have to open a "UTF-16" converter.

On the other hand, if you get a file from your platform and it is in 16-bit Unicode, then you would appreciate the convenience of the auto-endian alias.

markus
Fwd: Re: Byte Order Marks
>Date: Thu, 19 Apr 2001 12:59:43 -0700
>To: Tomas McGuinness <[EMAIL PROTECTED]>
>From: Asmus Freytag <[EMAIL PROTECTED]>
>Subject: Re: Byte Order Marks
>
>At 02:58 PM 4/19/01 +0200, you wrote:
>>If it's absent is it safe to assume any particular order (i.e. Big or
>>Little Endian?)

The default order is big-endian, but I wouldn't call that a 'safe' assumption. In the most general case I would attempt autorecognition in the unlabelled case.

Where a particular protocol's specification reinforces that the default order SHALL apply for the unlabelled case, the assumption becomes that much stronger, of course.

A./

PS: as an aside: the SCSU encoder can be used to do this form of autorecognition. If text shows much better compression in one byte order than the other, that byte order is overwhelmingly likely to be the true one. The exception would be strings of pure Han ideographs. For these it's necessary to
RE: Byte Order Marks
> If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE.

Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? I know that was a difference between ICU and my library, and when I asked this question a while ago I was told that, despite what some literature suggests, w/o any clue, platform endianness should be used. That's contradictory.

YA
Re: Byte Order Marks
There is an RFC about UTF-16 (RFC 2781) that explains this:

If the text is labeled by the protocol as
- charset=UTF-16, then the first two bytes are the byte order mark
- charset=UTF-16BE, then it is big-endian and the first two bytes are just text
- charset=UTF-16LE, then it is little-endian and the first two bytes are just text

If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE.

Similarly for UTF-32[BE/LE].

If you don't know anything about your text, then you may start some heuristics or reject the text...

markus

Tomas McGuinness wrote:
> A quick question relating to the Byte Order Mark of UCS-2. If it's absent is
> it safe to assume any particular order (i.e. Big or Little Endian?).
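The labeling rules listed above can be sketched in a few lines of Python. This is an illustrative reading of those rules, not ICU code; the function name `decode_utf16_by_label` is hypothetical, and the big-endian fallback for an unmarked "UTF-16" stream follows the "assume BE" advice in the message.

```python
import codecs


def decode_utf16_by_label(data: bytes, charset: str) -> str:
    """Decode UTF-16 bytes according to the protocol's charset label."""
    label = charset.strip().lower()
    if label == "utf-16be":
        # Byte order is fixed by the label; a leading FE FF is ordinary text.
        return data.decode("utf-16-be")
    if label == "utf-16le":
        return data.decode("utf-16-le")
    if label == "utf-16":
        # The first two bytes, if they form a BOM, select the byte order.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # No BOM and no other clue: assume big-endian.
        return data.decode("utf-16-be")
    raise ValueError(f"not a UTF-16 charset label: {charset!r}")
```

Note the asymmetry: with the label "UTF-16BE" the bytes FE FF 00 41 decode to U+FEFF followed by "A", while with "UTF-16" the same bytes decode to just "A".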
Byte Order Marks
Hi,

A quick question relating to the Byte Order Mark of UCS-2. If it's absent, is it safe to assume any particular order (i.e. big- or little-endian)?

I am writing a function to rearrange from big- to little-endian, but without a byte order mark I'm not sure what the order is. Is there any specification I could refer to?

Thanks.

Tom

Tomas McGuinness
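The rearranging function mentioned above is independent of the detection question: swapping the two bytes of every 16-bit code unit converts either order into the other. A minimal Python sketch, with the hypothetical name `swap_utf16_byte_order`, could be:

```python
def swap_utf16_byte_order(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit code unit (BE <-> LE)."""
    if len(data) % 2 != 0:
        raise ValueError("UCS-2/UTF-16 data must have an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # second byte of each unit moves first
    swapped[1::2] = data[0::2]  # first byte of each unit moves second
    return bytes(swapped)
```

The swap also turns a leading FE FF into FF FE, so a BOM that was present stays consistent with the new byte order. Deciding whether to call it at all still requires knowing, or guessing, the input order.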
Re: Byte Order Marks
In a message dated 2001-04-10 3:04:09 Pacific Daylight Time, [EMAIL PROTECTED] writes:

> When looking at a document would it be safe to assume that if you found any
> of the following Byte Order Marks
> * 0xFFFE (UCS-2 Little Endian)
> * 0xFEFE (UCS-2 Big Endian) [should be 0xFEFF]
> * 0xEFBBBF (UTF-8)
> That the document is encoded with that encoding format. That means that if I
> found the first 3 octets to be EF BB BF could I assume I am dealing with a
> UTF-8 Document.

That is usually a safe assumption and a good practice, except that if the first two bytes are 0xFF 0xFE, you should check the next two to see if they are 0x00 0x00 (which would signify little-endian UCS-4). Also, think in terms of UTF-16, not UCS-2.

> Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
> character sets use Byte Order Marks?

Good question. I have not heard of any.

To follow up, what about signatures that are not necessarily byte order marks? UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful for the purpose Tomás mentioned, to indicate the encoding. Do any other character sets have such signatures?

-Doug Ewell
Fullerton, California
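The check described above, including the look-ahead for 0x00 0x00 after 0xFF 0xFE, can be sketched in Python as follows. The function name `detect_signature` and the returned encoding names are choices made for this example; it only inspects the leading bytes and does not validate the rest of the data.

```python
import codecs
from typing import Optional, Tuple

# Longer signatures come first so FF FE 00 00 is not mistaken for UTF-16LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_signature(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, signature length), or (None, 0) if none matches."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

One ambiguity remains by design: FF FE 00 00 could in principle be a UTF-16LE signature followed by U+0000. The sketch prefers the UCS-4/UTF-32 reading, as the reply suggests.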
Byte Order Marks
Hi,

When looking at a document, would it be safe to assume that if you found any of the following Byte Order Marks

* 0xFFFE (UCS-2 Little Endian)
* 0xFEFE (UCS-2 Big Endian)
* 0xEFBBBF (UTF-8)

then the document is encoded with that encoding format? That means that if I found the first 3 octets to be EF BB BF, could I assume I am dealing with a UTF-8 document?

Apart from UTF and Unicode/UCS encoding formats, do any other "legacy" character sets use Byte Order Marks?

Regards,
Tom.

Tomas McGuinness