Scott Eade wrote:

Okay, I'll answer my own question:
1. The character /u2019 will not be converted to a character reference when
UTF-8 is used


Correct

(it will use three bytes and will not be displayed correctly in
applications that do not correctly deal with UTF-8 - e.g. Windows notepad).

Notepad _can_ display Unicode characters from files that have been saved as UTF-8, as long as the font selected in Notepad has glyphs for those characters. At work, we have lots of files containing Chinese characters that are saved as UTF-8, and I use the SimSun or SimHei font to view them, including XML files in UTF-8.
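
As a quick check of point 1, here is a minimal Python sketch (Python is only my choice for illustration; it is not what the original poster's software uses). It shows that U+2019 can be written directly in UTF-8, where it takes the three bytes E2 80 99, that an encoding unable to represent it has to fall back to a character reference, and what an application sees if it misreads the UTF-8 bytes as windows-1252.

```python
# U+2019 is RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe.
ch = "\u2019"

# UTF-8 can represent the character directly, so no character reference is needed.
utf8_bytes = ch.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e2 80 99

# An encoding that cannot represent the character needs a character reference instead.
ascii_bytes = ch.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'&#8217;'

# An application that assumes windows-1252 turns the three UTF-8 bytes
# into three unrelated characters.
misread = utf8_bytes.decode("cp1252")
print([hex(ord(c)) for c in misread])  # ['0xe2', '0x20ac', '0x2122']
```

The last line is the familiar three-character mess (a-circumflex, euro sign, trademark sign) that shows up where an apostrophe should be.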

When you do a "Save As", you have the option to save the file as UTF-8 (and UTF-16, I think). Notepad also puts a BOM (byte order mark) at the front of the file. You can see this BOM in a hex editor.
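
If you don't have a hex editor handy, the BOM can be seen with a few lines of Python. This is only a sketch: it assumes the BOM Notepad writes is the standard three-byte UTF-8 signature (EF BB BF), and it uses Python's "utf-8-sig" codec to stand in for Notepad's "Save As" rather than reading a real Notepad file.

```python
import codecs

text = "caf\u00e9"

# "utf-8-sig" prepends the UTF-8 signature (BOM) when encoding.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3].hex())                # efbbbf
print(with_bom[:3] == codecs.BOM_UTF8)   # True

# A plain UTF-8 decode keeps the BOM as a leading U+FEFF character...
plain = with_bom.decode("utf-8")
print(repr(plain[0]))                    # '\ufeff'

# ...while "utf-8-sig" strips it on the way in.
stripped = with_bom.decode("utf-8-sig")
print(stripped == text)                  # True
```

That leading U+FEFF is what trips up tools that do not expect a BOM at the start of a UTF-8 file.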

2. In the cases where character references are used an editing component is
causing them to be encoded - the component is not being used in the places
where the characters are not encoded.
3. Windows file encodings are a PITA.

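The default is windows-1252 in most cases at least (it will be different, of course, for someone running Windows Thai).
It's _not_ the same as iso-8859-1. You can think of windows-1252 as a superset of iso-8859-1: it assigns printable characters such as the "smart quotes" to the 0x80-0x9F range, which iso-8859-1 reserves for control codes.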


http://czyborra.com/charsets/iso8859.html

On some websites, what were supposed to be "smart quote" characters appear as question marks or as some other funny character in a non-IE browser.
It turns out that the HTTP header advertised the page as iso-8859-1, but the file itself was encoded in windows-1252.
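
The mismatch is easy to reproduce. The sketch below is in Python and uses a made-up sample string; it is not taken from any particular website. It decodes the same bytes once as windows-1252 and once as iso-8859-1, then shows one way the question marks can arise when a smart quote is forced into iso-8859-1.

```python
# Bytes as a windows-1252 editor would save a word wrapped in smart double quotes.
raw = b"\x93smart\x94"

# Decoded as windows-1252, 0x93 and 0x94 are the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")
print(hex(ord(as_cp1252[0])), hex(ord(as_cp1252[-1])))  # 0x201c 0x201d

# Decoded as iso-8859-1, the same bytes become invisible C1 control codes.
as_latin1 = raw.decode("latin-1")
print(hex(ord(as_latin1[0])), hex(ord(as_latin1[-1])))  # 0x93 0x94

# Going the other way, iso-8859-1 has no slot for a smart quote at all,
# so a lossy conversion substitutes a question mark.
lossy = "\u2019".encode("latin-1", errors="replace")
print(lossy)  # b'?'
```

A browser that takes the iso-8859-1 label literally ends up with the control codes from the second case, which is why the quotes come out wrong.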


4. I know more now than I did before.

Sorry for the noise.

Scott




