Thank you, Otto. Sorry for the delay in replying; I spent the entire Sunday replying to the Jaques twins.
You are absolutely right about the choice between ISO-8859-1 and UTF-8. I shouldn't have said "using ISO-8859-1 is advantageous over UTF-8." It is efficient if your pages are written in a language that uses single-byte code points. When you mix in multi-byte code points, the ideal, as you said, is to have them in their raw form. But in practice, this is not as easy as we think.

Actually, the trade-off is not great for me because I use very few non-SBCS characters. Each 2-byte character would end up as seven or eight bytes as a hex character entity (&#x0D85; is eight). If you want to control the look of your web site, you probably have to have expensive software to do it. As for poor me, I use CSS, JavaScript and HTML inside HTML-Kit.

HTML5 treats UTF-8 as the expected character set if you do not declare one explicitly. My current pages are in HTML 4. As I said, I use HTML-Kit (and Tools). If I have raw Unicode Sinhala in the HTML or JavaScript, HTML-Kit mangles it and you get character-not-found boxes on the web page. I must use character entities if I want the comfort of HTML-Kit. There are web sites that help you process your mixed SBCS and multi-byte text to make character entities for non-Latin-1 characters. I used them when making my only page that has them (Liyanna). Stop and think why such web sites exist. (Search for "text to unicode".) The world outside Latin-1 is a harsh one.

If I want raw Unicode Sinhala, PTS Pali or IAST Sanskrit, I have to use Notepad instead of HTML-Kit, and it is hard to code without color-coded text. I wanted to see how hard it is to edit a page in Notepad, so I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced me to save the file in UTF-8 format. I ran it through the W3C Validator. It passed the HTML5 test with the following warning:

    Warning: Byte-Order Mark found in UTF-8 File. The Unicode Byte-Order Mark
    (BOM) in UTF-8 encoded files is known to cause problems for some text
    editors and older browsers. You may want to consider avoiding its use
    until it is better supported. The BOM is the first character of the file.

There are myriad hoops that non-Latin users go through to do things that we do routinely. I saw this problem right at the inception. I already know why romanizing is so good. Don't you?

UTF-8 encoding is defined in this RFC: http://www.ietf.org/rfc/rfc2279.txt
This is the table it gives for the way UTF-8 encoding works:

    0000 0000-0000 007F   0xxxxxxx                                        <== ASCII
    0000 0080-0000 07FF   110xxxxx 10xxxxxx                               <== Latin-1 plus higher
    0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx                      <== Unicode Sinhala
    0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Observe that plain ASCII 'a' actually drops from two bytes in UCS-2 to a single byte in UTF-8, that a Latin-1 letter such as 'á' goes from one byte in ISO-8859-1 to two bytes in UTF-8, and that Unicode Sinhala ayanna goes from two bytes in UCS-2 to three. (The Unicode Sinhala block is 0D80 - 0DFF.)

    a      = Hex 61   = Bin 0110 0001
             UTF-8 Template: 0xxxxxxx
             UTF-8 Encoding: 01100001 = Hex 61

    á      = Hex E1   = Bin 1110 0001
             UTF-8 Template: 110xxxxx 10xxxxxx
             UTF-8 Encoding: 11000011 10100001 = Hex C3 A1

    ayanna = Hex 0D85 = Bin 0000 1101 1000 0101
             UTF-8 Template: 1110xxxx 10xxxxxx 10xxxxxx
             UTF-8 Encoding: 11100000 10110110 10000101 = Hex E0 B6 85
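If you want to check these byte counts yourself, the standard TextEncoder API (available in current browsers and in Node.js) always encodes to UTF-8. A small sketch, with the characters chosen only for illustration:

    // Show the UTF-8 bytes, in hex, for each sample character.
    const enc = new TextEncoder();
    const hex = bytes =>
      Array.from(bytes, b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

    console.log(hex(enc.encode('a')));          // 61         -> 1 byte (ASCII stays single-byte)
    console.log(hex(enc.encode('\u00E1')));     // C3 A1      -> 2 bytes (Latin-1 'á')
    console.log(hex(enc.encode('\u0D85')));     // E0 B6 85   -> 3 bytes (Sinhala ayanna)
    console.log(enc.encode('&#x0D85;').length); // 8          -> bytes for the same ayanna as a hex entity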
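And for the BOM warning above, here is a rough Node.js sketch (the file name is only a placeholder) that checks whether a file starts with the three-byte UTF-8 BOM, EF BB BF, and writes a copy without it:

    const fs = require('fs');

    const buf = fs.readFileSync('liyanna.html');   // placeholder file name
    const hasBom = buf.length >= 3 &&
                   buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF;
    console.log('BOM present:', hasBom);

    if (hasBom) {
      // Write the same bytes, minus the 3-byte BOM at the start.
      fs.writeFileSync('liyanna-nobom.html', buf.slice(3));
    }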
Thanks for your input. It is appreciated.

On Wed, Jul 4, 2012 at 2:25 PM, Otto Stolz <[email protected]> wrote:

> Hello Naena Guru,
>
> on 2012-07-04, you wrote:
>> The purpose of declaring the character set as iso-8859-1 than utf-8 is
>> to avoid doubling and trebling the size of the page by utf-8. I think,
>> if you have characters outside iso-8859-1 and declare the page as such,
>> you get Character-not-found for those locations. (I may be wrong).
>
> You are wrong, indeed.
>
> If you declare your page as ISO-8859-1, every octet (aka byte) in your
> page will be understood as a Latin-1 character; hence you cannot have
> any other character in your page. So, your notion of “characters outside
> iso-8859-1” is completely meaningless.
>
> If you declare your page as UTF-8, you can have any Unicode character
> (even PUA characters) in your page.
>
> Regardless of the charset declaration of your page, you can include both
> Numeric Character References and Character Entity References in your
> HTML source, cf., e.g.,
> <http://www.w3.org/TR/html401/charset.html#h-5.3>.
> These may refer to any Unicode character, whatsoever. However, they will
> take considerably more storage space (and transmission bandwidth) than
> the UTF-8 encoded characters would take.
>
> Good luck,
> Otto Stolz

