Re: [WSG] choosing encoding, charset and using special characters
> [UTF-8] it will be stored correctly and rendered as expected, as long as you remember to put a <meta http-equiv="content-type" content="text/html; charset=utf-8"> in your page's head.

Actually, what you should be doing is getting the server to send the right Content-Type header. Meta elements are not authoritative, and in fact they confuse many people when they are superseded by the server headers.

-- 
Manuel
sometimes :) sometimes :( but always working hard for
Simplelógica: appearance, experience and communication on the web.
http://simplelogica.net # (+34) 985 22 12 65
Oh! And writing at Logicola: http://simplelogica.net/logicola/

** The discussion list for http://webstandardsgroup.org/ See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list and getting help **
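As a rough illustration of the server-side approach described in this message, here is a minimal sketch of a WSGI application that states the encoding in the Content-Type header. The function names, the sample page and the port are assumptions made for this example; they are not taken from anyone's actual setup in the thread.

```python
from wsgiref.simple_server import make_server

# Sample page for this sketch; it deliberately has no meta element,
# so the HTTP header is the only encoding declaration.
PAGE = "<!DOCTYPE html><title>servicio técnico especializado para músicos</title>"


def app(environ, start_response):
    """Serve PAGE as UTF-8 and say so in the Content-Type header."""
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def run_demo_server(port=8000):
    """Start a local test server (hypothetical port); not called automatically."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Calling `run_demo_server()` and requesting the page should show the `charset=utf-8` parameter in the response headers. Other servers and frameworks have their own configuration for the same header.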
RE: [WSG] choosing encoding, charset and using special characters
Hello Julián,

At the W3C we wrote some material to answer your questions. Please see:
http://www.w3.org/International/tutorials/tutorial-char-enc/
and
http://www.w3.org/International/geo/html-tech/tech-character.html (still early draft!)

Please take a look (and let me know if there is any way we can improve the material).

Cheers,
RI

Richard Ishida
W3C
contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Dejan Kozina
Sent: 22 November 2004 01:44
To: [EMAIL PROTECTED]
Subject: Re: [WSG] choosing encoding, charset and using special characters

> Julián Landerreche wrote:
>> 1) Question: Is there a way to use special characters directly in the code?
>
> Two ways, actually, both requiring the pages to be displayed as utf-8. One is writing the document with an editor capable of saving text as utf-8 (UniRed is the one I like - http://www.esperanto.mv.ru/UniRed/ENG/), so that anything you can key or paste into it will be stored correctly and rendered as expected, as long as you remember to put a <meta http-equiv="content-type" content="text/html; charset=utf-8"> in your page's head. The other one is using a browser form to input the text and send it to some sort of CMS. Provided the page with the form is utf-8 too, all modern browsers will convert the whole lot to utf-8 while uploading.
>
>> 2) I have seen a lot of webpages that use the special characters directly and don't code them as HTML entities. These pages are displayed correctly. Question: Is this a good or bad practice (to use special characters in code, instead of entities)?
>
> In my experience, it is OK to do it using Unicode; otherwise you're relying on unwarranted assumptions about the native codepage of the reader's machine (example: if you use an è in your source it will probably be displayed as such on any Spanish, and generally any western-language, OS, but it will become a č on most Central European PCs).

As long as you declare the encoding of your page, and that encoding contains the character you want to display, it is better to use characters rather than escapes. Apart from anything else, it improves maintainability and reduces bandwidth.

>> 3. In Google results, I found that those special characters aren't always correctly displayed.
>
> Google uses utf-8 for display, so your browser renders the title as if it were encoded as such.
>
>> Question: Is there a way to force or override the encoding (not the charset) directly from the page code? I think that my Textpattern-managed pages should have ISO-8850-1 encoding.

You presumably mean ISO-8859-1 (rather than 8850). Note that the W3C now serves its pages using utf-8. It makes life a lot easier when you have multilingual pages or a number of pages in multiple languages.

> You can try using the numeric character references (written as &#xxx;, where xxx is the decimal value of the character) or the hexadecimal ones (written as &#xhhhh;, where hhhh is the hex value of the same). The complete list of references is at ftp://ftp.unicode.org/Public/MAPPINGS/.

Note that the numeric value MUST be a Unicode code point value, whatever the encoding you are using. There are easier ways of finding a Unicode code point. For example, you could try my UniView utility at http://www.w3.org/People/Ishida/utilities.html

>> 3. If I change to UTF-8... which are the advantages / disadvantages?
>
> The main advantages are correct rendering in all modern browsers and OSes, plus the possibility of hassle-free mixing of characters from any charset on a single page. Besides this, it is rapidly becoming the standard encoding for all sorts of documents, on the web or otherwise.

As alluded to above. Significant advantages also arise when receiving form data from multilingual pages and storing it centrally. You don't need to figure out which encoding was used, and convert.

Hope that helps.
RI

> There are disadvantages: Netscape 4.7 mostly doesn't recognize the characters (except for the first 128, which are part of ASCII), and Mac OS 9 and below sometimes display them in a weird way.
>
> One final word about the document title: even if you place the above meta before the title tag and tweak your server to transmit the correct MIME type, almost any browser around will still use the OS's default 'window title' font for the title, so it will be displayed as expected only if that font contains the required glyphs (or shapes). It will display correctly in Google listings, nevertheless.
>
> -- 
> Dejan Kozina Web Design Studio
> Dolina 346 (TS) I-34018 Trst/Trieste - Italy
> tel./fax: +39 040 228 436 cell.: +39 348 7355 225
> http://www.kozina.com/ e-mail: [EMAIL PROTECTED
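To make the point about numeric character references concrete, the following sketch builds the decimal and hexadecimal references for a character from its Unicode code point. The helper name `numeric_references` and the sample characters are assumptions chosen for illustration; only Python's standard library is used.

```python
import html


def numeric_references(ch):
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(ch)  # Unicode code point, independent of any file encoding
    return f"&#{code_point};", f"&#x{code_point:X};"


# ñ is U+00F1, so its references are the same in an ISO-8859-1 or a UTF-8 page.
print(numeric_references("ñ"))  # ('&#241;', '&#xF1;')

# č is byte 0xE8 (232) in windows-1250, but its reference must use U+010D (269).
print("č".encode("cp1250"))     # b'\xe8'
print(numeric_references("č"))  # ('&#269;', '&#x10D;')
print(html.unescape("&#232;"))  # è  (code point 232, not the windows-1250 byte)

# Rough size comparison behind the "reduces bandwidth" remark.
print(len("ñ".encode("utf-8")), len("&ntilde;"), len("&#241;"))  # 2 8 6
```

The last line is only a byte count for one character; actual savings depend on how many non-ASCII characters a page contains.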
Re: [WSG] choosing encoding, charset and using special characters
Manuel González Noriega wrote:
>> [UTF-8] it will be stored correctly and rendered as expected, as long as you remember to put a <meta http-equiv="content-type" content="text/html; charset=utf-8"> in your page's head.
>
> Actually, what you should be doing is getting the server to send the right Content-Type header. Meta elements are not authoritative, and in fact they confuse many people when they are superseded by the server headers.

You're right, of course. I still put the declaration in the meta, just in case somebody wants to save the page to disk (and because I still remember the good old days when I had no access to the server config).

-- 
Dejan Kozina
RE: [WSG] choosing encoding, charset and using special characters
Hello Manuel, Dejan,

There are pros and cons to using the HTTP header to declare the encoding. At the W3C we recommend that you always declare the encoding inside the document, whether or not you use the HTTP header. Unlike something like the language declaration, the meta statement for character encoding declarations is very widely recognised, and it is the only in-document means of declaring the encoding for HTML. If you are serving XHTML you also need to consider the pros and cons of using the XML declaration.

For more detail, see
http://www.w3.org/International/tutorials/tutorial-char-enc/
and
http://www.w3.org/International/geo/html-tech/tech-character.html (still early draft!)

Cheers,
RI

Richard Ishida
W3C
contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Manuel González Noriega
Sent: 22 November 2004 09:40
To: [EMAIL PROTECTED]
Subject: Re: [WSG] choosing encoding, charset and using special characters

>> [UTF-8] it will be stored correctly and rendered as expected, as long as you remember to put a <meta http-equiv="content-type" content="text/html; charset=utf-8"> in your page's head.
>
> Actually, what you should be doing is getting the server to send the right Content-Type header. Meta elements are not authoritative, and in fact they confuse many people when they are superseded by the server headers.
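The interplay between the two declarations discussed so far can be sketched as a small lookup: use the charset from the HTTP header when there is one, otherwise fall back to the in-document meta declaration. This is a deliberately simplified model, not a description of how any particular browser works; the function names, the 1024-byte search window and the ISO-8859-1 default are all assumptions of this sketch.

```python
import re

# Matches charset=... in either a header value or a meta element.
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(document_bytes):
    """Look for a charset declaration near the start of the document."""
    head = document_bytes[:1024].decode("ascii", errors="replace")
    match = CHARSET_RE.search(head)
    return match.group(1).lower() if match else None


def effective_charset(content_type, document_bytes, default="iso-8859-1"):
    """Header first, then the in-document declaration, then an assumed default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(document_bytes)
        or default
    )


saved_page = (
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=utf-8"><title>x</title></head></html>'
)
print(effective_charset("text/html; charset=ISO-8859-1", saved_page))  # iso-8859-1
print(effective_charset(None, saved_page))                             # utf-8
```

The second call models a page opened from disk, where no HTTP header exists and the in-document declaration is the only information left.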
Re: [WSG] choosing encoding, charset and using special characters
On Mon, 22 Nov 2004 15:51:24 -, Richard Ishida [EMAIL PROTECTED] wrote:
> Hello Manuel, Dejan,
>
> There are pros and cons to using the HTTP header to declare the encoding. At the W3C we recommend that you always declare the encoding inside the document, whether or not you use the HTTP header. Unlike something like the language declaration, the meta statement for character encoding declarations is very widely recognised, and it is the only in-document means of declaring the encoding for HTML. If you are serving XHTML you also need to consider the pros and cons of using the XML declaration.

I stand corrected. I thought it was a much more clear-cut scenario, where server headers were The Right Way and meta was almost irrelevant. I'll read those links carefully.

-- 
Manuel
Re: [WSG] choosing encoding, charset and using special characters
Julián Landerreche wrote:
> 1) Question: Is there a way to use special characters directly in the code?

Two ways, actually, both requiring the pages to be displayed as utf-8. One is writing the document with an editor capable of saving text as utf-8 (UniRed is the one I like - http://www.esperanto.mv.ru/UniRed/ENG/), so that anything you can key or paste into it will be stored correctly and rendered as expected, as long as you remember to put a <meta http-equiv="content-type" content="text/html; charset=utf-8"> in your page's head. The other one is using a browser form to input the text and send it to some sort of CMS. Provided the page with the form is utf-8 too, all modern browsers will convert the whole lot to utf-8 while uploading.

> 2) I have seen a lot of webpages that use the special characters directly and don't code them as HTML entities. These pages are displayed correctly. Question: Is this a good or bad practice (to use special characters in code, instead of entities)?

In my experience, it is OK to do it using Unicode; otherwise you're relying on unwarranted assumptions about the native codepage of the reader's machine (example: if you use an è in your source it will probably be displayed as such on any Spanish, and generally any western-language, OS, but it will become a č on most Central European PCs).

> 3. In Google results, I found that those special characters aren't always correctly displayed.

Google uses utf-8 for display, so your browser renders the title as if it were encoded as such.

> Question: Is there a way to force or override the encoding (not the charset) directly from the page code? I think that my Textpattern-managed pages should have ISO-8850-1 encoding.

You can try using the numeric character references (written as &#xxx;, where xxx is the decimal value of the character) or the hexadecimal ones (written as &#xhhhh;, where hhhh is the hex value of the same). The complete list of references is at ftp://ftp.unicode.org/Public/MAPPINGS/.

> 3. If I change to UTF-8... which are the advantages / disadvantages?

The main advantages are correct rendering in all modern browsers and OSes, plus the possibility of hassle-free mixing of characters from any charset on a single page. Besides this, it is rapidly becoming the standard encoding for all sorts of documents, on the web or otherwise.

There are disadvantages: Netscape 4.7 mostly doesn't recognize the characters (except for the first 128, which are part of ASCII), and Mac OS 9 and below sometimes display them in a weird way.

One final word about the document title: even if you place the above meta before the title tag and tweak your server to transmit the correct MIME type, almost any browser around will still use the OS's default 'window title' font for the title, so it will be displayed as expected only if that font contains the required glyphs (or shapes). It will display correctly in Google listings, nevertheless.

-- 
Dejan Kozina
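The codepage problem described in this message can be reproduced in a few lines. The sketch below decodes one and the same byte under a Western European and a Central European codepage, then shows that UTF-8 can hold both resulting characters in a single document. The byte value 0xE8 and the sample string are assumptions chosen for illustration.

```python
raw = bytes([0xE8])  # a single byte, as saved by an editor using a one-byte codepage

print(raw.decode("iso-8859-1"))  # è  (Western European reading)
print(raw.decode("cp1250"))      # č  (Central European reading)

# UTF-8 can store both characters side by side and round-trips them unchanged.
mixed = "è and č on one page"
utf8_bytes = mixed.encode("utf-8")
print(utf8_bytes.decode("utf-8") == mixed)  # True

# A single-byte codepage cannot: ISO-8859-1 has no slot for č.
try:
    mixed.encode("iso-8859-1")
except UnicodeEncodeError as error:
    print("ISO-8859-1 cannot store:", error.object[error.start])
    # ISO-8859-1 cannot store: č
```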
Re: [WSG] choosing encoding, charset and using special characters
Hi Julián,

We have issues in New Zealand with words in Māori. In government we are required to use macrons to indicate a long vowel sound in Māori words. The way we do it is to use UTF-8 as the document character set, encoded in 7-bit ASCII. More info is available here: http://www.tpk.govt.nz/using/macron_paper/index.html

For older/non-utf-8-capable browsers there are a couple of server components that translate them back to the nearest ASCII character (http://tinyurl.com/3vw5v), although I think these only work for the 5 vowels. They are GPL licensed, so I guess they could be extended if you needed.

Joe

On Thu, 18 Nov 2004 17:51:09 -0300, Julián Landerreche [EMAIL PROTECTED] wrote:
> Hi all,
> my name is Julián, I'm from Buenos Aires, Argentina.
>
> I have read this great tutorial (http://www.w3.org/International/tutorials/tutorial-char-enc/) recommended by WSG. The article makes things clearer to me, but not totally. I feel this topic (choosing an encoding and using special characters) is a difficult one to understand, for newbies in standards (as I am) and for non-newbies too. But I think it's a bit more difficult for me because I write in Spanish, so I usually need to use special characters like é, ú or ñ.
>
> I have chosen to use ISO-8859-1 as the charset for my webpages, and I usually code special characters with HTML entity references. Example: é = &eacute; ú = &uacute; ñ = &ntilde; etc.
>
> Well, let me ask a few questions:
>
> 1) Question: Is there a way to use special characters directly in the code? I would like to use é or ú or ñ directly, and not code them as HTML entity references. Hey, don't think I'm a lazy boy; just suppose this situation: if I have a blog, I cannot expect that people (who post comments on my blog) know how to use HTML entity references. Surely they will prefer to type the special characters (é, ú, ñ). I wouldn't like it if, when they use special characters in a post, the post can't be correctly displayed (i.e. by showing those weird characters like the black ?).
>
> 2) I have seen a lot of webpages that use the special characters directly and don't code them as HTML entities. These pages are displayed correctly. Question: Is this a good or bad practice (to use special characters in code, instead of entities)?
>
> 3. In Google results, I found that those special characters aren't always correctly displayed. Example: my webpage title in two Google search results.
> i). servicio tcnico especializado para msicos (b!)
>    a. encoding: UTF-8
>    b. charset: ISO-8859-1 (from a page managed by Textpattern)
> ii). servicio técnico especializado para músicos
>    a. encoding: ISO-8859-1
>    b. charset: ISO-8859-1 (from a page managed by another script, or from hardcoded pages)
> Question: Is there a way to force or override the encoding (not the charset) directly from the page code? I think that my Textpattern-managed pages should have ISO-8850-1 encoding. (This is a question I must also ask in the Textpattern forums, because I don't know why pages managed by TXP have UTF-8 encoding, as there isn't any line in my whole site's headers that shows utf-8.)
>
> 3. If I change to UTF-8...
>    a. which are the advantages / disadvantages?
>    b. I have tested it in a few of my pages - all special characters (not encoded as entities) are incorrectly displayed... yuck!
>
> Well, I think that's all, just to start. I would like to read more resources about encoding and charset, and also read experiences from the people on this list. And I would also like to read about the experiences of people who speak (and write pages in) Spanish; is there anyone on the list?
>
> Thanks to everyone! Thank you! Excuse my poor English!
>
> Julián Landerreche
> Buenos Aires, Argentina
> www.midi-midi.com.ar (not finished yet)

-- 
Gmail invites - just ask nicely
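One way to read "UTF-8 as the document character set, encoded in 7-bit ASCII" is that non-ASCII letters are written as numeric character references, with a plain-ASCII fallback for clients that cannot show them. The sketch below illustrates both ideas under that assumption; the helper names are hypothetical, and this is not the GPL server component mentioned in the message.

```python
import unicodedata


def to_ascii_references(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_nearest_ascii(text):
    """Drop combining marks (such as macrons) to approximate the text in ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII becomes '?' rather than raising an error.
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii_references("Māori"))  # M&#257;ori
print(to_nearest_ascii("Māori"))     # Maori
```

Because the fallback works on Unicode decomposition rather than a fixed table, it is not limited to the five macron vowels, but it also loses information that a reader of the language may need.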
Re: [WSG] choosing encoding, charset and using special characters
Hi Julián,

I think it's even a difficult article for techies, because there's little good advice. So here's some good advice: http://www.joelonsoftware.com/articles/Unicode.html

"In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article."

"1) Question: Is there a way to use special characters directly in the code?"

If those characters are in 8859-1, then you can use them. But because 8859-1 shares that byte range with lots of other encodings, some software (like Google) can get confused when it tries to merge multiple charsets. That might be the Google problem you were seeing.

"2) I have seen a lot of webpages that directly use the special character and dont code them as html entities. This pages are displayed correctly. Question: Is this a good or bad practice (to use special characters in code, instead of entities)?"

Character entities can use an ASCII encoding, whereas encoded "special characters" use the file encoding (regardless of whether they're Unicode or 8859). So if your software supports Unicode encoding (e.g. a UTF-8 encoded file with 'extended characters' doesn't get mangled) then it doesn't really matter. There are very few browsers that don't display Unicode correctly when given encoded characters or entities. When browsers aren't Unicode-aware they tend to display unknown entities as question marks, whereas unknown encoded characters come out as garbled text, if that matters. So it seems that it's mostly to do with your internal software support, rather than browsers.

"3. In Google results, I found that those special characters arent always correctly displayed."

It seems that Google uses Unicode (it has the meta tag, and the special characters are Unicode-encoded rather than entities). If you do a Google search for "macron site:e-government.govt.nz" you'll see that the Māori language is displayed correctly in Google. So it seems that Google doesn't have a problem with Unicode, but maybe it has a problem with merging multiple 'extended-ASCII' charsets on a single page.

I think the general opinion is that unless you've got a legacy system, Unicode, via UTF-8, is where people should already be.

.Matthew Cruickshank
http://holloway.co.nz/