Re: [WSG] choosing encoding, charset and using special characters

2004-11-22 Thread Manuel González Noriega
> [UTF-8] it will be stored correctly and rendered as expected, as long
> as you remember to put a <meta http-equiv="content-type"
> content="text/html; charset=utf-8"> in your page's head.

Actually, what you should be doing is getting the server to send the
right Content-Type header. Meta elements are not authoritative, and in
fact they confuse many people when they are superseded by the
server headers.
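
As a rough illustration of the precedence described above, here is a minimal Python sketch. The function names (charset_from_header, charset_from_meta, effective_charset), the 1024-byte scan window and the regular expression are all assumptions made for this example, not how any particular browser or server works. It only shows the order of lookup: HTTP header first, in-document meta declaration second.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None when absent


# Simplified pattern (an assumption): charset=... inside a meta element.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)


def charset_from_meta(body: bytes) -> Optional[str]:
    """Look for a meta charset declaration near the top of the document."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """HTTP header first, in-document meta declaration second."""
    return charset_from_header(content_type) or charset_from_meta(body)


page = (b'<html><head><meta http-equiv="content-type" '
        b'content="text/html; charset=utf-8"></head></html>')

print(effective_charset("text/html; charset=ISO-8859-1", page))  # header wins
print(effective_charset("text/html", page))  # no charset in header: meta is used
print(effective_charset(None, page))         # no header at all, e.g. a saved file
```

The first call is the confusing case: the page says utf-8, the server says ISO-8859-1, and the server's value is the one that counts.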



-- 
Manuel 
sometimes :) sometimes :(
but always working hard for Simplelógica: appearance,
experience and communication on the web.
http://simplelogica.net # (+34) 985 22 12 65

Oh! And writing at Logicola: http://simplelogica.net/logicola/
**
The discussion list for  http://webstandardsgroup.org/

 See http://webstandardsgroup.org/mail/guidelines.cfm
 for some hints on posting to the list & getting help
**



RE: [WSG] choosing encoding, charset and using special characters

2004-11-22 Thread Richard Ishida
Hello Julián,

At the W3C we wrote some material to answer your questions.  Please see:

http://www.w3.org/International/tutorials/tutorial-char-enc/

and 

http://www.w3.org/International/geo/html-tech/tech-character.html (still early 
draft!)

Please take a look (and let me know if there is any way we can improve the 
material).

Cheers,
RI



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Kozina
 Sent: 22 November 2004 01:44
 To: [EMAIL PROTECTED]
 Subject: Re: [WSG] choosing encoding, charset and using 
 special characters
 
 
 
> Julián Landerreche wrote:
>
>> 1) Question: Is there a way to use special characters directly in the
>> code?
>
> Two ways, actually, both requiring the pages to be displayed as utf-8.
> One is writing the document with an editor capable of saving text as
> utf-8 (Unired is the one I like -
> http://www.esperanto.mv.ru/UniRed/ENG/), so that anything you
> can key or paste in it will be stored correctly and rendered
> as expected, as long as you remember to put a
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
> in your page's head. The other one is using a browser's form to
> input the text and send it to some sort of CMS. Provided the page
> with the form is utf-8 too, all modern browsers will convert the
> whole stuff to utf-8 while uploading.
>
>> 2) I have seen a lot of webpages that directly use the special
>> character and don't code them as html entities. These pages are
>> displayed correctly. Question: Is this a good or bad practice (to use
>> special characters in code, instead of entities)?
>
> According to my experience, it is OK to do it using Unicode,
> otherwise you're relying on unwarranted assumptions regarding
> the native codepage of the reader's machine (example: if you
> use an è in your source it will probably be displayed as such
> on any Spanish and generally western language OS, but it will
> become a č on most Central European PCs).


As long as you declare the encoding of your page, and that encoding contains 
the character you want to display, it is better to use characters rather than 
escapes.  Apart from anything else, it improves maintainability and reduces 
bandwidth.


 
>> 3. In Google results, I found that those special characters aren't
>> always correctly displayed.
>
> Google uses utf-8 for display, so your browser renders the
> title as if it was encoded as such.
>
>> Question:  Is there a way to force or override the encoding (not the
>> charset) directly from the page code?
>> I think that my textpattern managed pages should have ISO-8850-1
>> encoding.


You presumably mean ISO-8859-1 (rather than 8850).  Note that the W3C now 
serves its pages using utf-8.  It makes life a lot easier when you have 
multilingual pages or a number of pages in multiple languages.

 
> You can try using the numeric character references (written
> as &#xxx;, where xxx is the decimal value of the character) or
> the hexadecimal ones (written as &#xhhhh;, where hhhh is the
> hex value of the same). The complete list of references is at
> ftp://ftp.unicode.org/Public/MAPPINGS/.


Note that the numeric value MUST be a Unicode code point value, whatever the 
encoding you are using. There are easier ways of finding a Unicode code point.  
For example, you could try my UniView utility at 
http://www.w3.org/People/Ishida/utilities.html 
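
To make the point about code points concrete, here is a small Python sketch. The helper name numeric_references is made up for this example. It builds the decimal and hexadecimal references from the Unicode code point and prints them next to the byte values the same character has in two encodings, which are not what belongs in a reference.

```python
import html
from typing import Tuple


def numeric_references(char: str) -> Tuple[str, str]:
    """Return the decimal and hexadecimal character references for one character."""
    code_point = ord(char)  # Unicode code point, independent of any byte encoding
    return f"&#{code_point};", f"&#x{code_point:X};"


for ch in "éúñ€":
    dec, hexa = numeric_references(ch)
    # The stored bytes differ between encodings; the reference stays the same.
    print(ch, dec, hexa, ch.encode("cp1252").hex(), ch.encode("utf-8").hex())

# Both forms decode back to the same character.
assert html.unescape("&#241;") == html.unescape("&#xF1;") == "ñ"
```

The euro sign shows why this matters: it is byte 0x80 in windows-1252, but its reference is built from the code point U+20AC, so it is written &#8364; or &#x20AC;.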



 
>> 3. If I change to UTF-8... which are the advantages / disadvantages?
>
> The main advantages are correct rendering in all modern
> browsers and OSes, plus the possibility of hassle-free mixing
> of characters from any charset on a single page. Besides
> this, it is rapidly becoming the standard encoding for all
> sorts of documents, on the web or otherwise.


As alluded to above.  Significant advantages also arise when receiving form 
data from multilingual pages and storing it centrally.  You don't need to 
figure out which encoding was used, and convert.
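
A minimal Python sketch of the form-data point, assuming a hypothetical form field named q and two hypothetical submitting pages, one in UTF-8 and one in ISO-8859-1. It only illustrates why a single known encoding simplifies the receiving side.

```python
from urllib.parse import parse_qs, quote

# The same word as it would arrive from pages in two different encodings.
word = "músicos"
from_utf8_page = "q=" + quote(word, encoding="utf-8")         # q=m%C3%BAsicos
from_latin1_page = "q=" + quote(word, encoding="iso-8859-1")  # q=m%FAsicos

# If every page is UTF-8, one decoding rule covers all submissions.
print(parse_qs(from_utf8_page, encoding="utf-8"))

# A submission from a Latin-1 page needs its own rule. Decoding it as UTF-8
# gives a U+FFFD replacement character instead of the accented letter.
print(parse_qs(from_latin1_page, encoding="iso-8859-1"))
print(parse_qs(from_latin1_page, encoding="utf-8", errors="replace"))
```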

Hope that helps.
RI



 
> There are disadvantages: Netscape 4.7 mostly doesn't recognize
> the characters (except for the first 128 that are part of
> ASCII) and MacOS 9 and below sometimes have a weird way of
> displaying them.
>
> One final word about the document title: even if you place
> the above meta before the title tag and tweak your server to
> transmit the correct MIME type, almost any browser around will
> still use the OS's default 'window title' font for the title,
> so it will be displayed as expected only if that font
> contains the required glyphs (or shapes). It will display
> correctly in Google listings, nevertheless.
 
 

Re: [WSG] choosing encoding, charset and using special characters

2004-11-22 Thread Dejan Kozina






Manuel González Noriega wrote:

>> [UTF-8] it will be stored correctly and rendered as expected, as long
>> as you remember to put a <meta http-equiv="content-type"
>> content="text/html; charset=utf-8"> in your page's head.
>
> Actually, what you should be doing is getting the server to send the
> right content-type header. Meta elements are not authoritative and in
> fact lead many people to confusion when they are superseded by the
> server headers.
  

You're right, of course. I still put the declaration in the meta
element, just in case somebody wants to save the page to disk (and because
I still remember the good old days when I had no access to the server
config).
-- 
Dejan Kozina Web Design Studio
Dolina 346 (TS)
I-34018 Trst/Trieste - Italy
tel./fax: +39 040 228 436
cell.: +39 348 7355 225
http://www.kozina.com/
e-mail: [EMAIL PROTECTED]





RE: [WSG] choosing encoding, charset and using special characters

2004-11-22 Thread Richard Ishida
Hello Manuel, Dejan,

There are pros and cons to using the HTTP header to declare the encoding.
At the W3C we recommend that you always declare encoding inside the
document, whether or not you use the HTTP header.  Unlike something like
language declaration, the meta statement for character encoding declarations
is very widely recognised, and is the only in-document means to declare
encoding for HTML.  If serving XHTML you need to also consider the pros and
cons of using the XML declaration. For more detail, see 

http://www.w3.org/International/tutorials/tutorial-char-enc/

and 

http://www.w3.org/International/geo/html-tech/tech-character.html (still
early draft!)

Cheers,
RI



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 
 




Re: [WSG] choosing encoding, charset and using special characters

2004-11-22 Thread Manuel González Noriega
On Mon, 22 Nov 2004 15:51:24 -, Richard Ishida [EMAIL PROTECTED] wrote:
> Hello Manuel, Dejan,
>
> There are pros and cons to using the HTTP header to declare the encoding.
> At the W3C we recommend that you always declare encoding inside the
> document, whether or not you use the HTTP header.  Unlike something like
> language declaration, the meta statement for character encoding declarations
> is very widely recognised, and is the only in-document means to declare
> encoding for HTML.  If serving XHTML you need to also consider the pros and
> cons of using the XML declaration.

I stand corrected. I thought it was a much clearer scenario, where
server headers were The Right Way and meta was almost irrelevant. I'll
read those links carefully.


-- 
Manuel 



Re: [WSG] choosing encoding, charset and using special characters

2004-11-21 Thread Dejan Kozina

Julián Landerreche wrote:

> 1) Question: Is there a way to use special characters directly in the
> code?

Two ways, actually, both requiring the pages to be displayed as utf-8.
One is writing the document with an editor capable of saving text as
utf-8 (Unired is the one I like -
http://www.esperanto.mv.ru/UniRed/ENG/), so that anything you can key or
paste in it will be stored correctly and rendered as expected, as long
as you remember to put a <meta http-equiv="content-type"
content="text/html; charset=utf-8"> in your page's head. The other one
is using a browser's form to input the text and send it to some sort of
CMS. Provided the page with the form is utf-8 too, all modern browsers
will convert the whole stuff to utf-8 while uploading.

> 2) I have seen a lot of webpages that directly use the special
> character and don't code them as html entities. These pages are
> displayed correctly. Question: Is this a good or bad practice (to use
> special characters in code, instead of entities)?

According to my experience, it is OK to do it using Unicode, otherwise
you're relying on unwarranted assumptions regarding the native codepage
of the reader's machine (example: if you use an è in your source it will
probably be displayed as such on any Spanish and generally western
language OS, but it will become a č on most Central European PCs).
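
The codepage problem described above can be reproduced in a few lines of Python. This is only a sketch: the characters è and ñ are picked for illustration, and ISO-8859-1 and ISO-8859-2 stand in for "western" and "Central European" codepages.

```python
# One stored byte, two readings: the byte never changes, only the
# codepage the reader assumes.
source_char = "è"
raw = source_char.encode("iso-8859-1")   # b'\xe8'

western = raw.decode("iso-8859-1")       # reader assuming a Western codepage
central = raw.decode("iso-8859-2")       # reader assuming a Central European one
print(raw, western, central)

# The same experiment with a Spanish letter.
print("ñ".encode("iso-8859-1").decode("iso-8859-2"))

# When the page is saved and declared as UTF-8, the result no longer
# depends on the reader's native codepage.
assert source_char.encode("utf-8").decode("utf-8") == source_char
```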

> 3. In Google results, I found that those special characters aren't
> always correctly displayed.

Google uses utf-8 for display, so your browser renders the title as if
it was encoded as such.

> Question:  Is there a way to force or override the encoding (not the
> charset) directly from the page code?
> I think that my textpattern managed pages should have ISO-8850-1
> encoding.

You can try using the numeric character references (written as &#xxx;,
where xxx is the decimal value of the character) or the hexadecimal ones
(written as &#xhhhh;, where hhhh is the hex value of the same). The
complete list of references is at ftp://ftp.unicode.org/Public/MAPPINGS/.

> 3. If I change to UTF-8... which are the advantages / disadvantages?

The main advantages are correct rendering in all modern browsers and OSes,
plus the possibility of hassle-free mixing of characters from any
charset on a single page. Besides this, it is rapidly becoming the
standard encoding for all sorts of documents, on the web or otherwise.

There are disadvantages: Netscape 4.7 mostly doesn't recognize the
characters (except for the first 128 that are part of ASCII) and MacOS 9
and below sometimes have a weird way of displaying them.

One final word about the document title: even if you place the above
meta before the title tag and tweak your server to transmit the correct
MIME type, almost any browser around will still use the OS's default
'window title' font for the title, so it will be displayed as expected
only if that font contains the required glyphs (or shapes). It will
display correctly in Google listings, nevertheless.
--
Dejan Kozina Web Design Studio
Dolina 346 (TS)
I-34018 Trst/Trieste - Italy
tel./fax: +39 040 228 436
cell.: +39 348 7355 225
http://www.kozina.com/
e-mail: [EMAIL PROTECTED]





Re: [WSG] choosing encoding, charset and using special characters

2004-11-18 Thread Joseph Lindsay
Hi Julián,

We have issues in New Zealand with words in Māori.  In government we
are required to use macrons to indicate a long vowel sound in Māori
words.  The way we do it is to use UTF-8 as the document character set
encoded in 7-bit ASCII.  More info is available here:
http://www.tpk.govt.nz/using/macron_paper/index.html

For older/non-utf-8 capable browsers there are a couple of server
components that translate them back to the nearest ASCII character
(http://tinyurl.com/3vw5v), although I think these only work for the 5
vowels.  They are GPL licensed, so I guess they could be extended if
you needed.
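
For illustration only, the "nearest ASCII character" fallback could be sketched in Python as below. This is not the GPL component linked above, whose implementation is not shown in this thread. The function name fold_to_ascii is made up, and the approach (Unicode decomposition, then dropping the combining marks) is an assumption about one way to do it.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Replace accented letters with their nearest unaccented ASCII letter."""
    decomposed = unicodedata.normalize("NFD", text)  # "ā" becomes "a" + U+0304
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII has no simple fallback and becomes "?".
    return stripped.encode("ascii", "replace").decode("ascii")


print(fold_to_ascii("Māori"))
print(fold_to_ascii("kōrero"))
```

Unlike a fixed table of the five macronised vowels, decomposition also covers other accented letters, at the cost of losing the vowel-length information entirely.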

Joe


On Thu, 18 Nov 2004 17:51:09 -0300, Julián Landerreche
[EMAIL PROTECTED] wrote:
> Hi all,
> my name is Julián, I'm from Buenos Aires, Argentina.
> I have read this great tutorial
> (http://www.w3.org/International/tutorials/tutorial-char-enc/)
> recommended by WSG. The article makes things clearer to me, but
> not totally.
>
> I feel this topic (choosing encoding and using special characters) is a
> difficult one to understand for newbies in standards (as I am) and
> non-newbies alike.
> But I think it's a bit more difficult for me, because I write in Spanish, so
> I usually need to use special characters like é, ú or ñ.
>
> I have chosen to use ISO-8859-1 as the charset for my webpages.
> And I usually code special characters with html entity references.
> Example:
> é = &eacute; ú = &uacute; ñ = &ntilde; etc.
>
> Well, let me ask a few questions:
>
> 1) Question: Is there a way to use special characters directly in the code?
>
> I would like to use é or ú or ñ directly, and not to code them as html
> entity references.
> Hey, don't think I'm a lazy boy: just suppose this situation: if I have a
> blog, I cannot expect that people (who post comments on my blog) know
> how to use html entity references.
> Surely, they will prefer to type the special characters (é, ú, ñ).
> I wouldn't like it if, when they use special characters in a post, the post
> can't be displayed correctly (i.e. by showing those weird characters like
> the black ? ...)
 
> 2) I have seen a lot of webpages that directly use the special character
> and don't code them as html entities. These pages are displayed correctly.
> Question: Is this a good or bad practice (to use special characters in
> code, instead of entities)?
>
> 3. In Google results, I found that those special characters aren't always
> correctly displayed.
> Example: my webpage title in two Google search results.
>
> i). servicio tcnico especializado para msicos  (b!)
> a. encoding: UTF-8
> b. charset: ISO-8859-1
> (from a page managed by Textpattern)
>
> ii). servicio técnico especializado para músicos
> a. encoding: ISO-8859-1
> b. charset: ISO-8859-1
> (from a page managed by another script, or from hardcoded pages)
>
> Question:  Is there a way to force or override the encoding (not the
> charset) directly from the page code?
> I think that my textpattern managed pages should have ISO-8850-1 encoding.
>
> (This is a question I must also ask in the textpattern forums, because I don't
> know why pages managed by TXP have UTF-8 encoding, as there isn't any
> line in my whole site's headers that shows utf-8)
>
> 3. If I change to UTF-8...
> a. which are the advantages / disadvantages?
> b. I have tested it in a few of my pages - all special characters (not
> encoded as entities) are incorrectly displayed... yuck!
 --
 
> Well, I think that's all, just to start.
> I would like to read more resources about encoding and charsets, and also
> read about the experiences of people on this list.
>
> I would also like to read about the experiences of people who speak (and
> write pages in) Spanish. Is there anyone like that on the list?
>
> Thanks, everyone! Thank you! Excuse my poor English!
> Julián Landerreche
> Buenos Aires, Argentina
> www.midi-midi.com.ar (not finished yet)
 
 


-- 
Gmail invites - just ask nicely



Re: [WSG] choosing encoding, charset and using special characters

2004-11-18 Thread Matthew






Hi Julián,

I think it's a difficult article even for techies, because there's
little good advice. So here's some good advice:
http://www.joelonsoftware.com/articles/Unicode.html

"In this article I'll fill you in on exactly what every working
programmer should know. All that stuff about "plain text = ascii =
characters are 8 bits" is not only wrong, it's hopelessly wrong, and if
you're still programming that way, you're not much better than a medical
doctor who doesn't believe in germs. Please do not write another line of
code until you finish reading this article."


//

"1) Question: Is there a way to use special characters directly in
the code?
"

If those characters are in 8859-1, then you can use them. But because
8859-1 shares that byte range with lots of other encodings, some software
(like Google) can get confused when it tries to merge multiple
charsets. That might be the Google problem you were seeing.
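
One possible way such confusion shows up is a mismatch between the bytes on the page and the encoding assumed when reading them. The Python sketch below reproduces both directions using the page title from the original question. Whether this is what actually happened in the Google listing is an assumption, and looks_like_utf8 is a made-up helper, not a real detection algorithm.

```python
title = "servicio técnico especializado para músicos"

# UTF-8 bytes read as ISO-8859-1: each accented letter becomes two characters.
print(title.encode("utf-8").decode("iso-8859-1"))

# ISO-8859-1 bytes read as UTF-8: the lone high bytes are invalid, so each
# accented letter becomes the U+FFFD replacement character.
print(title.encode("iso-8859-1").decode("utf-8", errors="replace"))


def looks_like_utf8(data: bytes) -> bool:
    """Rough check: does the byte string decode cleanly as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_utf8(title.encode("utf-8")))
print(looks_like_utf8(title.encode("iso-8859-1")))
```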


"2) I have seen a lot of webpages that directly use the special
character and don't code them as html entities. These pages are displayed
correctly.

Question: Is this a good or bad practice (to use special characters in
code, instead of entities)?
"

Character entities can be written in plain ASCII, whereas literal "special
characters" are stored in the file's encoding (regardless of whether that is
Unicode or 8859). So if your software supports Unicode encoding (e.g., a
UTF-8 encoded file with 'extended characters' doesn't get mangled), then
it doesn't really matter.

There are very few browsers that don't display Unicode correctly when
given encoded characters or entities. When browsers aren't Unicode
aware they tend to display unknown entities as question marks, whereas
unknown encoded characters come out as garbled text, if that matters.

So it seems that it mostly comes down to your internal software support,
rather than browsers.
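
The difference between the two approaches can be seen at the byte level. This is a small Python sketch with an arbitrary sample word. Python's xmlcharrefreplace error handler stands in for whatever tool writes the references.

```python
text = "técnico"

as_references = text.encode("ascii", "xmlcharrefreplace")  # pure ASCII bytes
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

print(as_references)       # the same bytes whatever encoding the file declares
print(as_utf8, as_latin1)  # literal character: bytes depend on the file encoding

# An ASCII-only file decodes identically under all three encodings, which is
# why references survive software that mangles 'extended characters'.
assert (as_references.decode("ascii")
        == as_references.decode("utf-8")
        == as_references.decode("iso-8859-1"))
```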


"3. In Google results, I found that those special characters arent
always correctly displayed."

It seems that Google uses Unicode (it has the metatag, the special
characters are Unicode encoded rather than entities). If you do a
Google search for "macron site:e-government.govt.nz" you'll see that
the Maori language is displaying correctly in Google. So it seems that
Google doesn't have a problem with Unicode, but maybe it has a problem
with merging multiple 'extended-ascii' charsets on a single page.

I think the general opinion is that unless you've got a legacy system
then Unicode, via UTF-8, is where people should already be.



.Matthew Cruickshank
http://holloway.co.nz/