Re: [WSG] HTML Numeric and Named Entities
liorean wrote:
> On 11/01/06, Lachlan Hunt <[EMAIL PROTECTED]> wrote:
> > As far as character references in HTML are concerned, they have always
> > referred to the Unicode code points since HTML 2.0.
>
> Ah. I just saw
>   BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
>   BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
> in HTML3.2 and
>   BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
> in HTML4.01 SGML declarations and assumed the first one (ISO-646) was
> ANSI, the second one (ECMA-94) was the extended 8-bit characters
> (latin-1) and the third one (ISO-10646) was Unicode. This assumption
> was wrong?

Oh, you're absolutely right. My mistake: ISO-646 is US-ASCII, and I forgot that the formal change to ISO-10646 only came in HTML 4. However, ISO-10646 is mentioned in the prose of RFC 1866 several times, and implementations are advised that numeric character references (beyond Latin-1) should reference those code points. HTML 2 does formally use Latin-1 (ISO-8859-1) for character references, but those code points are a subset of ISO-10646 anyway.

--
Lachlan Hunt
http://lachy.id.au/

**
The discussion list for http://webstandardsgroup.org/
See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list & getting help
**
Re: [WSG] HTML Numeric and Named Entities
On 11/01/06, Lachlan Hunt <[EMAIL PROTECTED]> wrote:
> liorean wrote:
> > Character references refer to Unicode code points independent of the
> > document encoding and character set. At least for HTML4 and XML, if
> > not for HTML3.2.
>
> As far as character references in HTML are concerned, they have always
> referred to the Unicode code points since HTML 2.0.

Ah. I just saw
  BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
  BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
in HTML3.2 and
  BASESET "ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6"
in HTML4.01 SGML declarations and assumed the first one (ISO-646) was ANSI, the second one (ECMA-94) was the extended 8-bit characters (latin-1) and the third one (ISO-10646) was Unicode. This assumption was wrong?

> See my article:
> http://lachy.id.au/log/2005/10/char-refs
> (take note of the comments too, which contain a few corrections)

I read it months ago :)

--
David "liorean" Andersson
http://liorean.web-graphics.com/
Re: [WSG] HTML Numeric and Named Entities
liorean wrote:
> On 11/01/06, Kat <[EMAIL PROTECTED]> wrote:
> > Is it safe to use the named references that formerly referred to the
> > control characters?

Yes, it's safe to use the named entity references in HTML4, but it's easier to just use UTF-8 and type the actual characters instead.

&mdash; (or any other entity reference) has never referred to a control character. You're getting confused by the fact that IE (and now every other HTML browser, for compatibility) incorrectly interprets character references from &#128; to &#159; (and their hex equivalents) as though the Document Character Set were Windows-1252. This has never been defined in any standard; it is nothing more than widely implemented broken behaviour.

> Multi level answer here:
> - text/html: Should be perfectly safe.

Yes, it only depends on the availability of fonts and support for the characters used. Not all characters are supported by every browser. For example, the character referred to by &shy; (soft hyphen) isn't supported by Mozilla yet. Also, some older and obsolete browsers don't support all named entities.

> - application/xhtml+xml: Should be, but isn't, safe except for the
>   five named entities of XML. Use decimal or hexadecimal character
>   references instead.
> - application/xml: Only safe in validating user agents. Which doesn't
>   include browsers. So, use decimal or hexadecimal character references.

There is no difference in how the two MIME types are handled: both require a validating parser to handle named entity references. The exception to the rule is that some browsers, such as Mozilla, despite not implementing a validating parser, may have a pseudo-DTD catalog containing just these entity references. Mozilla uses this catalog when it encounters an XHTML DOCTYPE in an XML document, regardless of the MIME type. (It works similarly for MathML too.)

> Character references refer to Unicode code points independent of the
> document encoding and character set. At least for HTML4 and XML, if
> not for HTML3.2.

As far as character references in HTML are concerned, they have always referred to the Unicode code points since HTML 2.0.

See my article:
http://lachy.id.au/log/2005/10/char-refs
(take note of the comments too, which contain a few corrections)

--
Lachlan Hunt
http://lachy.id.au/
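The gap between what &#151; formally means and what browsers display can be seen in a minimal Python sketch. It assumes that decoding a single byte with Python's "cp1252" codec is a fair stand-in for the Windows-1252 interpretation described above; `cp1252_view` is a hypothetical helper written for this illustration, not part of any library.

```python
import html
import unicodedata


def cp1252_view(code_point: int) -> str:
    """Hypothetical helper: read one byte value as Windows-1252.

    A few bytes in the 128-159 range are unassigned in Windows-1252;
    for those the raw C1 control character is returned unchanged.
    """
    try:
        return bytes([code_point]).decode("cp1252")
    except UnicodeDecodeError:
        return chr(code_point)


# Per the standards, &#151; names code point 151 (U+0097), a control character.
by_the_book = chr(151)
print(unicodedata.category(by_the_book))  # general category of U+0097

# Read as a Windows-1252 byte, 151 is the em dash (U+2014).
as_displayed = cp1252_view(151)
print(unicodedata.name(as_displayed))

# html.unescape is documented as following the HTML 5 rules, which include
# a remapping for the 128-159 references; compare it with the helper.
print(html.unescape("&#151;") == as_displayed)
```

The same helper applied to 128 and 159 gives the euro sign and the capital Y with diaeresis, which are the two ends of the range mentioned above.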
Re: [WSG] HTML Numeric and Named Entities
On 11/01/06, Kat <[EMAIL PROTECTED]> wrote:
> Is it safe to use the named references that formerly referred to the control
> characters?

Multi level answer here:
- text/html: Should be perfectly safe.
- application/xhtml+xml: Should be, but isn't, safe except for the five named entities of XML. Use decimal or hexadecimal character references instead.
- application/xml: Only safe in validating user agents. Which doesn't include browsers. So, use decimal or hexadecimal character references.

> If you have used these named references in the past, so long as you
> have (or update to) the correct character encoding,
> do these automatically refer to the correct entities?

Character references refer to Unicode code points independent of the document encoding and character set. At least for HTML4 and XML, if not for HTML3.2. Named entities just map to the corresponding character references.

--
David "liorean" Andersson
http://liorean.web-graphics.com/
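The XML side of this answer can be tried with a non-validating parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, on the assumption that it behaves like the DTD-less XML parsers discussed here; `parse_text` is a hypothetical helper and the `<p>` fragments are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_text(fragment: str) -> str:
    """Hypothetical helper: parse a one-element fragment, return its text."""
    return ET.fromstring(fragment).text


# The five entities predefined by XML itself need no DTD.
predefined = parse_text("<p>&amp;&lt;&gt;&quot;&apos;</p>")

# Decimal and hexadecimal character references also work without a DTD.
numeric = parse_text("<p>&#8212; &#x2014;</p>")

# An HTML named entity is undefined when no DTD has been read.
try:
    parse_text("<p>&mdash;</p>")
    named_ok = True
except ET.ParseError:
    named_ok = False

print(repr(predefined), repr(numeric), named_ok)
```

This is why the numeric forms are the portable choice for XML: they do not depend on the parser having read any entity declarations.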
Re: [WSG] HTML Numeric and Named Entities
Hi Kat,

On 11 Jan 2006 at 10:29, Kat wrote:
> I am aware that &#151; is an incorrect character entity for the em dash,
> that the correct entity is &#8212;.

#151 is definitely wrong, or very, very old. http://www.sql-und-xml.de/unicode-database/latin-1-supplement.html lists it as 'END OF GUARDED AREA'.

All dashes have their own category, 'Punctuation Dash'. They begin with the standard '-', then some at 8210 ..., and others. See http://www.sql-und-xml.de/unicode-database/pd.html for the complete category.

> If you have used these named references in the past, so long as you
> have (or update to) the correct character encoding,
> do these automatically refer to the correct entities?

HTML 4.0 is old, so most browsers may show &#151; correctly, as #8212.

Best Regards
Juergen Auer

Jürgen Auer, www.sql-und-xml.de
Web databases for rent
Friedenstr. 37, 10 249 Berlin
Tel.: (030) 420 20 060
Fax: (030) 420 19 819
[EMAIL PROTECTED]
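The category lookup described above can be reproduced with Python's standard unicodedata module. This is only a sketch: `describe` is a hypothetical helper, and the code points are the ones mentioned in this thread plus the hyphen-minus.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str]:
    """Hypothetical helper: (general category, name) for a code point."""
    ch = chr(code_point)
    # Control characters have no formal name, so supply a placeholder.
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


# 45 is the plain '-', 151 is the disputed reference, 8210-8212 are dashes.
for cp in (45, 151, 8210, 8211, 8212):
    category, name = describe(cp)
    print(cp, category, name)
```

In this output "Pd" is the Punctuation Dash category and "Cc" marks a control character.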
[WSG] HTML Numeric and Named Entities
I am aware that &#151; is an incorrect character entity for the em dash, that the correct entity is &#8212;.

But I was mucking about on the W3C Character entity references in HTML 4
http://www.w3.org/TR/REC-html40/sgml/entities.html
and noted that the named entity references are now linked to the decimal character entity reference, so that mdash refers to &#8212;.

Is it safe to use the named references that formerly referred to the control characters?

If you have used these named references in the past, so long as you have (or update to) the correct character encoding, do these automatically refer to the correct entities?

Kat
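The name-to-number link in the HTML 4 entity table can be checked from Python, whose standard html.entities module carries a copy of the HTML 4 definitions. A minimal sketch, assuming that table matches the W3C list linked above; the variable names are made up for the example.

```python
import html
from html.entities import name2codepoint

# Code point that the HTML 4 table assigns to the name "mdash".
mdash_code_point = name2codepoint["mdash"]
print(mdash_code_point)

# The named and the decimal reference should decode to the same character.
from_name = html.unescape("&mdash;")
from_number = html.unescape("&#8212;")
print(from_name == from_number)
```

Both forms name a Unicode code point, so the result does not depend on the encoding the document is saved in.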