Re: [WSG] HTML Numeric and Named Entities

2006-01-11 Thread Lachlan Hunt


liorean wrote:
> On 11/01/06, Lachlan Hunt <[EMAIL PROTECTED]> wrote:
> > As far as character references in HTML are concerned, they have always
> > referred to the Unicode code points since HTML 2.0.
>
> Ah. I just saw
>
>  BASESET  "ISO 646:1983//CHARSET
>    International Reference Version
>    (IRV)//ESC 2/5 4/0"
>  BASESET  "ISO Registration Number 100//CHARSET
>    ECMA-94 Right Part of
>    Latin Alphabet Nr. 1//ESC 2/13 4/1"
>
> in HTML3.2 and
>
>  BASESET  "ISO Registration Number 177//CHARSET
>    ISO/IEC 10646-1:1993 UCS-4 with
>    implementation level 3//ESC 2/5 2/15 4/6"


Oh, you're absolutely right, my mistake.  ISO-646 is US-ASCII; I forgot 
that the document character set only formally changed to ISO-10646 in 
HTML 4.  ISO-10646 is, however, mentioned several times in the prose of 
RFC 1866, and implementations are advised that numeric character 
references beyond Latin-1 should reference its code points.  HTML 2 
formally uses Latin-1 (ISO-8859-1) for character references, but those 
code points are a subset of ISO-10646 anyway.


--
Lachlan Hunt
http://lachy.id.au/

**
The discussion list for  http://webstandardsgroup.org/

See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
**

Re: [WSG] HTML Numeric and Named Entities

2006-01-11 Thread liorean

On 11/01/06, Lachlan Hunt <[EMAIL PROTECTED]> wrote:
> liorean wrote:
> > Character references refer to Unicode code points independent of the
> > document encoding and character set. At least for HTML4 and XML, if
> > not for HTML3.2.
>
> As far as character references in HTML are concerned, they have always
> referred to the Unicode code points since HTML 2.0.

Ah. I just saw

 BASESET  "ISO 646:1983//CHARSET
   International Reference Version
   (IRV)//ESC 2/5 4/0"
 BASESET  "ISO Registration Number 100//CHARSET
   ECMA-94 Right Part of
   Latin Alphabet Nr. 1//ESC 2/13 4/1"

in HTML3.2 and

  BASESET  "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with
implementation level 3//ESC 2/5 2/15 4/6"

in HTML4.01 SGML declarations and assumed the first one (ISO-646) was
ANSI, the second one (ECMA-94) was the extended 8-bit characters
(latin-1) and the third one (ISO-10646) was Unicode. Was this
assumption wrong?

> See my article:
> http://lachy.id.au/log/2005/10/char-refs
> (take note of the comments too, which contain a few corrections)

I read it months ago :)
--
David "liorean" Andersson
http://liorean.web-graphics.com/

Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread Lachlan Hunt


liorean wrote:
> On 11/01/06, Kat <[EMAIL PROTECTED]> wrote:
> > Is it safe to use the named references that formerly referred to the control 
> > characters?


Yes, it's safe to use the named entity references in HTML4, but it's 
easier to just use UTF-8 and type the actual characters instead. 
&mdash; (or any other entity reference) has never referred to a control 
character.  You're getting confused by the fact that IE (and now every 
other HTML browser, for compatibility) incorrectly interprets character 
references from &#128; to &#159; (and their hex equivalents) as though 
the Document Character Set were Windows-1252.  This has never been 
defined in any standard; it is nothing more than widely implemented 
broken behaviour.
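
The remapping described above can be observed with a short Python sketch. It assumes that the standard-library `html.unescape` applies the same browser-compatible handling of the 128 to 159 range; the variable names are illustrative only and do not come from any specification.

```python
import html

# Numeric references 128-159 name C1 control characters in Unicode, but
# browser-compatible parsers remap most of them through Windows-1252.
# html.unescape applies the same compatibility rule.
legacy = html.unescape("&#151;")    # decimal reference inside the C1 range
proper = html.unescape("&#8212;")   # the real em dash, U+2014
named = html.unescape("&mdash;")    # named reference for the same character

# Byte 0x97 is the em dash in Windows-1252, which is where the habit
# of writing &#151; comes from.
from_cp1252 = bytes([0x97]).decode("cp1252")

# Both ends of the range, one in decimal and one in hex.
range_start = html.unescape("&#128;")
range_end = html.unescape("&#x9F;")

print([hex(ord(c)) for c in (legacy, proper, named, from_cp1252)])
print(hex(ord(range_start)), hex(ord(range_end)))
```

All four em dash variants come out as the same character, which is why pages using &#151; appear to work even though the reference is wrong.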



> Multi level answer here:
> - text/html: Should be perfectly safe.


Yes, it only depends on the availability of fonts and support for the 
characters used.  Not all characters are supported by every browser. 
For example, the character referred to by &shy; (soft hyphen) isn't 
supported by Mozilla yet.  Also, some older and obsolete browsers don't 
support all named entities.



> - application/xhtml+xml: Should be, but isn't, safe except for the
> five named entities of XML. Use decimal or hexadecimal character
> references instead.
> - application/xml: Only safe in validating user agents. Which doesn't
> include browsers. So, use decimal or hexadecimal character references.


There is no difference in how the two MIME types are handled: both 
require a validating parser to handle named entity references.  The 
exception to the rule is that some browsers, such as Mozilla, despite 
not implementing a validating parser, may have a pseudo-DTD catalog 
containing just these entity references.  Mozilla uses this catalog 
when it encounters an XHTML DOCTYPE in an XML document, regardless of 
the MIME type.  (It works similarly for MathML.)
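
As a rough illustration of the XML side, here is a minimal Python sketch. It uses `xml.etree.ElementTree` as a stand-in for a non-validating XML parser with no DTD catalog; that substitution is an assumption of the example and says nothing about how any particular browser is implemented. The helper name `parses` is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses(source: str) -> bool:
    """Return True if the non-validating parser accepts the document."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# The five entities predefined by XML itself are always available.
predefined = parses("<p>&amp; &lt; &gt; &quot; &apos;</p>")

# Decimal and hexadecimal character references need no DTD either.
numeric = parses("<p>&#8212; &#x2014;</p>")

# An HTML named entity is undefined unless a DTD declares it.
named = parses("<p>&mdash;</p>")

print(predefined, numeric, named)
```

The third document is rejected because nothing declares `mdash`, which is the situation a pseudo-DTD catalog works around.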



> Character references refer to Unicode code points independent of the
> document encoding and character set. At least for HTML4 and XML, if
> not for HTML3.2.


As far as character references in HTML are concerned, they have always 
referred to the Unicode code points since HTML 2.0.


See my article:
http://lachy.id.au/log/2005/10/char-refs
(take note of the comments too, which contain a few corrections)

--
Lachlan Hunt
http://lachy.id.au/


Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread liorean

On 11/01/06, Kat <[EMAIL PROTECTED]> wrote:
> Is it safe to use the named references that formerly referred to the control 
> characters?

Multi level answer here:
- text/html: Should be perfectly safe.
- application/xhtml+xml: Should be, but isn't, safe except for the
five named entities of XML. Use decimal or hexadecimal character
references instead.
- application/xml: Only safe in validating user agents. Which doesn't
include browsers. So, use decimal or hexadecimal character references.

> If you have used these named references in the past, so long as you 
> have (or update to) the correct character encoding,
> do these automatically refer to the correct entities?

Character references refer to Unicode code points independent of the
document encoding and character set. At least for HTML4 and XML, if
not for HTML3.2. Named entities just map to the corresponding
character references.
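
A small Python sketch of that point, using the standard-library HTML 4 entity table (`html.entities.name2codepoint`). The list of encodings is just an illustrative choice.

```python
from html import unescape
from html.entities import name2codepoint

# HTML 4 named entities are defined in terms of code points.
mdash_code_point = name2codepoint["mdash"]

# The references themselves are plain ASCII, so they survive any
# ASCII-compatible encoding and resolve to the same character each time.
markup = "&mdash; &#8212; &#x2014;"
resolved = {
    encoding: unescape(markup.encode(encoding).decode(encoding))
    for encoding in ("ascii", "iso-8859-1", "utf-8", "cp1252")
}

print(mdash_code_point)
for encoding, text in resolved.items():
    print(encoding, [hex(ord(c)) for c in text if c != " "])
```
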
--
David "liorean" Andersson
http://liorean.web-graphics.com/

Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread Juergen Auer

Hi Kat,


On 11 Jan 2006 at 10:29, Kat wrote:

>
> I am aware that &#151; is an incorrect character entity for the em dash,
> that the correct entity is &#8212;.

#151 is definitely wrong, or very, very old.

http://www.sql-und-xml.de/unicode-database/latin-1-supplement.html

lists it as 'END OF GUARDED AREA'.

All dashes have their own category, 'Punctuation Dash'. They begin
with the standard '-', then 8210 and others. See

http://www.sql-und-xml.de/unicode-database/pd.html

for the complete category.
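
The same lookup can be sketched in Python with the standard-library `unicodedata` module. The four code points are the ones mentioned in this thread; the fallback string is just a placeholder, since control characters have no formal character name in that module.

```python
import unicodedata

# 45 is the plain '-', 151 the C1 control, 8210 and 8212 are real dashes.
code_points = (45, 151, 8210, 8212)

# General category per code point: "Pd" is Punctuation Dash, "Cc" is a control.
categories = {cp: unicodedata.category(chr(cp)) for cp in code_points}

for cp in code_points:
    # Control characters have no formal name, hence the fallback value.
    name = unicodedata.name(chr(cp), "(control character, no name)")
    print(cp, categories[cp], name)
```
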

> If you have used these named references in the past, so long as you 
> have (or update to) the correct character encoding,
> do these automatically refer to the correct entities?
>

HTML 4.0 is old by now, so most browsers should display &mdash;
correctly as #8212.

Best Regards
Juergen Auer



Jürgen Auer, www.sql-und-xml.de
Web-Datenbanken zum Mieten
Friedenstr. 37, 10 249 Berlin
Tel.: (030) 420 20 060
Fax: (030) 420 19 819
[EMAIL PROTECTED]

[WSG] HTML Numeric and Named Entities

2006-01-10 Thread Kat



I am aware that &#151; is an incorrect character entity for the em dash, 
and that the correct entity is &#8212;.

But I was mucking about on the W3C page "Character entity references in HTML 4"
http://www.w3.org/TR/REC-html40/sgml/entities.html

and noted that the named entity references are now linked to the decimal 
character references, so that mdash refers to &#8212;.

Is it safe to use the named references that formerly referred to the control 
characters?

If you have used these named references in the past, then so long as you 
have (or update to) the correct character encoding, do they automatically 
refer to the correct entities?


Kat



