Re: [WSG] HTML Numeric and Named Entities

2006-01-11 Thread liorean
On 11/01/06, Lachlan Hunt [EMAIL PROTECTED] wrote:
 liorean wrote:
  Character references refer to Unicode code points independent of the
  document encoding and character set. At least for HTML4 and XML, if
  not for HTML3.2.

 As far as character references in HTML are concerned, they have always
 referred to the Unicode code points since HTML 2.0.

Ah. I just saw

 BASESET  ISO 646:1983//CHARSET
   International Reference Version
   (IRV)//ESC 2/5 4/0
 BASESET  ISO Registration Number 100//CHARSET
   ECMA-94 Right Part of
   Latin Alphabet Nr. 1//ESC 2/13 4/1

in HTML3.2 and

 BASESET  ISO Registration Number 177//CHARSET
   ISO/IEC 10646-1:1993 UCS-4 with
   implementation level 3//ESC 2/5 2/15 4/6

in the HTML4.01 SGML declaration, and assumed the first one (ISO-646)
was ANSI, the second one (ECMA-94) was the extended 8-bit characters
(Latin-1) and the third one (ISO-10646) was Unicode. Was this
assumption wrong?

 See my article:
 http://lachy.id.au/log/2005/10/char-refs
 (take note of the comments too, which contain a few corrections)

I read it months ago :)
--
David liorean Andersson
uri:http://liorean.web-graphics.com/
**
The discussion list for  http://webstandardsgroup.org/

 See http://webstandardsgroup.org/mail/guidelines.cfm
 for some hints on posting to the list & getting help
**



Re: [WSG] HTML Numeric and Named Entities

2006-01-11 Thread Lachlan Hunt

liorean wrote:

On 11/01/06, Lachlan Hunt [EMAIL PROTECTED] wrote:

As far as character references in HTML are concerned, they have always
referred to the Unicode code points since HTML 2.0.


Ah. I just saw

 BASESET  ISO 646:1983//CHARSET
   International Reference Version
   (IRV)//ESC 2/5 4/0
 BASESET  ISO Registration Number 100//CHARSET
   ECMA-94 Right Part of
   Latin Alphabet Nr. 1//ESC 2/13 4/1

in HTML3.2 and

 BASESET  ISO Registration Number 177//CHARSET
   ISO/IEC 10646-1:1993 UCS-4 with
   implementation level 3//ESC 2/5 2/15 4/6


Oh, you're absolutely right.  My mistake: ISO-646 is US-ASCII, and I
forgot that the document character set was not formally changed to
ISO-10646 until after HTML 3.2.  However, ISO-10646 is mentioned several
times in the prose of RFC 1866, and implementations are advised that
numeric character references beyond Latin-1 should reference those code
points.  HTML 2 does formally use Latin-1 (ISO-8859-1) for char refs,
but these code points are a subset of ISO-10646 anyway.
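
The subset relationship mentioned above can be checked mechanically. A
minimal sketch in Python (the language and the helper name are my own
choices, not anything from the specs), assuming Python's built-in
"latin-1" codec is a faithful stand-in for ISO-8859-1:

```python
def latin1_matches_unicode():
    """Check that each Latin-1 byte decodes to the code point of the same number."""
    # bytes([b]) is the single byte b; chr(b) is the Unicode code point b.
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


print(latin1_matches_unicode())
```

If the check holds, a numeric character reference written against
Latin-1 names the same character when read against ISO-10646.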


--
Lachlan Hunt
http://lachy.id.au/




Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread Juergen Auer
Hi Kat,


On 11 Jan 2006 at 10:29, Kat wrote:


 I am aware that &#151; is an incorrect character entity for the em dash,
 that the correct entity is &#8212;.

&#151; is definitely wrong, or very, very old.

http://www.sql-und-xml.de/unicode-database/latin-1-supplement.html

lists it as 'END OF GUARDED AREA'.

All dashes have their own category, 'Punctuation Dash'. It begins
with the standard '-', then continues with &#8210; and others. See

http://www.sql-und-xml.de/unicode-database/pd.html

for the complete category.
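
The same lookups can be sketched with Python's standard unicodedata
module. This is only an illustration: the helper names are made up for
this example, and the result depends on the Unicode version bundled with
the Python installation.

```python
import unicodedata


def general_category(code_point):
    """Return the Unicode general category of a code point, e.g. 'Cc' or 'Pd'."""
    return unicodedata.category(chr(code_point))


def dashes_below(limit):
    """List the code points below `limit` in category Pd (dash punctuation)."""
    return ["U+%04X" % cp for cp in range(limit)
            if unicodedata.category(chr(cp)) == "Pd"]


print(general_category(151))    # 'Cc': a control character, not a dash
print(general_category(8212))   # 'Pd': the em dash
print(dashes_below(0x2020))
```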

 If you have used these named references in the past, then so long as
 you have (or update to) the correct character encoding,
 do these automatically refer to the correct entities?


HTML 4.0 is old enough that most browsers should show &mdash;
correctly as &#8212;.

Best Regards
Juergen Auer



Jürgen Auer, www.sql-und-xml.de
Web-Datenbanken zum Mieten
Friedenstr. 37, 10 249 Berlin
Tel.: (030) 420 20 060
Fax: (030) 420 19 819
[EMAIL PROTECTED]



Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread liorean
On 11/01/06, Kat [EMAIL PROTECTED] wrote:
 Is it safe to use the named references that formerly referred to the control
 characters?

Multi-level answer here:
- text/html: Should be perfectly safe.
- application/xhtml+xml: Should be, but isn't, safe except for the
five named entities of XML. Use decimal or hexadecimal character
references instead.
- application/xml: Only safe in validating user agents. Which doesn't
include browsers. So, use decimal or hexadecimal character references.
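
For the XML cases, the substitution can be automated. A rough sketch in
Python, assuming the HTML 4 entity table shipped in the standard
library's html.entities module is the one you want; to_numeric_refs and
XML_PREDEFINED are hypothetical names made up for this example:

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup):
    """Replace HTML 4 named entity references with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined and unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(replace, markup)


print(to_numeric_refs("<p>Fish &amp; chips &mdash; &euro;5</p>"))
# prints: <p>Fish &amp; chips &#8212; &#8364;5</p>
```

Unknown names are deliberately left alone rather than guessed at, so a
typo in the source still surfaces as a parse error later.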

 If you have used these named references in the past, then so long as
 you have (or update to) the correct character encoding,
 do these automatically refer to the correct entities?

Character references refer to Unicode code points independent of the
document encoding and character set. At least for HTML4 and XML, if
not for HTML3.2. Named entities just map to the corresponding
character references.
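
To illustrate the point, here is a small sketch in Python. It relies on
html.unescape from the standard library, which follows the HTML5 rules
rather than HTML4's, so treat it as an approximation; encoded_forms is a
hypothetical helper made up for this example.

```python
import html


def encoded_forms(reference):
    """Resolve a character reference, then encode the result two ways."""
    char = html.unescape(reference)
    return {
        "code_point": ord(char),
        "utf-8": char.encode("utf-8"),
        "cp1252": char.encode("cp1252"),
    }


# Decimal, hexadecimal and named forms all name the same code point.
for ref in ("&#8212;", "&#x2014;", "&mdash;"):
    print(ref, encoded_forms(ref))
```

All three references give code point 8212, while the bytes differ per
encoding: b'\xe2\x80\x94' in UTF-8 and b'\x97' in Windows-1252. The
reference is tied to the code point, not to the bytes.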
--
David liorean Andersson
uri:http://liorean.web-graphics.com/



Re: [WSG] HTML Numeric and Named Entities

2006-01-10 Thread Lachlan Hunt

liorean wrote:

On 11/01/06, Kat [EMAIL PROTECTED] wrote:

Is it safe to use the named references that formerly referred to the control 
characters?


Yes, it's safe to use the named entity references in HTML4, but it's 
easier to just use UTF-8 and type the actual characters instead. 
&mdash; (or any other entity reference) has never referred to a control 
character.  You're getting confused by the fact that IE (and now every 
other HTML browser, for compatibility) incorrectly interprets character 
references from &#128; to &#159; (and their hex equivalents) as though 
the Document Character Set were Windows-1252.  This has never been 
defined in any standard; it is nothing more than widely implemented 
broken behaviour.
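
The browser behaviour described above can be approximated in a few lines
of Python. This is a sketch of the idea only, not any browser's actual
code; legacy_char_ref is a hypothetical name, and Python's "cp1252"
codec is assumed to match the Windows-1252 table browsers use.

```python
def legacy_char_ref(number):
    """Resolve a numeric character reference the way the browsers above do.

    References 128-159 are reinterpreted as Windows-1252 bytes; everything
    else is taken as a Unicode code point.
    """
    if 128 <= number <= 159:
        try:
            return bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are unassigned in Windows-1252;
            # fall back to the control character itself.
            return chr(number)
    return chr(number)


print(legacy_char_ref(151) == legacy_char_ref(8212))   # True: both give the em dash
```

Going by the standard alone, &#151; would be the control character
U+0097, which is why &#8212; is the reference to write.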



Multi-level answer here:
- text/html: Should be perfectly safe.


Yes, it only depends on the availability of fonts and support for the 
characters used.  Not all characters are supported by every browser. 
For example, the character referred to by &shy; (soft hyphen) isn't 
supported by Mozilla yet.  Also, some older and obsolete browsers don't 
support all named entities.



- application/xhtml+xml: Should be, but isn't, safe except for the
five named entities of XML. Use decimal or hexadecimal character
references instead.
- application/xml: Only safe in validating user agents. Which doesn't
include browsers. So, use decimal or hexadecimal character references.


There is no difference between the handling of the two MIME types: both 
require the use of a validating parser to handle named entity 
references.  The exception to the rule is that some browsers, such as 
Mozilla, despite not implementing a validating parser, may have a 
pseudo-DTD catalog containing just these entity references.  Mozilla 
uses this catalog when it encounters an XHTML DOCTYPE in an XML 
document, regardless of the MIME type.  (It works similarly for MathML too.)
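
The effect is easy to reproduce with any non-validating XML parser. A
minimal sketch using Python's xml.etree.ElementTree, which stands in
here for a browser's XML parser purely as an assumption for
illustration; is_parseable is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def is_parseable(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_parseable("<p>&mdash;</p>"))   # False: entity not declared anywhere
print(is_parseable("<p>&#8212;</p>"))   # True: numeric reference needs no DTD
print(is_parseable("<p>&amp;</p>"))     # True: one of XML's five predefined entities
```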



Character references refer to Unicode code points independent of the
document encoding and character set. At least for HTML4 and XML, if
not for HTML3.2.


As far as character references in HTML are concerned, they have always 
referred to the Unicode code points since HTML 2.0.


See my article:
http://lachy.id.au/log/2005/10/char-refs
(take note of the comments too, which contain a few corrections)

--
Lachlan Hunt
http://lachy.id.au/
