RE: Benefits of Unicode

Richard, Francois M Mon, 29 Jan 2001 08:11:45 -0800
More questions...

> More to Francois:
> 
> 
> >> When I create and exchange an HTML file for instance:
> >> <HTML>
> >> <TITLE>bla</TITLE>
> >> </HTML>
> >>
> >> only 'bla' is plain text. To conform to Unicode, does it mean I have to use
> >> the Unicode character set and encoding ONLY for 'bla'? (which would indicate
> >> mixing of character encoding in one single file)
> 
> The HTML spec uses textual markup (as opposed to some binary file format),
> so what constitutes plain text depends on how you're interpreting an HTML
> file. To an HTML parser, in a sense, it's all plain text; that is, the
> parser has to interpret the plain text data to identify tokens like <HTML>.
> After that parsing has occurred, then at a different level the file has been
> analysed into content portions and markup portions, and at this level only
> the content portions are seen as plain text.
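
The two levels Peter describes can be sketched in Python, assuming the
standard library's html.parser module stands in for "an HTML parser". The
class name Splitter and the helper split_markup_and_content are hypothetical
names chosen for this illustration only. The parser receives the whole file
as one string of text and only afterwards reports which parts were markup
and which were content.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Hypothetical helper: sorts parsed input into markup and content."""

    def __init__(self):
        super().__init__()
        self.markup = []   # tokens recognised as tags
        self.content = []  # character data found between tags

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        self.markup.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.markup.append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore the whitespace-only runs between tags.
        if data.strip():
            self.content.append(data)


def split_markup_and_content(text):
    parser = Splitter()
    parser.feed(text)  # at this level the whole file is just text
    parser.close()
    return parser.markup, parser.content


markup, content = split_markup_and_content("<HTML>\n<TITLE>bla</TITLE>\n</HTML>")
print(markup)
print(content)
```

Only after parsing does 'bla' stand apart from the tags; before that, the
parser had to read every character of the file in the same way.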

That would be my next question: although I might have an HTML file encoded
in iso-8859-1, the parser has to interpret it following the markup AND using
the Unicode repertoire (CCS).
Is this flexibility taken into consideration anywhere in Unicode?
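
To make the situation concrete, here is a minimal sketch in Python, assuming
the file's bytes are iso-8859-1 and that characters outside that encoding are
written as HTML numeric character references. The function name
decode_and_expand is hypothetical, and html.unescape is applied to the whole
string for brevity, where a real parser would expand references only in
content.

```python
import html


def decode_and_expand(raw: bytes, declared_encoding: str) -> str:
    """Decode the bytes, then expand character references to code points."""
    text = raw.decode(declared_encoding)  # byte level: the file's own encoding
    return html.unescape(text)            # reference level: Unicode code points


# The euro sign has no byte in iso-8859-1, so it is written as &#8364;
# (decimal for U+20AC); the e-acute is stored directly as the byte 0xE9.
raw = "<TITLE>caf\u00e9 costs &#8364;2</TITLE>".encode("iso-8859-1")
result = decode_and_expand(raw, "iso-8859-1")
print(result)
```

The bytes never leave iso-8859-1, yet the reference number is read as a
Unicode code point regardless of the encoding the file was stored in.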

 
> 
> While it might be possible to create an HTML-like specification in which
> the markup and the content could potentially be in different encodings (with
> some constraints: you need to avoid byte sequences in content that can be
> wrongly interpreted as markup), this is not the case for HTML or for XML:
> the entire file, markup and content, must be in the same encoding.
> 
> 
> 
> >> The second problem I can see with Unicode is the fact that although the
> >> character set is universal, the encoding forms are multiple (UTF-8, UTF-16
> >> and UTF-32).
> 
> How is that a problem? It is the kind of flexibility that makes Unicode
> very practical for implementers. It may be necessary to translate from one
> encoding form to another on occasion, but that is very simple.
> 
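
As a rough illustration of the translation Peter mentions, here is a short
Python sketch that relies on the built-in codecs. The helper name transcode
and the sample string are assumptions made for this example. The same
sequence of code points goes through three encoding forms and comes back
unchanged; only the byte counts differ.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via code points."""
    return data.decode(source).encode(target)


# Nine code points: ASCII, U+00E9, U+20AC and U+1D11E (outside the BMP).
text = "bla \u00e9 \u20ac \U0001d11e"

utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-be")
utf32 = transcode(utf16, "utf-16-be", "utf-32-be")

print(len(utf8), len(utf16), len(utf32))
print(utf32.decode("utf-32-be") == text)
```

Each step goes through the code points, which is why no information is lost
on the way.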
Since there is more than one encoding for the Unicode CCS, the encoding used
has to be declared. I thought the idea behind Unicode was to be unambiguous. I
see an unambiguous CCS, but not an unambiguous CEF or CES.
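
A small Python sketch of what I mean by "has to be declared"; the function
name interpret is hypothetical. The same two bytes give different characters
depending on which encoding the receiver assumes.

```python
def interpret(raw: bytes, assumed_encoding: str) -> str:
    """Decode the same bytes under whatever encoding the receiver assumes."""
    return raw.decode(assumed_encoding)


raw = "\u00e9".encode("utf-8")  # the two bytes 0xC3 0xA9

as_utf8 = interpret(raw, "utf-8")         # one character: U+00E9
as_latin1 = interpret(raw, "iso-8859-1")  # two characters: U+00C3 U+00A9

print(len(as_utf8), len(as_latin1))
```

Nothing in the bytes themselves says which reading is the intended one.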

And this goes back to my first question:
if a protocol or format allows different CEFs and CESs (not only the UTFs) but
provides a way to access the whole Unicode CCS, and agents have to interpret
it following the Unicode CCS, isn't this compliant with Unicode (in a way:
the "CCS" way)?

-Francois

> 
> 
> 
> - Peter
> 
> 
> ---------------------------------------------------------------------------
> Peter Constable
> 
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
> 
>