Re: [whatwg] Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology

2007-06-23 Thread Ian Hickson
On Sat, 11 Mar 2006, Henri Sivonen wrote:
 
 I think allowing in-place decoder change (when feasible) would be good 
 for performance.

Done.


   I think it would be beneficial to additionally stipulate that
  
   1. The meta element-based character encoding information declaration 
   is expected to work only if the Basic Latin range of characters maps 
   to the same bytes as in the US-ASCII encoding.
  
  Is this realistic? I'm not really familiar enough with character 
  encodings to say if this is what happens in general.
 
 I suppose it is realistic. See below.

That was already there, turns out.


   2. If there is no external character encoding information nor a BOM 
   (see below), there MUST NOT be any non-ASCII bytes in the document 
   byte stream before the end of the meta element that declares the 
   character encoding. (In practice this would ban unescaped non-ASCII 
   class names on the html and [head] elements and non-ASCII comments 
   at the beginning of the document.)
  
  Again, can we realistically require this? I need to do some studies of 
  non-latin pages, I guess.
 
 As UA behavior, no. As a conformance requirement, maybe.

I don't think we should require this, given the preparse step. I can if 
people think we should, though.


Authors should avoid including inline character encoding 
information. Character encoding information should instead be 
included at the transport level (e.g. using the HTTP Content-Type 
header).
   
   I disagree.
   
   With HTML with contemporary UAs, there is no real harm in including 
   the character encoding information both on the HTTP level and in the 
   meta as long as the information is not contradictory. On the 
   contrary, the author-provided internal information is actually 
   useful when end users save pages to disk using UAs that do not 
   reserialize with internal character encoding information.
  
  ...and it breaks everything when you have a transcoding proxy, or 
  similar.
 
 Well, not until you save to disk, since HTTP takes precedence. However, 
 authors can escape this by using UTF-8. (Assuming here that tampering 
 with UTF-8 would be harmful, wrong and pointless.)
 
 Interestingly, transcoding proxies tend to be brought up by residents of 
 Western Europe, North America or the Commonwealth. I have never seen a 
 Russian person living in Russia or a Japanese person living in Japan 
 talk about transcoding proxies in any online or offline discussion. 
 That's why I doubt the importance of transcoding proxies.

I think this discouragement has been removed now. Let me know if it lives 
on somewhere.


  Character encoding information shouldn't be duplicated, IMHO, that's 
  just asking for trouble.
 
 I suggest a mismatch be considered an easy parse error and, therefore, 
 reportable.

I believe this is required in the spec.


For HTML, user agents must use the following algorithm in 
determining the character encoding of a document:
1. If the transport layer specifies an encoding, use that.
   
   Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; 
   UTF-32 makes no practical sense for interchange on the Web.)
  
  I don't know, should there?
 
 I believe there should.

There's a BOM step in the spec; let me know if you think it's in the wrong 
place.
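
For reference, a minimal sketch of what a BOM step looks like (my code,
not spec text; the spec's actual algorithm governs):

    // Check the first bytes of the stream for a byte order mark
    // (UTF-8 and UTF-16 only, per the suggestion above).
    function sniffBOM(bytes) {
      if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF)
        return 'UTF-8';
      if (bytes[0] === 0xFE && bytes[1] === 0xFF)
        return 'UTF-16BE';
      if (bytes[0] === 0xFF && bytes[1] === 0xFE)
        return 'UTF-16LE';
      return null; // no BOM; fall through to the meta prescan
    }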


2. Otherwise, if the user agent can find a meta element that 
specifies character encoding information (as described above), 
then use that.
   
   If a conformance checker has not determined the character encoding 
   by now, what should it do? Should it report the document as 
   non-conforming (my preferred choice)? Should it default to US-ASCII 
   and report any non-ASCII bytes as conformance errors? Should it 
   continue to the fuzzier steps like browsers would (hopefully not)?
  
  Again, I don't know.
 
 I'll continue to treat such documents as non-conforming, then.

I've made it non-conforming to not use ASCII if you've got no encoding 
information and no BOM.
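
As a rough sketch, that conformance check amounts to this (names are
mine, purely illustrative):

    // With no transport-level encoding information and no BOM, the
    // document is conforming only if every byte is ASCII (0x00-0x7F).
    function conformsWithoutEncodingInfo(bytes) {
      for (var i = 0; i < bytes.length; i++) {
        if (bytes[i] > 0x7F) return false;
      }
      return true;
    }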


 Notably, the character encodings that I am aware of that [aren't 
 ASCII-compatible] are:

 JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat 
 and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
 
 The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages. 
 After browsing the encoding menus of Firefox, Opera and Safari, I'm 
 pretty confident that the legacy IBM codepages are irrelevant as well.
 
 I suggest the following algorithm as a starting point. It does not handle
 UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.

I've made those either MUST NOTs or SHOULD NOTs, amongst others.


 Set the REWIND flag to unraised.

The REWIND idea sadly doesn't work very well given that you can actually 
have things like javascript: URIs and event handlers that execute on 
content in the head, in pathological cases.

However, I did something similar in the spec as it stands now.


 Requirements I'd like to see:
 
 Documents must specify a 

Re: [whatwg] Entity parsing

2007-06-23 Thread Sander
I hadn't thought of that one ;-)  (in Dutch there are no native words 
with umlauts, only some of German or Scandinavian descent).
My question was about char-sets that contain both a trema version and a 
(separate) umlaut version of the same character. Are there any?


cheers,
Sander


Kristof Zelechovski wrote:

Only the vowel U can have either, but I have not seen a valid example of
&utrema;.  The orthography ambigüe has recently been changed to ambiguë
for consistency.  Polish nauka (science) and German beurteilen would
make good candidates but the national rules of orthography do not allow this
distinction because Slavic languages do not have diphthongs except in
borrowed words and it would cause ambiguity in German (cf. geübt).
(Incidentally, this leads to bad pronunciation often encountered even in
Polish media.)
Cheers
Chris

-Original Message-
From: Sander [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 9:26 PM

To: Kristof Zelechovski
Subject: Re: [whatwg] Entity parsing


Kristof Zelechovski wrote:

A dieresis is not an umlaut, so I have to bite my tongue each time I
write or read nonsense like &iuml;.  It feels like lying.  Umlaut means
mixed, a dieresis means standalone.  Those are very different things,
and i can never get mixed, so there is no ambiguïty.  Since umlaut is
borrowed from German, I can see no problem in borrowing tréma from
French.  I personally prefer &itrema; to &idier; because of
readability, but I would not insist on that.



In professional typography, umlaut dots are usually a bit closer to the 
letter's body than the dots of the trema. In handwriting, however, no 
distinction is visible between the two. This is also true for most 
computer fonts and encodings.

[http://en.wikipedia.org/wiki/Umlaut_(diacritic)]

Are there any char-sets that have both umlaut and trema variations of 
characters? If so, both entities could exist.


cheers,
Sander


PS: I'd go for &itrema; instead of &idier; as well, as the term 
trema is also the one that's used in Dutch.



  


Re: [whatwg] Entity parsing

2007-06-23 Thread Allan Sandfeld Jensen
On Friday 15 June 2007 03:05, Ian Hickson wrote:
 On Sun, 5 Nov 2006, Øistein E. Andersen wrote:
  From section 9.2.3.1. Tokenising entities:
For some entities, UAs require a semicolon, for others they don't.
 
  This applies to IE.
 
  FWIW, the entities not requiring a semicolon are the ones encoding
  Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
  well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
  and REG). [...]

 I've defined the parsing and conformance requirements in a way that
 matches IE. As a side-effect, this has made things like na&iumlve
 actually conforming. I don't know if we want this. On the one hand, it's
 pragmatic (after all, why require the semicolon?), and is equivalent to
 not requiring quotes around attribute values. On the other, people don't
 want us to make the quotes optional either.

What about the Gecko entity parsing extension?

- IE consistently parses unterminated entities from Latin-1
- Gecko parses all unterminated entities, even those beyond Latin-1, but only 
in text content, not in attributes. (It seems my recent Firefox also supports 
the IE parsing in attributes now.)
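
For concreteness, a rough sketch of the lenient matching both engines
do (not the spec's tokeniser; the table and names are illustrative):

    // After an '&', take the longest known entity name that prefixes
    // the following characters, with or without a semicolon.
    var ENTITIES = { amp: '&', not: '\u00AC', notin: '\u2209' };
    function matchEntity(text, pos) { // pos = index just after '&'
      var best = null;
      for (var name in ENTITIES) {
        if (text.substr(pos, name.length) === name &&
            (best === null || name.length > best.length))
          best = name;
      }
      return best; // e.g. 'notin' beats 'not' in 'notina'
    }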

See the attached test-case.

`Allan



Test of HTML entities in quirky mode:


&amp;

&amp

&ample

&not;

&not

&notat

&notin;

&notin

&notina

&ge;

&ge

&gel


Test of entities in attributes:



Re: [whatwg] Entity parsing

2007-06-23 Thread Sam Ruby

On 6/14/07, Ian Hickson [EMAIL PROTECTED] wrote:

On Sun, 5 Nov 2006, Øistein E. Andersen wrote:

 From section 9.2.3.1. Tokenising entities:
   For some entities, UAs require a semicolon, for others they don't.

 This applies to IE.

 FWIW, the entities not requiring a semicolon are the ones encoding
 Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as
 well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT
 and REG). [...]

I've defined the parsing and conformance requirements in a way that
matches IE. As a side-effect, this has made things like na&iumlve
actually conforming. I don't know if we want this. On the one hand, it's
pragmatic (after all, why require the semicolon?), and is equivalent to
not requiring quotes around attribute values. On the other, people don't
want us to make the quotes optional either.


With the latest changes to html5lib, we get a failure on a test named
test_title_body_named_charref.

Before, A &mdash B == A — B, now A &mdash B == A &amp;mdash B.

Is that what we really want?  Testing with Firefox, the old behavior
is preferable.

- Sam Ruby


[whatwg] Canvas patterns, and miscellaneous other things

2007-06-23 Thread Philip Taylor

What should happen if you try drawing a 0x0-pixel repeating pattern?
(I can't find a way to make a 0x0 image that any browser will load,
but the spec says you can make a 0x0 canvas. Firefox and Opera can't
make a 0x0 canvas - it acts like it's 300x150 pixels instead. Safari
returns null from createPattern when it's 0x0.)
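
For reference, the 0x0 case as a test, reflecting the behaviour
observed above (illustrative only):

    var canvas = document.createElement('canvas');
    canvas.width = canvas.height = 0; // Firefox/Opera act as if 300x150
    var ctx = canvas.getContext('2d');
    // Safari reportedly returns null here for a 0x0 source:
    var pattern = ctx.createPattern(canvas, 'repeat');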


On a somewhat related note: What should canvas.width = canvas.height
= 0; canvas.toDataURL() do, given that you can never make a valid 0x0
PNG? (Firefox and Opera make the canvas 300x150 pixels instead, so you
can't actually get it that small. Safari can make it that small, but
doesn't implement toDataURL.)

Similarly, what should toDataURL do when the canvas is really large
and the browser doesn't want to give you a data URI? (Opera returns
'undefined' if it's >= 30001 pixels in any dimension, and crashes if
it's 3 in each dimension. Firefox (2 and trunk) crashes or hangs
on Linux if it's >= 32768 pixels in any dimension, and crashes on
Windows if it's >= 65536 pixels.)
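
In the meantime the best a script can do is guard the call, roughly
like this (a best-effort sketch; it obviously can't help against
outright crashes):

    var url = null;
    try {
      url = canvas.toDataURL('image/png');
    } catch (e) { /* in case some UA throws instead */ }
    if (typeof url !== 'string' || url.indexOf('data:') !== 0)
      url = null; // e.g. Opera's 'undefined' for huge canvases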

More generally, the spec says "If the user agent does not support the
requested type, it must return the image using the PNG format" - what
if it does support the requested type, but still doesn't want to give
you a data URI, e.g. because it's the wrong size (too large, too
small, not a multiple of 4, etc) or because of other environmental
factors (e.g. it wants you to do
getContext('vendor-2d').enableVectorCapture() before
toDataURL('image/svg+xml'))? (Presumably it would be some combination
of falling back to PNG (if you asked for something else), returning
undefined, and throwing exceptions.)


"If the empty string or null is specified, repeat must be assumed." -
why allow null, but not undefined or missing? (It would seem quite
reasonable for createPattern(img) to default to a repeating pattern).
(Currently all implementations throw exceptions for undefined/missing,
and Opera and Safari throw for null.)


'complete' for images is underspecified, so it's not possible to test
the related createPattern/drawImage requirements. (Is it set before
onload is called? Can it be set as soon as the Image() constructor
returns? Can it be set at an arbitrary point during execution of the
script that called the Image() constructor? Is it reset when you
change src? etc. Implementations all seem to disagree in lots of
ways.)
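
To make the ambiguity concrete (illustrative code, not an answer):

    var img = new Image();
    img.src = 'example.png';
    // May img.complete already be true here, synchronously?
    img.onload = function () {
      // Is img.complete guaranteed to be true by this point?
    };
    img.src = 'other.png';
    // Must img.complete reset to false when src changes?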


About radial gradients: "If x0 = x1 and y0 = y1 and r0 = r1, then the
radial gradient must paint nothing." - that conflicts with the
previous "must" for following the algorithm, so it's not precise about
which you must do. It should probably say "If ... then the radial
gradient must paint nothing. Otherwise, radial gradients must be
rendered by following these steps:".


<code title="dom-attr-complete">complete</code> (twice) - looks like
it should be dom-img-complete, so it points to #complete.

createPattern(image, repetition) - the parameters should be in <var>s.

The images are not be scaled by this process - s/be //

interface HTMLCanvasElement : HTMLElement {
 attribute unsigned long width;
 attribute unsigned long height;
^ incorrect indentation (should have two more spaces).

Somewhere totally unrelated:
interface HTMLDetailsElement : HTMLElement {
  attribute boolean open;
^ incorrect indentation (should have nine more spaces).

--
Philip Taylor
[EMAIL PROTECTED]


[whatwg] The issue of interoperability of the video element

2007-06-23 Thread Ivo Emanuel Gonçalves

Dear WHATWG members,

It has come to my attention that Apple developers behind the WebKit
platform, which powers the web browser Safari, apparently intend to
support the video element of the HTML 5 spec, section 3.14.7.  It's
all fine and well, but not a victory for web interoperability, as they
do not intend to follow the "User agents should support Theora video
and Vorbis audio, as well as the Ogg container format" part.  In their
own words: "should support in a spec does not denote a requirement.
We could have a perfectly suitable implementation of audio and video
as seen in this draft spec without having theora/vorbis codecs
available." [1]

What this means, in my opinion, is that they will push for QuickTime
video, in spite of the effort of the Opera developers to push Theora
forward as the de facto standard for web video.  Even if Mozilla and
the KDE team prepare their web browsers to support Theora, by choosing
to alienate it, Apple is allowing Microsoft to put WMV support alone
in their Internet Explorer, for if Apple, one of the big players,
shuns Theora, so will Microsoft.  Considering that Internet Explorer
is currently the web browser with the biggest market share, this will
force pretty much every web designer/programmer to stick to WMV
only.

As everyone is aware, WMV is not an open specification, nor a proper
documented video format.  Instead, it is heavily patented and locked
in one single vendor: Microsoft.  This will force vendors to either
pay a license to legally use WMV in their platforms, or to reverse
engineer support for it, infringing on software patents in certain
nations.

This message is mostly an open letter to the Apple developers behind
WebKit and to every other browser/UA developer.  Please, do not shun
Theora, or one of the following two things will happen:
1) either the video element will become irrelevant and unsuccessful,
which is a shame considering its potential to revolutionize the web,
2) or everyone will be locked into whatever new version of WMV Microsoft
releases in the following years--and expect some of these to be
incompatible with each other.

Best regards,
Ivo Emanuel Gonçalves

[1] http://bugs.webkit.org/show_bug.cgi?id=13708


[whatwg] Feature request: Provide event to detect url hash (named anchor) change

2007-06-23 Thread Agustín Fernández
Hi,

There is currently no way to detect a change in the url of a page other
than polling for changes in document.location.hash all the time (which
is slow and potentially complex, and doesn't always work in IE) or
listening for click events on all links (which doesn't catch changes not
started by clicking on links, such as clicking back and forward, or
changing the hash by hand).
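
For reference, the polling workaround looks roughly like this (a
sketch; onHashChange stands in for whatever handler the page supplies):

    var lastHash = document.location.hash;
    setInterval(function () {
      if (document.location.hash !== lastHash) {
        lastHash = document.location.hash;
        onHashChange(lastHash);
      }
    }, 100);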

Changing the hash in a page is useful to provide bookmarks and back and
forward functionality in Ajax-driven web applications which never fetch
a new page, and it is used extensively on the web.

I propose an urlchange (urlhashchange? hashchange? locationchange?)
event which would be dispatched by the BODY element whenever the hash
portion of the url changes.

You can see an example app which uses this in
http://mini.adondevamos.com/ (in Spanish).

I filed this in bugzilla:

https://bugzilla.mozilla.org/show_bug.cgi?id=385434




Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]

2007-06-23 Thread Øistein E. Andersen
Sander wrote:

 Are there any char-sets that have both umlaut and trema variations of 
 characters?

Unicode does not make the distinction, so this is somewhat unlikely.

(Personally, I tend to think that the apparent preference for umlaut dots closer
to the letter than trema dots can be linked to extrinsic phenomena like the
preference for steep accents in French typography.)

Kristof Zelechovski wrote:

 Only the vowel U can have either

This is not quite right.  All Latin vowels (a, e, i, o, u, y) can take
the trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French),
and a, o, u can all be umlauted (ä, ö, ü in German).

Moreover, the double-dot accent also has other uses (e.g., ä and ë both
designate a stressed schwa in Luxembourgeois), so it is probably not
advisable to attempt a complete classification in HTML.

-- 
Øistein E. Andersen

*) possibly only in the word capharnaüm (disregarding the highly unpopular
rectifications orthographiques of 1990) and in proper names
**) only in proper names