Re: [whatwg] Internal character encoding declaration, Drop UTF-32, and UTF and BOM terminology
On Sat, 11 Mar 2006, Henri Sivonen wrote:

> I think allowing in-place decoder change (when feasible) would be good for performance.

Done.

>>> I think it would be beneficial to additionally stipulate that 1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same bytes as in the US-ASCII encoding.
>>
>> Is this realistic? I'm not really familiar enough with character encodings to say if this is what happens in general.
>
> I suppose it is realistic. See below.

That was already there, turns out.

>>> 2. If there is no external character encoding information nor a BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte stream before the end of the meta element that declares the character encoding. (In practice this would ban unescaped non-ASCII class names on the html and head elements and non-ASCII comments at the beginning of the document.)
>>
>> Again, can we realistically require this? I need to do some studies of non-Latin pages, I guess.
>
> As UA behavior, no. As a conformance requirement, maybe.

I don't think we should require this, given the preparse step. I can if people think we should, though.

>>>> Authors should avoid including inline character encoding information. Character encoding information should instead be included at the transport level (e.g. using the HTTP Content-Type header).
>>>
>>> I disagree. With HTML with contemporary UAs, there is no real harm in including the character encoding information both on the HTTP level and in the meta as long as the information is not contradictory. On the contrary, the author-provided internal information is actually useful when end users save pages to disk using UAs that do not reserialize with internal character encoding information.
>>
>> ...and it breaks everything when you have a transcoding proxy, or similar.
>
> Well, not until you save to disk, since HTTP takes precedence. However, authors can escape this by using UTF-8.
> (Assuming here that tampering with UTF-8 would be harmful, wrong and pointless.) Interestingly, transcoding proxies tend to be brought up by residents of Western Europe, North America or the Commonwealth. I have never seen a Russian person living in Russia or a Japanese person living in Japan talk about transcoding proxies in any online or offline discussion. That's why I doubt the importance of transcoding proxies.

I think this discouragement has been removed now. Let me know if it lives on somewhere.

> Character encoding information shouldn't be duplicated, IMHO; that's just asking for trouble. I suggest a mismatch be considered an easy parse error and, therefore, reportable.

I believe this is required in the spec.

>>>> For HTML, user agents must use the following algorithm in determining the character encoding of a document: 1. If the transport layer specifies an encoding, use that.
>>>
>>> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32 makes no practical sense for interchange on the Web.)
>>
>> I don't know, should there?
>
> I believe there should.

There's a BOM step in the spec; let me know if you think it's in the wrong place.

>>>> 2. Otherwise, if the user agent can find a meta element that specifies character encoding information (as described above), then use that.
>>>
>>> If a conformance checker has not determined the character encoding by now, what should it do? Should it report the document as non-conforming (my preferred choice)? Should it default to US-ASCII and report any non-ASCII bytes as conformance errors? Should it continue to the fuzzier steps like browsers would (hopefully not)?
>>
>> Again, I don't know.
>
> I'll continue to treat such documents as non-conforming, then.

I've made it non-conforming to not use ASCII if you've got no encoding information and no BOM.

> Notably, character encodings that I am aware of and aren't ASCII-compatible are: JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
> The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages. After browsing the encoding menus of Firefox, Opera and Safari, I'm pretty confident that the legacy IBM codepages are irrelevant as well. I suggest the following algorithm as a starting point. It does not handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.

I've made those either MUST NOTs or SHOULD NOTs, amongst others.

> Set the REWIND flag to unraised.

The REWIND idea sadly doesn't work very well given that you can actually have things like javascript: URIs and event handlers that execute on content in the head, in pathological cases. However, I did something similar in the spec as it stands now.

> Requirements I'd like to see: Documents must specify a
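The precedence order the thread converges on (transport layer first, then BOM sniffing, then the meta prescan, then a default or a conformance error) can be sketched as follows. This is an illustrative approximation, not the spec's exact algorithm; `determineEncoding` and its parameter names are hypothetical.

```javascript
// Illustrative approximation of the encoding-determination order discussed
// above: transport layer, then BOM sniffing, then the meta prescan, then a
// default. determineEncoding and its parameters are hypothetical names.

// Sniff a byte-order mark from the first bytes of the stream (UTF-8 and
// UTF-16 only, per the discussion; UTF-32 is deliberately ignored).
function sniffBOM(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'UTF-8';
  }
  if (bytes.length >= 2) {
    if (bytes[0] === 0xFE && bytes[1] === 0xFF) return 'UTF-16BE';
    if (bytes[0] === 0xFF && bytes[1] === 0xFE) return 'UTF-16LE';
  }
  return null;
}

// transportEncoding would come from e.g. the HTTP Content-Type header;
// metaEncoding from a prescan of the byte stream for a meta declaration.
function determineEncoding(transportEncoding, bytes, metaEncoding) {
  if (transportEncoding) return transportEncoding; // 1. transport layer wins
  const bom = sniffBOM(bytes);
  if (bom) return bom;                             // 2. BOM sniffing
  if (metaEncoding) return metaEncoding;           // 3. meta prescan
  return null; // 4. here a checker could treat any non-ASCII bytes as errors
}
```

Note that putting BOM sniffing ahead of the meta prescan also answers the conformance-checker question: with no transport information, no BOM, and no meta declaration, the document can only safely contain ASCII bytes.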
Re: [whatwg] Entity parsing
I hadn't thought of that one ;-) (in Dutch there are no native words with umlauts, only some of German or Scandinavian descent). My question was about char-sets that contain both a trema version and a (separate) umlaut version of the same character. Are there any?

cheers,
Sander

Kristof Zelechovski schreef:

> Only the vowel U can have either, but I have not seen a valid example of &utrema;. The orthography ambigüe has recently been changed to ambiguë for consistency. Polish nauka (science) and German beurteilen would make good candidates, but the national rules of orthography do not allow this distinction, because Slavic languages do not have diphthongs except in borrowed words and it would cause ambiguity in German (cf. geübt). (Incidentally, this leads to bad pronunciation often encountered even in Polish media.)
>
> Cheers
> Chris
>
> -----Original Message-----
> From: Sander [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 22, 2007 9:26 PM
> To: Kristof Zelechovski
> Subject: Re: [whatwg] Entity parsing
>
> Kristof Zelechovski schreef:
>> A dieresis is not an umlaut, so I have to bite my tongue each time I write or read nonsense like &iuml;. It feels like lying. Umlaut means mixed, a dieresis means standalone. Those are very different things, and i can never get mixed, so there is no ambiguïty. Since umlaut is borrowed from German, I can see no problem in borrowing tréma from French. I personally prefer &itrema; to &idier; because of readability, but I would not insist on that.
>
> In professional typography, umlaut dots are usually a bit closer to the letter's body than the dots of the trema. In handwriting, however, no distinction is visible between the two. This is also true for most computer fonts and encodings. [http://en.wikipedia.org/wiki/Umlaut_(diacritic)] Are there any char-sets that have both umlaut and trema variations of characters? If so, both entities could exist.
>
> cheers,
> Sander
>
> PS: I'd go for &itrema; instead of &idier; as well, as the term trema is also the one that's used in Dutch.
Re: [whatwg] Entity parsing
On Friday 15 June 2007 03:05, Ian Hickson wrote:

> On Sun, 5 Nov 2006, Øistein E. Andersen wrote:
>> From section 9.2.3.1. Tokenising entities: "For some entities, UAs require a semicolon, for others they don't." This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...]
>
> I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either.

What about the Gecko entity parsing extension?
- IE consistently parses unterminated entities from Latin-1.
- Gecko parses all unterminated entities, even those beyond Latin-1, but only in text content, not in attributes. (Seems my recent Firefox also supports the IE parsing in attributes now.)

See the attached test case.

Allan

Test of HTML entities in quirky mode: &amp; &amp &ample &not; &not &notat &notin; &notin &notina &ge; &ge &gel Test of entities in attributes:
Re: [whatwg] Entity parsing
On 6/14/07, Ian Hickson [EMAIL PROTECTED] wrote:

> On Sun, 5 Nov 2006, Øistein E. Andersen wrote:
>> From section 9.2.3.1. Tokenising entities: "For some entities, UAs require a semicolon, for others they don't." This applies to IE. FWIW, the entities not requiring a semicolon are the ones encoding Latin-1 characters, the other HTML 3.2 entities (amp, gt and lt), as well as quot and the uppercase variants (AMP, COPY, GT, LT, QUOT and REG). [...]
>
> I've defined the parsing and conformance requirements in a way that matches IE. As a side-effect, this has made things like na&iumlve actually conforming. I don't know if we want this. On the one hand, it's pragmatic (after all, why require the semicolon?), and is equivalent to not requiring quotes around attribute values. On the other, people don't want us to make the quotes optional either.

With the latest changes to html5lib, we get a failure on a test named test_title_body_named_charref. Before, "A &mdash B" == "A — B"; now "A &mdash B" == "A &amp;mdash B". Is that what we really want? Testing with Firefox, the old behavior is preferable.

- Sam Ruby
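The IE-compatible rule Ian describes (a fixed set of mostly Latin-1 entity names recognized even without the trailing semicolon, while all other names require one) could be sketched like this. The two tables are a tiny hypothetical subset of the real lists, chosen just to exercise the cases from this thread.

```javascript
// Sketch of semicolon-optional entity handling as described above. The
// tables are a tiny hypothetical subset, not the full entity lists: iuml,
// not and amp stand in for the Latin-1/HTML 3.2 set, mdash for the
// semicolon-required rest.
const ENTITIES = { iuml: '\u00EF', not: '\u00AC', amp: '&', mdash: '\u2014' };
const NO_SEMICOLON_OK = new Set(['iuml', 'not', 'amp']);

function replaceEntities(text) {
  return text.replace(/&([a-zA-Z]+)(;?)/g, (match, name, semi) => {
    // Greedily match the longest known entity name that prefixes `name`.
    for (let len = name.length; len > 0; len--) {
      const candidate = name.slice(0, len);
      if (candidate in ENTITIES) {
        const terminated = semi !== '' && len === name.length;
        if (terminated || NO_SEMICOLON_OK.has(candidate)) {
          // Unconsumed trailing letters (and semicolon) stay as text.
          return ENTITIES[candidate] + name.slice(len) + (terminated ? '' : semi);
        }
      }
    }
    return match; // no known name matched: leave the source text alone
  });
}
```

Under this rule "na&iumlve" becomes "naïve" and "&notin" (unterminated) becomes "¬in", matching the IE behavior in Allan's test case, while "A &mdash B" is left untouched, which is exactly the html5lib change Sam observed.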
[whatwg] Canvas patterns, and miscellaneous other things
What should happen if you try drawing a 0x0-pixel repeating pattern? (I can't find a way to make a 0x0 image that any browser will load, but the spec says you can make a 0x0 canvas. Firefox and Opera can't make a 0x0 canvas - it acts like it's 300x150 pixels instead. Safari returns null from createPattern when it's 0x0.)

On a somewhat related note: what should canvas.width = canvas.height = 0; canvas.toDataURL() do, given that you can never make a valid 0x0 PNG? (Firefox and Opera make the canvas 300x150 pixels instead, so you can't actually get it that small. Safari can make it that small, but doesn't implement toDataURL.)

Similarly, what should toDataURL do when the canvas is really large and the browser doesn't want to give you a data URI? (Opera returns 'undefined' if it's >= 30001 pixels in any dimension, and crashes if it's 3 in each dimension. Firefox (2 and trunk) crashes or hangs on Linux if it's >= 32768 pixels in any dimension, and crashes on Windows if it's >= 65536 pixels.)

More generally, the spec says "If the user agent does not support the requested type, it must return the image using the PNG format" - what if it does support the requested type, but still doesn't want to give you a data URI, e.g. because it's the wrong size (too large, too small, not a multiple of 4, etc.) or because of other environmental factors (e.g. it wants you to do getContext('vendor-2d').enableVectorCapture() before toDataURL('image/svg+xml'))? (Presumably it would be some combination of falling back to PNG (if you asked for something else), returning undefined, and throwing exceptions.)

"If the empty string or null is specified, repeat must be assumed." - why allow null, but not undefined or missing? (It would seem quite reasonable for createPattern(img) to default to a repeating pattern.) (Currently all implementations throw exceptions for undefined/missing, and Opera and Safari throw for null.)
'complete' for images is underspecified, so it's not possible to test the related createPattern/drawImage requirements. (Is it set before onload is called? Can it be set as soon as the Image() constructor returns? Can it be set at an arbitrary point during execution of the script that called the Image() constructor? Is it reset when you change src? etc. Implementations all seem to disagree in lots of ways.)

About radial gradients: "If x0 = x1 and y0 = y1 and r0 = r1, then the radial gradient must paint nothing." - that conflicts with the previous "must" for following the algorithm, so it's not precise about which you must do. It should probably say "If ... then the radial gradient must paint nothing. Otherwise, radial gradients must be rendered by following these steps:".

<code title=dom-attr-complete>complete</code> (twice) - looks like it should be dom-img-complete, so it points to #complete.

createPattern(image, repetition) - the parameters should be in <var>s.

"The images are not be scaled by this process" - s/be //

interface HTMLCanvasElement : HTMLElement {
 attribute unsigned long width;
 attribute unsigned long height;
^ incorrect indentation (should have two more spaces).

Somewhere totally unrelated:

interface HTMLDetailsElement : HTMLElement {
 attribute boolean open;
^ incorrect indentation (should have nine more spaces).

--
Philip Taylor
[EMAIL PROTECTED]
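Until the spec pins down the 0x0 and oversized cases, a page script can at least guard its own toDataURL calls. This is a hypothetical application-side check: `canSerializeCanvas` is not a real API, and `MAX_DIMENSION` is an arbitrary app-chosen cap, not a limit from the spec or any browser.

```javascript
// Hypothetical defensive guard for canvas.toDataURL(), motivated by the
// 0x0-PNG impossibility and the large-canvas crashes described above.
// MAX_DIMENSION is an arbitrary application-chosen cap, not a spec value.
const MAX_DIMENSION = 8192;

function canSerializeCanvas(width, height) {
  if (!Number.isInteger(width) || !Number.isInteger(height)) return false;
  if (width < 1 || height < 1) return false; // a valid 0x0 PNG cannot exist
  return width <= MAX_DIMENSION && height <= MAX_DIMENSION;
}

// Usage sketch (canvas being an HTMLCanvasElement in a real page):
//   if (canSerializeCanvas(canvas.width, canvas.height)) {
//     const url = canvas.toDataURL('image/png');
//   }
```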
[whatwg] The issue of interoperability of the video element
Dear WHATWG members,

It has come to my attention that the Apple developers behind the WebKit platform, which powers the web browser Safari, apparently intend to support the video element of the HTML 5 spec, section 3.14.7. That is all fine and well, but not a victory for web interoperability, as they do not intend to follow the "User agents should support Theora video and Vorbis audio, as well as the Ogg container format" part. In their own words: "should support in a spec does not denote a requirement. We could have a perfectly suitable implementation of audio and video as seen in this draft spec without having theora/vorbis codecs available."[1]

What this means, in my opinion, is that they will push for QuickTime video, in spite of the effort of the Opera developers to push Theora forward as the de facto standard for web video. Even if Mozilla and the KDE team prepare their web browsers to support Theora, by choosing to alienate it, Apple is allowing Microsoft to put WMV support alone in their Internet Explorer, for if Apple, one of the big players, shuns Theora, so will Microsoft. Considering the statistics, Internet Explorer being currently the web browser with the biggest market share, this will force pretty much every web designer/programmer to stick to WMV only.

As everyone is aware, WMV is not an open specification, nor a properly documented video format. Instead, it is heavily patented and locked into one single vendor: Microsoft. This will force vendors to either pay a license to legally use WMV in their platforms, or to reverse engineer support for it, infringing on software patents in certain nations.

This message is mostly an open letter to the Apple developers behind WebKit and to every other browser/UA developer.
Please, do not shun Theora, or one of the following two things will happen: 1) either the video element will become irrelevant and unsuccessful, which is a shame considering its potential to revolutionize the web, 2) or everyone will be locked into whatever new version of WMV Microsoft releases in the following years--and expect some of those to be incompatible with each other.

Best regards,
Ivo Emanuel Gonçalves

[1] http://bugs.webkit.org/show_bug.cgi?id=13708
[whatwg] Feature request: Provide event to detect url hash (named anchor) change
Hi,

There is currently no way to detect a change in the url of a page other than polling for changes in document.location.hash all the time (which is slow and potentially complex, and doesn't always work in IE) or listening for click events on all links (which doesn't catch changes not started by clicking on links, such as clicking back and forward, or changing the hash by hand). Changing the hash in a page is useful for providing bookmarks and back and forward functionality in ajax-driven web applications which never fetch a new page, and is used extensively on the web.

I propose an urlchange (urlhashchange? hashchange? locationchange?) event which would be dispatched by the BODY element whenever the hash portion of the url changes.

You can see an example app which uses this in http://mini.adondevamos.com/ (in Spanish). I filed this in bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=385434 .

A.
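The polling workaround the proposed event would replace can be sketched as below. `makeHashWatcher` is a hypothetical helper name; it takes a function returning the current hash so the comparison logic can run outside a browser, and in a real page you would pass `() => document.location.hash` and drive `check` from setInterval.

```javascript
// Sketch of the hash-polling workaround described in the message above.
// makeHashWatcher is a hypothetical helper; getHash returns the current
// hash string, onChange is called with (newHash, oldHash) on a change.
function makeHashWatcher(getHash, onChange) {
  let last = getHash();
  // Compare the current hash against the last seen value and fire
  // onChange when it differs; meant to be called repeatedly by a timer.
  function check() {
    const current = getHash();
    if (current !== last) {
      const previous = last;
      last = current;
      onChange(current, previous);
    }
  }
  return { check };
}

// In a real page, this is the polling the proposed event would remove:
//   const watcher = makeHashWatcher(() => document.location.hash, updateView);
//   setInterval(watcher.check, 100);
```

A native event would avoid both the timer cost and the latency of the polling interval, and would fire for back/forward navigation as well.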
Re: [whatwg] Entity parsing [trema/diæresis vs umlaut]
Sander wrote:

> Are there any char-sets that have both umlaut and trema variations of characters?

Unicode does not make the distinction, so this is somewhat unlikely. (Personally, I tend to think that the apparent preference for umlaut dots closer to the letter than trema dots can be linked to extrinsic phenomena like the preference for steep accents in French typography.)

Kristof Zelechovski wrote:

> Only the vowel U can have either

This is not quite right. All Latin vowels (a, e, i, o, u, y) can take the trema/diæresis (ä, ë, ï, ö, ü in Dutch; ë, ï, ü*, ÿ** in French), and a, o, u can all be umlauted (ä, ö, ü in German). Moreover, the double-dot accent also has other uses (e.g., ä and ë both designate a stressed schwa in Luxembourgeois), so it is probably not advisable to attempt a complete classification in HTML.

-- Øistein E. Andersen

*) possibly only in the word capharnaüm (disregarding the highly unpopular rectifications orthographiques of 1990) and in proper names
**) only in proper names