Re: [whatwg] Byte-wise tokenization algorithm
On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts that, please let me know.)

Indeed it is possible (or at least it certainly was a year and a half ago, and I have seen nothing change that would stop it).

>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation, according to the spec, must at a minimum support the UTF-8 and Windows-1252 encodings, so the overall implementation might not depending on exactly how this is done.

That should be no problem: just convert Windows-1252 to UTF-8 using strtr(). Because Windows-1252 is a single-byte character set (SBCS), this direction is simple; the inverse is not. See the attached file. Then all you need to do is normalize the character set name so that all aliases of Windows-1252 and UTF-8 are matched, and map ISO-8859-1 and US-ASCII (and all their aliases) to Windows-1252. http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php does that (the only dependency is for fetching the file via HTTP, which can be replaced with cURL if you would rather require only that).
-- 
Geoffrey Sneddon
http://gsnedders.com/

<?php
/**
 * Converts a Windows-1252 encoded string to a UTF-8 encoded string
 *
 * @copyright 2008 Geoffrey Sneddon
 * @license http://www.opensource.org/licenses/bsd-license.php BSD License
 * @param string $string Windows-1252 encoded string
 * @return string UTF-8 encoded string
 */
function windows_1252_to_utf8($string)
{
    // Bytes below \x80 are ASCII and identical in UTF-8, so only the
    // upper half needs mapping. Unassigned bytes map to U+FFFD.
    static $convert_table = array(
        "\x80" => "\xE2\x82\xAC", "\x81" => "\xEF\xBF\xBD", "\x82" => "\xE2\x80\x9A", "\x83" => "\xC6\x92",
        "\x84" => "\xE2\x80\x9E", "\x85" => "\xE2\x80\xA6", "\x86" => "\xE2\x80\xA0", "\x87" => "\xE2\x80\xA1",
        "\x88" => "\xCB\x86", "\x89" => "\xE2\x80\xB0", "\x8A" => "\xC5\xA0", "\x8B" => "\xE2\x80\xB9",
        "\x8C" => "\xC5\x92", "\x8D" => "\xEF\xBF\xBD", "\x8E" => "\xC5\xBD", "\x8F" => "\xEF\xBF\xBD",
        "\x90" => "\xEF\xBF\xBD", "\x91" => "\xE2\x80\x98", "\x92" => "\xE2\x80\x99", "\x93" => "\xE2\x80\x9C",
        "\x94" => "\xE2\x80\x9D", "\x95" => "\xE2\x80\xA2", "\x96" => "\xE2\x80\x93", "\x97" => "\xE2\x80\x94",
        "\x98" => "\xCB\x9C", "\x99" => "\xE2\x84\xA2", "\x9A" => "\xC5\xA1", "\x9B" => "\xE2\x80\xBA",
        "\x9C" => "\xC5\x93", "\x9D" => "\xEF\xBF\xBD", "\x9E" => "\xC5\xBE", "\x9F" => "\xC5\xB8",
        "\xA0" => "\xC2\xA0", "\xA1" => "\xC2\xA1", "\xA2" => "\xC2\xA2", "\xA3" => "\xC2\xA3",
        "\xA4" => "\xC2\xA4", "\xA5" => "\xC2\xA5", "\xA6" => "\xC2\xA6", "\xA7" => "\xC2\xA7",
        "\xA8" => "\xC2\xA8", "\xA9" => "\xC2\xA9", "\xAA" => "\xC2\xAA", "\xAB" => "\xC2\xAB",
        "\xAC" => "\xC2\xAC", "\xAD" => "\xC2\xAD", "\xAE" => "\xC2\xAE", "\xAF" => "\xC2\xAF",
        "\xB0" => "\xC2\xB0", "\xB1" => "\xC2\xB1", "\xB2" => "\xC2\xB2", "\xB3" => "\xC2\xB3",
        "\xB4" => "\xC2\xB4", "\xB5" => "\xC2\xB5", "\xB6" => "\xC2\xB6", "\xB7" => "\xC2\xB7",
        "\xB8" => "\xC2\xB8", "\xB9" => "\xC2\xB9", "\xBA" => "\xC2\xBA", "\xBB" => "\xC2\xBB",
        "\xBC" => "\xC2\xBC", "\xBD" => "\xC2\xBD", "\xBE" => "\xC2\xBE", "\xBF" => "\xC2\xBF",
        "\xC0" => "\xC3\x80", "\xC1" => "\xC3\x81", "\xC2" => "\xC3\x82", "\xC3" => "\xC3\x83",
        "\xC4" => "\xC3\x84", "\xC5" => "\xC3\x85", "\xC6" => "\xC3\x86", "\xC7" => "\xC3\x87",
        "\xC8" => "\xC3\x88", "\xC9" => "\xC3\x89", "\xCA" => "\xC3\x8A", "\xCB" => "\xC3\x8B",
        "\xCC" => "\xC3\x8C", "\xCD" => "\xC3\x8D", "\xCE" => "\xC3\x8E", "\xCF" => "\xC3\x8F",
        "\xD0" => "\xC3\x90", "\xD1" => "\xC3\x91", "\xD2" => "\xC3\x92", "\xD3" => "\xC3\x93",
        "\xD4" => "\xC3\x94", "\xD5" => "\xC3\x95", "\xD6" => "\xC3\x96", "\xD7" => "\xC3\x97",
        "\xD8" => "\xC3\x98", "\xD9" => "\xC3\x99", "\xDA" => "\xC3\x9A", "\xDB" => "\xC3\x9B",
        "\xDC" => "\xC3\x9C", "\xDD" => "\xC3\x9D", "\xDE" => "\xC3\x9E", "\xDF" => "\xC3\x9F",
        "\xE0" => "\xC3\xA0", "\xE1" => "\xC3\xA1", "\xE2" => "\xC3\xA2", "\xE3" => "\xC3\xA3",
        "\xE4" => "\xC3\xA4", "\xE5" => "\xC3\xA5", "\xE6" => "\xC3\xA6", "\xE7" => "\xC3\xA7",
        "\xE8" => "\xC3\xA8", "\xE9" => "\xC3\xA9", "\xEA" => "\xC3\xAA", "\xEB" => "\xC3\xAB",
        "\xEC" => "\xC3\xAC", "\xED" => "\xC3\xAD", "\xEE" => "\xC3\xAE", "\xEF" => "\xC3\xAF",
        "\xF0" => "\xC3\xB0", "\xF1" => "\xC3\xB1", "\xF2" => "\xC3\xB2", "\xF3" => "\xC3\xB3",
        "\xF4" => "\xC3\xB4", "\xF5" => "\xC3\xB5", "\xF6" => "\xC3\xB6", "\xF7" => "\xC3\xB7",
        "\xF8" => "\xC3\xB8", "\xF9" => "\xC3\xB9", "\xFA" => "\xC3\xBA",
        // The archived attachment is cut off after \xFA. The entries below
        // and the strtr() call are reconstructed from the message text.
        "\xFB" => "\xC3\xBB", "\xFC" => "\xC3\xBC", "\xFD" => "\xC3\xBD", "\xFE" => "\xC3\xBE",
        "\xFF" => "\xC3\xBF",
    );

    return strtr($string, $convert_table);
}
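The two steps described in this message (normalize the encoding label, then convert Windows-1252 to UTF-8) can also be sketched outside PHP. What follows is a minimal illustration in Python, not part of the original post: `normalize_label`, `to_utf8` and the alias table are hypothetical names, and the alias list is deliberately incomplete. It relies on Python's `cp1252` codec, which leaves the same five bytes unassigned that the table above maps to U+FFFD.

```python
# Hypothetical alias table; a real one would cover every registered alias.
_ALIASES = {
    "utf-8": "utf-8", "utf8": "utf-8",
    "windows-1252": "windows-1252", "cp1252": "windows-1252",
    # ISO-8859-1 and US-ASCII are treated as Windows-1252, as in the message.
    "iso-8859-1": "windows-1252", "latin1": "windows-1252",
    "us-ascii": "windows-1252", "ascii": "windows-1252",
}


def normalize_label(label):
    """Return 'utf-8', 'windows-1252', or None for an unsupported label."""
    return _ALIASES.get(label.strip().lower())


def to_utf8(data, label):
    """Re-encode bytes in a supported encoding as UTF-8 bytes."""
    name = normalize_label(label)
    if name is None:
        raise ValueError("unsupported encoding label: %r" % label)
    codec = "cp1252" if name == "windows-1252" else "utf-8"
    # errors="replace" turns unassigned or invalid bytes into U+FFFD.
    return data.decode(codec, errors="replace").encode("utf-8")
```

For example, `to_utf8(b"\x80", "ISO-8859-1")` returns `b"\xe2\x82\xac"` (the euro sign), matching the first entry of the PHP table.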
Re: [whatwg] Thoughts on HTML 5
Ian Hickson wrote:
> Deprecating HTML thus seems like vain effort. (We already tried over the past few years with XHTML 1.x, and it didn't work.)

I'd say it _did_ work. :-)

Philipp Kempgen
Re: [whatwg] Byte-wise tokenization algorithm
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson <i...@hixie.ch> wrote:
> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>> 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts that, please let me know.)

I think there are some cases where it still should work but you might have to be a little careful. For example, <table>foo notionally results in three parse errors according to the spec (one for each character token which gets foster-parented), so <table>☹ results in one if you work with Unicode characters but three if you treat each UTF-8 byte as a separate character token. But in practice, tokenisers emit sequence-of-many-characters tokens instead of single-character tokens, so they only emit one parse error for <table>foo, and the html5lib test cases assume that behaviour, and it should work identically if you have sequence-of-many-bytes tokens instead.

(Apparently only the distinction between 0 and more-than-0 parse errors is important as far as the spec is concerned, since that has an effect on whether the document is conforming; but it seems useful for implementors to share test cases that are precise about exactly where all the parse errors are emitted, since that helps find bugs, and so the parse error count is relevant.)

-- 
Philip Taylor
exc...@gmail.com
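To make the counting argument concrete, here is a toy model in Python. It is not from the thread and not the spec's algorithm: `parse_errors` and its flags are hypothetical, and it simply assumes one parse error per emitted character token for text that ends up foster-parented out of a table.

```python
def parse_errors(text, bytewise, coalesce):
    """Count parse errors for text foster-parented out of a <table>.

    Toy model: one error per emitted character token.
    bytewise: treat every UTF-8 byte as its own "character".
    coalesce: emit one token per run instead of one per unit.
    """
    units = text.encode("utf-8") if bytewise else text
    if not units:
        return 0
    return 1 if coalesce else len(units)
```

With single-unit tokens, `parse_errors("☹", bytewise=False, coalesce=False)` gives 1 while `parse_errors("☹", bytewise=True, coalesce=False)` gives 3, because U+2639 occupies three bytes in UTF-8. With coalesced tokens both variants give 1, which is the behaviour the message says the html5lib test cases assume.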
Re: [whatwg] Byte-wise tokenization algorithm
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given algorithm.

>> But an HTML5 implementation, according to the spec, must at a minimum support the UTF-8 and Windows-1252 encodings, so the overall implementation might not depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with a reasonably good iconv implementation, support for lots of encodings is possible. The implementation might not be fully conforming if iconv doesn't perform the proper (possibly context-sensitive; I haven't checked) substitution when it doesn't recognize a character, but it should be close.

I've never seen any way of getting iconv (at least via PHP) to do what HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is, however, possible using mbstring (which also has the advantage of not being system dependent), as well as with PHP6's Unicode support.

-- 
Geoffrey Sneddon
http://gsnedders.com/
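The replacement behaviour mentioned here (invalid bytes become U+FFFD before tokenization) can be illustrated in Python. This is a minimal sketch, not from the thread, and it assumes Python's built-in "replace" error handler is an acceptable stand-in for what HTML 5 requires; `sanitize_utf8` is a hypothetical name, and how many U+FFFD characters a malformed multi-byte sequence should yield is not checked here.

```python
def sanitize_utf8(data):
    """Return UTF-8 bytes with every undecodable byte replaced by U+FFFD.

    The output is always valid UTF-8, so a byte-wise tokenizer can
    consume it without further validation.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

The byte 0xFF can never occur in valid UTF-8, so `sanitize_utf8(b"a\xffb")` returns `b"a\xef\xbf\xbdb"`, while input that is already valid comes back unchanged.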
Re: [whatwg] Thoughts on HTML 5
On Sunday, 21.12.2008, at 17:54 +0100, Philipp Kempgen wrote:
> Ian Hickson wrote:
>> Deprecating HTML thus seems like vain effort. (We already tried over the past few years with XHTML 1.x, and it didn't work.)
>
> I'd say it _did_ work. :-)

I'd say so too: the worst abominations have disappeared (for new sites, that is) through being deprecated, the <font> element, for example, or frames.

Fact: deprecating stuff takes it out of (X)HTML books, and how-tos like Selfhtml warn against it, thus ensuring less use by novices.

Does anyone remember <marquee>?

Cheers
-- 
Nils Dagsson Moskopp
http://dieweltistgarnichtso.net
Re: [whatwg] Thoughts on HTML 5
Please note: all of the following is my personal, humble opinion.

As I discovered lately, the main problem of HTML5 is that its design is oriented towards keeping features that are already distributed across browsers, that work, or that are a simple way to solve a big problem. As a result, it is a bunch of different features that are not really integrated with one another. Programmers (please note, I use the word programmer, not author or web designer) developing *new* applications may instead prefer a more structured and logical organization, like XHTML modularization.

HTML5 features, summed up in big groups, are (in spec order):

1) common syntax for the most used datatypes
2) additional DOM interfaces, which include HTMLElement, HTMLCollection, HTMLFormControlsCollection, HTMLOptionsCollection, DOMTokenList, DOMStringMap
3) Elements and Content Models
4) Element types: metadata, structure, sectioning, grouping, text, editing, embedding, table, forms, interactive, scripting elements
5) User agent requirements
6) User Interaction
7) Communication
8) HTML Syntax

Some of these features can be achieved without any of HTML5, for example:

1) use XMLSchema datatypes
2) you don't need HTMLElement: markup insertion and attribute querying can be done using DOM3Core (which in the latest browsers is even faster, as no parser is involved), events are far better handled by DOM3Events, and styling is covered by CSSOM. You don't need collections either: just use the appropriate DOMNodeLists, while for DOMStringMap you may use binding-specific features (all Objects are hash maps in ECMAScript3); it works this way even in HTML5
3) use XHTML2, which is extensible because it is modularized
4) metadata is better handled by the XHTML2 Meta Attributes module, which fully integrates the RDF module in any element; structure, sectioning and grouping are the same; text is very similar: you don't have time, but you can have <span datatype="xsd:date" content="2008-12-21">Today</span> where in HTML5 you have <time value="2008-12-21">Today</time>; for progress and meter semantics you can use the role attribute (for styling you always use CSS); editing is the same, but you have an attribute instead of an element, so you don't have the issue that ins and del can contain everything, even a whole document (not including html); embedding is much more powerful, as any element can be replaced by embedded content; tables are the same (you don't have the tables API, but you can still use DOM3Core); XForms are actually more powerful than WebForms2, since you separate presentation from data from action (which is implemented declaratively); interactive elements are not needed at all: details is better implemented as it is now (ECMAScript3 + CSS3), and datagrid is just a way to put data in a tree model, so use plain XML for that; command and a are implemented in XHTML2 on any element using the href attribute; menu is mostly a ul with some style; scripting uses XMLEvents and handler: it looks the same, but it is different as it is more event oriented (scripts are not executed by default, they're executed when some event fires)
8) HTML syntax: as I said before, use XML for that

There are, instead, features that are indeed very useful for developing a web application but are not achievable by means other than HTML5:

1) some way to interact with object (please note: object, not embed: object is for plugins, embed for content): actually this can be done using something like cross document messaging, assuming that object creates a new browsing context (it already does if the target is text/html or application/xhtml+xml), but we need a specification for the message syntax
2) the binding-specific global scope (that is, which objects are available in all scopes, if the binding supports this); this is normally the window object, but scripts use certain features only on their own browsing context, so those may be moved to the global scope, removing the whole window object from scope (in current JavaScript you can write window.window.window.window.window... and get the same as nothing)
3) the Window object (which includes window name, window location, cross document messaging, dialog windows)
4) Protocol and Content Handlers
5) Session and Local storage
6) Database storage
7) Drag and Drop
8) WebSockets

What I am asking now is to modularize HTML: copy those features into separate, interoperable modules, removing legacy features (like the window.on-whatever event listeners). A copy of those will remain in HTML5, because browsers implement them at the moment, and the HTML5 goal is that all browsers implement the same things in the same ways. In the future, however, some web developers will find that a modularized and less redundant API is more usable, as I personally do, and will switch to that, without mixing it with HTML5. Actually, I wonder what a Database API is doing inside HTML.

Best regards,
Giovanni Campagna
Re: [whatwg] Thoughts on HTML 5
Hi Giovanni,

I haven't read your entire comment, but your point eight will break backwards compatibility. As far as I know, HTML5 is supposed to combine old and new. The problem with interfaces is that you cannot simply change them. That's just a fact we have to deal with.

jorgen

On Dec 21, 2008, at 7:12 PM, Giovanni Campagna wrote:
> Please Note: all the following is my personal humble opinion.
> [...]
Re: [whatwg] Thoughts on HTML 5
On 21/12/08 17:22, Nils Dagsson Moskopp wrote:
> On Sunday, 21.12.2008, at 17:54 +0100, Philipp Kempgen wrote:
>> Ian Hickson wrote:
>>> Deprecating HTML thus seems like vain effort. (We already tried over the past few years with XHTML 1.x, and it didn't work.)
>>
>> I'd say it _did_ work. :-)
>
> I'd say too: The worst abominations have disappeared (for new sites, that is). the <font> element, for example, or frames through deprecating them.

You're assuming that's an indication of the power of specifications rather than of actual advantages to using CSS or avoiding frames.

What mostly failed, and which Hixie is referring to, was the attempt to move the web from a tag soup (text/html) basis to an XML (application/xhtml+xml) basis. Perhaps that's because the advantages of the latter were not persuasive. As I've argued elsewhere in the thread, there's money in staying with text/html.

> Does anyone remember <marquee> ?

That's a bad example. MARQUEE was never standardized in a specification, so it was never possible to deprecate it.

-- 
Benjamin Hawkes-Lewis
Re: [whatwg] Thoughts on HTML 5
On Sun, Dec 21, 2008 at 10:12 AM, Giovanni Campagna <scampa.giova...@gmail.com> wrote:
> Please Note: all the following is my personal humble opinion.

> parser is involved), events are far better handled by DOM3Events, styling is included by CSSOM

Styling is done in CSS. I don't have time to go into all the problems with CSSOM here. Shortcomings of the CSSOM 'Views' module were discussed on www-style. 'Views' is not the only CSSOM module that has problems.

> you don't need collection either: just use appropriate DOMNodeLists, while for DOMStringMap you may use binding specific features (all Object are hash maps in ECMAScript3): it works this way even in HTML5

Where are you getting this information?

> but scripts use certain features only on their own browsing context, so that may be moved from that to global scope, removing the whole window object from scope (for current javascript you can write window.window.window.window.window... and get the same as nothing)

The closest definition to 'nothing' would be the value undefined. I do not know of a browser where

    window.window.window === undefined

is true by default. I get window.

A relevant example would be useful. The closest thing we got to an example of invalid HTML is TJ's post about the jQuery validation plugin. If you click through, there is a demo using a custom minlength attribute. The attribute may have the effect the author wanted it to have in the set of browsers he is concerned about. That effect and the set of browsers could be more clearly demonstrated in an example that shows only that, as well as edge cases where results may vary.

If you can't define clearly what can be reasonably expected of a piece of (invalid) code, then nothing can be reasonably expected of it. It's not good to write code that can't have an expected outcome.

> Best regards,
> Giovanni Campagna
Re: [whatwg] Byte-wise tokenization algorithm
On Sun, 21 Dec 2008, Edward Z. Yang wrote:
> I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation, no?

Right; conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent.

-- 
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
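The "equivalent end result" point can be checked on a small scale. The sketch below, in Python, is not the HTML 5 tokenization algorithm: `split_markup` is a hypothetical toy that only splits input at "<" and ">". It illustrates why matching ASCII delimiters byte-wise is safe on valid UTF-8: every byte of a multi-byte sequence is 0x80 or above, so an ASCII byte never appears inside another character.

```python
def split_markup(data):
    """Split input into ('tag', x) and ('text', x) chunks.

    Accepts either str (character-wise) or bytes (byte-wise); the logic
    is identical, only the delimiter type differs.
    """
    lt, gt = ("<", ">") if isinstance(data, str) else (b"<", b">")
    out, pos = [], 0
    while pos < len(data):
        start = data.find(lt, pos)
        if start == -1:            # no more tags: the rest is text
            out.append(("text", data[pos:]))
            break
        if start > pos:            # text before the next tag
            out.append(("text", data[pos:start]))
        end = data.find(gt, start)
        if end == -1:              # unterminated tag: treat as text
            out.append(("text", data[start:]))
            break
        out.append(("tag", data[start:end + 1]))
        pos = end + 1
    return out
```

Running it on `"<p>café ☹</p>"` and on the same string encoded as UTF-8 yields the same three chunks once the byte chunks are decoded, even though the byte offsets differ from the character offsets.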