Re: [whatwg] Byte-wise tokenization algorithm
On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts that, please let me know.)

Indeed it is possible (or at least it certainly was a year and a half ago, and I have seen nothing change that would stop it).

>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation, according to the spec, must at a minimum support the UTF-8 and Windows-1252 encodings, so the overall implementation might not be, depending on exactly how this is done.

That should be no problem: just convert Windows-1252 to UTF-8 using strtr() (as it is an SBCS this is simple enough; doing the inverse is not); see the attached file. Then all you need to do is normalize the character set name to match all aliases of Windows-1252 and UTF-8, as well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to Windows-1252. http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php does that (its only dependency is for fetching the file via HTTP, which can be replaced with cURL if you would rather require only that).
--
Geoffrey Sneddon
http://gsnedders.com/

<?php
/**
 * Converts a Windows-1252 encoded string to a UTF-8 encoded string
 *
 * @copyright 2008 Geoffrey Sneddon
 * @license http://www.opensource.org/licenses/bsd-license.php BSD License
 * @param string $string Windows-1252 encoded string
 * @return string UTF-8 encoded string
 */
function windows_1252_to_utf8($string)
{
    // Bytes 0x00-0x7F are identical in both encodings, so only 0x80-0xFF are listed.
    // The five bytes Windows-1252 leaves undefined map to U+FFFD ("\xEF\xBF\xBD").
    static $convert_table = array(
        "\x80" => "\xE2\x82\xAC", "\x81" => "\xEF\xBF\xBD", "\x82" => "\xE2\x80\x9A", "\x83" => "\xC6\x92",
        "\x84" => "\xE2\x80\x9E", "\x85" => "\xE2\x80\xA6", "\x86" => "\xE2\x80\xA0", "\x87" => "\xE2\x80\xA1",
        "\x88" => "\xCB\x86", "\x89" => "\xE2\x80\xB0", "\x8A" => "\xC5\xA0", "\x8B" => "\xE2\x80\xB9",
        "\x8C" => "\xC5\x92", "\x8D" => "\xEF\xBF\xBD", "\x8E" => "\xC5\xBD", "\x8F" => "\xEF\xBF\xBD",
        "\x90" => "\xEF\xBF\xBD", "\x91" => "\xE2\x80\x98", "\x92" => "\xE2\x80\x99", "\x93" => "\xE2\x80\x9C",
        "\x94" => "\xE2\x80\x9D", "\x95" => "\xE2\x80\xA2", "\x96" => "\xE2\x80\x93", "\x97" => "\xE2\x80\x94",
        "\x98" => "\xCB\x9C", "\x99" => "\xE2\x84\xA2", "\x9A" => "\xC5\xA1", "\x9B" => "\xE2\x80\xBA",
        "\x9C" => "\xC5\x93", "\x9D" => "\xEF\xBF\xBD", "\x9E" => "\xC5\xBE", "\x9F" => "\xC5\xB8",
        "\xA0" => "\xC2\xA0", "\xA1" => "\xC2\xA1", "\xA2" => "\xC2\xA2", "\xA3" => "\xC2\xA3",
        "\xA4" => "\xC2\xA4", "\xA5" => "\xC2\xA5", "\xA6" => "\xC2\xA6", "\xA7" => "\xC2\xA7",
        "\xA8" => "\xC2\xA8", "\xA9" => "\xC2\xA9", "\xAA" => "\xC2\xAA", "\xAB" => "\xC2\xAB",
        "\xAC" => "\xC2\xAC", "\xAD" => "\xC2\xAD", "\xAE" => "\xC2\xAE", "\xAF" => "\xC2\xAF",
        "\xB0" => "\xC2\xB0", "\xB1" => "\xC2\xB1", "\xB2" => "\xC2\xB2", "\xB3" => "\xC2\xB3",
        "\xB4" => "\xC2\xB4", "\xB5" => "\xC2\xB5", "\xB6" => "\xC2\xB6", "\xB7" => "\xC2\xB7",
        "\xB8" => "\xC2\xB8", "\xB9" => "\xC2\xB9", "\xBA" => "\xC2\xBA", "\xBB" => "\xC2\xBB",
        "\xBC" => "\xC2\xBC", "\xBD" => "\xC2\xBD", "\xBE" => "\xC2\xBE", "\xBF" => "\xC2\xBF",
        "\xC0" => "\xC3\x80", "\xC1" => "\xC3\x81", "\xC2" => "\xC3\x82", "\xC3" => "\xC3\x83",
        "\xC4" => "\xC3\x84", "\xC5" => "\xC3\x85", "\xC6" => "\xC3\x86", "\xC7" => "\xC3\x87",
        "\xC8" => "\xC3\x88", "\xC9" => "\xC3\x89", "\xCA" => "\xC3\x8A", "\xCB" => "\xC3\x8B",
        "\xCC" => "\xC3\x8C", "\xCD" => "\xC3\x8D", "\xCE" => "\xC3\x8E", "\xCF" => "\xC3\x8F",
        "\xD0" => "\xC3\x90", "\xD1" => "\xC3\x91", "\xD2" => "\xC3\x92", "\xD3" => "\xC3\x93",
        "\xD4" => "\xC3\x94", "\xD5" => "\xC3\x95", "\xD6" => "\xC3\x96", "\xD7" => "\xC3\x97",
        "\xD8" => "\xC3\x98", "\xD9" => "\xC3\x99", "\xDA" => "\xC3\x9A", "\xDB" => "\xC3\x9B",
        "\xDC" => "\xC3\x9C", "\xDD" => "\xC3\x9D", "\xDE" => "\xC3\x9E", "\xDF" => "\xC3\x9F",
        "\xE0" => "\xC3\xA0", "\xE1" => "\xC3\xA1", "\xE2" => "\xC3\xA2", "\xE3" => "\xC3\xA3",
        "\xE4" => "\xC3\xA4", "\xE5" => "\xC3\xA5", "\xE6" => "\xC3\xA6", "\xE7" => "\xC3\xA7",
        "\xE8" => "\xC3\xA8", "\xE9" => "\xC3\xA9", "\xEA" => "\xC3\xAA", "\xEB" => "\xC3\xAB",
        "\xEC" => "\xC3\xAC", "\xED" => "\xC3\xAD", "\xEE" => "\xC3\xAE", "\xEF" => "\xC3\xAF",
        "\xF0" => "\xC3\xB0", "\xF1" => "\xC3\xB1", "\xF2" => "\xC3\xB2", "\xF3" => "\xC3\xB3",
        "\xF4" => "\xC3\xB4", "\xF5" => "\xC3\xB5", "\xF6" => "\xC3\xB6", "\xF7" => "\xC3\xB7",
        "\xF8" => "\xC3\xB8", "\xF9" => "\xC3\xB9", "\xFA" => "\xC3\xBA", "\xFB" => "\xC3\xBB",
        "\xFC" => "\xC3\xBC", "\xFD" => "\xC3\xBD", "\xFE" => "\xC3\xBE", "\xFF" => "\xC3\xBF",
    );

    // strtr() with an array replaces each single-byte key with its UTF-8 sequence.
    return strtr($string, $convert_table);
}
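The label normalisation described in this message might be sketched as follows. The function name `normalize_encoding_label()` is hypothetical, and the alias table is a deliberately small, incomplete sample; it is not taken from the linked create.php script.

```php
<?php
/**
 * Maps an encoding label to one of the two encodings the parser handles.
 * Hypothetical helper: the alias table is a small, incomplete sample.
 *
 * @param string $label Encoding label, e.g. from a Content-Type header
 * @return string|null 'UTF-8', 'Windows-1252', or null if unrecognised
 */
function normalize_encoding_label($label)
{
    static $aliases = array(
        'utf-8'        => 'UTF-8',
        'utf8'         => 'UTF-8',
        'windows-1252' => 'Windows-1252',
        'cp1252'       => 'Windows-1252',
        // ISO-8859-1 and US-ASCII are deliberately treated as Windows-1252
        'iso-8859-1'   => 'Windows-1252',
        'iso_8859-1'   => 'Windows-1252',
        'latin1'       => 'Windows-1252',
        'us-ascii'     => 'Windows-1252',
        'ascii'        => 'Windows-1252',
    );

    // Compare case-insensitively, ignoring surrounding whitespace
    $key = strtolower(trim($label));

    return isset($aliases[$key]) ? $aliases[$key] : null;
}
```

A caller would convert the input only when the result is 'Windows-1252', and would need its own policy for a null result.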
Re: [whatwg] Byte-wise tokenization algorithm
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson i...@hixie.ch wrote:

> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>
>> 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts that, please let me know.)

I think there are some cases where it still should work but you might have to be a little careful. For example, "<table>foo" notionally results in three parse errors according to the spec (one for each character token that gets foster-parented), so "<table>☹" results in one if you work with Unicode characters but three if you treat each UTF-8 byte as a separate character token. But in practice, tokenisers emit sequence-of-many-characters tokens instead of single-character tokens, so they only emit one parse error for "<table>foo", and the html5lib test cases assume that behaviour; it should work identically if you have sequence-of-many-bytes tokens instead.

(Apparently only the distinction between 0 and more-than-0 parse errors is important as far as the spec is concerned, since that affects whether the document is conforming; but it seems useful for implementors to share test cases that are precise about exactly where all the parse errors are emitted, since that helps find bugs, and so the parse error count is relevant.)

--
Philip Taylor
exc...@gmail.com
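The counting difference described above can be illustrated with a short sketch. The variable names are made up for the illustration, and the arrays merely stand in for a tokenizer's output; this is not code from any of the implementations discussed.

```php
<?php
// U+2639 (WHITE FROWNING FACE) occupies three bytes in UTF-8.
$text = "\xE2\x98\xB9";

// One character token per byte: three tokens, hence three parse errors
// if each one is foster-parented separately.
$per_byte_tokens = str_split($text);

// One token for the whole run of character data: a single token,
// regardless of whether the run is measured in bytes or in characters.
$run_tokens = array($text);

echo count($per_byte_tokens), "\n"; // 3
echo count($run_tokens), "\n";      // 1
```

Emitting runs rather than single units is what keeps the parse error count the same for byte-wise and character-wise tokenizers.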
Re: [whatwg] Byte-wise tokenization algorithm
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given algorithm.

>> But an HTML5 implementation, according to the spec, must at a minimum support the UTF-8 and Windows-1252 encodings, so the overall implementation might not be, depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with a reasonably good iconv implementation, support for lots of encodings is possible. The implementation might not be fully conforming if iconv doesn't perform the proper (possibly context-sensitive; I haven't checked) substitution when it doesn't recognize a character, but it should be close.

I've never seen any way of getting iconv (at least via PHP) to do what HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is, however, possible using mbstring (which also has the advantage of not being system dependent), as well as with PHP6's Unicode support.

--
Geoffrey Sneddon
http://gsnedders.com/
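A minimal sketch of the mbstring approach mentioned above, assuming the mbstring extension is loaded. The helper name `to_utf8_with_replacement()` is hypothetical, and how many U+FFFD characters mbstring emits for a given invalid sequence can vary between PHP versions, so this is not guaranteed to match the spec byte for byte.

```php
<?php
/**
 * Hypothetical helper: converts a string to UTF-8 via mbstring, replacing
 * byte sequences that are invalid in the source encoding with U+FFFD.
 *
 * @param string $bytes Input bytes
 * @param string $from_encoding Encoding name understood by mbstring
 * @return string UTF-8 encoded string
 */
function to_utf8_with_replacement($bytes, $from_encoding)
{
    // Remember the current substitute character so it can be restored
    $previous = mb_substitute_character();

    // Use U+FFFD REPLACEMENT CHARACTER instead of the default "?"
    mb_substitute_character(0xFFFD);
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', $from_encoding);

    mb_substitute_character($previous);

    return $utf8;
}

// 0xFF can never appear in UTF-8, so it is replaced during conversion
$clean = to_utf8_with_replacement("a\xFFb", 'UTF-8');
```

Converting from UTF-8 to UTF-8, as in the last line, is a way to sanitise input that is merely claimed to be UTF-8 before handing it to a byte-wise tokenizer.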
Re: [whatwg] Byte-wise tokenization algorithm
On Sun, 21 Dec 2008, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation, no?

Right; conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent.

--
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
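Replacing a "U+" designation with its UTF-8 byte version could be done with a helper along these lines. The function name `codepoint_to_utf8()` is hypothetical; the sketch does not reject surrogates or values above U+10FFFF.

```php
<?php
/**
 * Hypothetical helper: returns the UTF-8 byte sequence for a code point,
 * so that a "U+XXXX" in the spec can be matched or emitted as bytes.
 * Does not validate its input (surrogates, values above 0x10FFFF).
 *
 * @param int $cp Unicode code point
 * @return string UTF-8 encoded bytes
 */
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        // One byte: identical to ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+FFFD REPLACEMENT CHARACTER as the byte string a byte-wise parser would emit
$replacement = codepoint_to_utf8(0xFFFD);
```

In practice such values would probably be precomputed as string constants rather than calculated while tokenizing.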
[whatwg] Byte-wise tokenization algorithm
I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.

2. Would such an implementation be conforming?

Cheers,
Edward
Re: [whatwg] Byte-wise tokenization algorithm
On Sat, 20 Dec 2008, Edward Z. Yang wrote:

> I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions:
>
> 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts that, please let me know.)

> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation, according to the spec, must at a minimum support the UTF-8 and Windows-1252 encodings, so the overall implementation might not be, depending on exactly how this is done.

HTH,
--
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
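The reason byte-wise matching is safe on valid UTF-8 is that every byte of a multi-byte sequence is 0x80 or above, so an ASCII delimiter such as "<" can never occur inside one. The sketch below illustrates only that property; `split_at_tag_open()` is a hypothetical helper, not the spec's tokenizer.

```php
<?php
/**
 * Hypothetical helper: splits input at "<" using byte offsets only.
 * It illustrates that searching byte-wise for an ASCII delimiter cannot
 * land in the middle of a multi-byte UTF-8 sequence.
 *
 * @param string $utf8 Input that is assumed to be valid UTF-8
 * @return array Runs of text, with each "<" as its own element
 */
function split_at_tag_open($utf8)
{
    $chunks = array();
    $pos = 0;
    $length = strlen($utf8); // length in bytes, not characters

    while ($pos < $length) {
        // Number of bytes before the next "<" (or before the end of input)
        $run = strcspn($utf8, '<', $pos);
        if ($run > 0) {
            $chunks[] = substr($utf8, $pos, $run);
            $pos += $run;
        }
        if ($pos < $length) {
            // The byte at $pos is the delimiter itself
            $chunks[] = '<';
            $pos++;
        }
    }

    return $chunks;
}

// "café<b": the two-byte "é" stays intact inside the first chunk
$chunks = split_at_tag_open("caf\xC3\xA9<b");
```

Because the text runs are copied as whole byte ranges, each chunk is itself valid UTF-8 whenever the input was.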