Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon


On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)


Indeed it is possible (or at least it certainly was a year and a half  
ago, but I have seen nothing change that would stop it).
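
A minimal sketch of why this works, assuming the input is already valid UTF-8: every byte of a multi-byte UTF-8 sequence has its high bit set (0x80 to 0xFF), so a byte-wise scan for an ASCII delimiter such as "<" cannot match in the middle of a non-ASCII character. The function name find_tag_opens() is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical helper: return the byte offsets of every "<" in a UTF-8
// string, using byte-wise operations only.
function find_tag_opens($utf8)
{
    $offsets = array();
    $length = strlen($utf8); // strlen() counts bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        // Bytes >= 0x80 belong to multi-byte sequences and never equal
        // "<" (0x3C), so no decoding is needed before comparing.
        if ($utf8[$i] === '<') {
            $offsets[] = $i;
        }
    }
    return $offsets;
}

// U+00E9 is C3 A9 and U+2639 is E2 98 B9; neither contains an ASCII byte.
$input = "<p>caf\xC3\xA9 \xE2\x98\xB9</p>";
print_r(find_tag_opens($input));
```

The offsets are byte offsets, which is all a byte-wise tokenizer needs as long as it never reports positions in characters.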



>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not be,
> depending on exactly how this is done.


That should be no problem: just convert Windows-1252 to UTF-8 using
strtr() (as it is an SBCS this is simple enough; doing the inverse is
not); see the attached file. Then all you need to do is normalize the
character set name to match all aliases of Windows-1252 and UTF-8, as
well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to
Windows-1252.
http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php
does that (the only dependency is for getting the file via HTTP, which
can be replaced with cURL if you would rather require only that).
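
The normalization step might look roughly like the sketch below. It is not the code from create.php: the function name normalize_charset() is hypothetical, and the alias table is a small illustrative subset rather than the full list of registered aliases.

```php
<?php
// Hypothetical helper: map a charset label to one of the two encodings that
// have to be supported. Returns null for labels it does not know.
function normalize_charset($label)
{
    // Illustrative subset only; a real table would cover every alias.
    static $aliases = array(
        'utf-8'        => 'UTF-8',
        'utf8'         => 'UTF-8',
        'windows-1252' => 'windows-1252',
        'cp1252'       => 'windows-1252',
        // ISO-8859-1 and US-ASCII are mapped to Windows-1252.
        'iso-8859-1'   => 'windows-1252',
        'iso8859-1'    => 'windows-1252',
        'latin1'       => 'windows-1252',
        'us-ascii'     => 'windows-1252',
        'ascii'        => 'windows-1252',
    );
    $key = strtolower(trim($label)); // labels are compared case-insensitively
    return isset($aliases[$key]) ? $aliases[$key] : null;
}

var_dump(normalize_charset(' Latin1 '));
var_dump(normalize_charset('UTF8'));
var_dump(normalize_charset('koi8-r'));
```

Treating ISO-8859-1 and US-ASCII as Windows-1252 is safe for decoding because Windows-1252 agrees with both wherever they define printable characters.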



--
Geoffrey Sneddon
http://gsnedders.com/
<?php

/**
 * Converts a Windows-1252 encoded string to a UTF-8 encoded string
 *
 * @copyright 2008 Geoffrey Sneddon
 * @license http://www.opensource.org/licenses/bsd-license.php BSD License
 * @param string $string Windows-1252 encoded string
 * @return string UTF-8 encoded string
 */
function windows_1252_to_utf8($string)
{
    // Each key is one Windows-1252 byte; each value is its UTF-8 encoding.
    // Bytes that Windows-1252 leaves undefined map to U+FFFD (EF BF BD).
    static $convert_table = array(
        "\x80" => "\xE2\x82\xAC",
        "\x81" => "\xEF\xBF\xBD",
        "\x82" => "\xE2\x80\x9A",
        "\x83" => "\xC6\x92",
        "\x84" => "\xE2\x80\x9E",
        "\x85" => "\xE2\x80\xA6",
        "\x86" => "\xE2\x80\xA0",
        "\x87" => "\xE2\x80\xA1",
        "\x88" => "\xCB\x86",
        "\x89" => "\xE2\x80\xB0",
        "\x8A" => "\xC5\xA0",
        "\x8B" => "\xE2\x80\xB9",
        "\x8C" => "\xC5\x92",
        "\x8D" => "\xEF\xBF\xBD",
        "\x8E" => "\xC5\xBD",
        "\x8F" => "\xEF\xBF\xBD",
        "\x90" => "\xEF\xBF\xBD",
        "\x91" => "\xE2\x80\x98",
        "\x92" => "\xE2\x80\x99",
        "\x93" => "\xE2\x80\x9C",
        "\x94" => "\xE2\x80\x9D",
        "\x95" => "\xE2\x80\xA2",
        "\x96" => "\xE2\x80\x93",
        "\x97" => "\xE2\x80\x94",
        "\x98" => "\xCB\x9C",
        "\x99" => "\xE2\x84\xA2",
        "\x9A" => "\xC5\xA1",
        "\x9B" => "\xE2\x80\xBA",
        "\x9C" => "\xC5\x93",
        "\x9D" => "\xEF\xBF\xBD",
        "\x9E" => "\xC5\xBE",
        "\x9F" => "\xC5\xB8",
        "\xA0" => "\xC2\xA0",
        "\xA1" => "\xC2\xA1",
        "\xA2" => "\xC2\xA2",
        "\xA3" => "\xC2\xA3",
        "\xA4" => "\xC2\xA4",
        "\xA5" => "\xC2\xA5",
        "\xA6" => "\xC2\xA6",
        "\xA7" => "\xC2\xA7",
        "\xA8" => "\xC2\xA8",
        "\xA9" => "\xC2\xA9",
        "\xAA" => "\xC2\xAA",
        "\xAB" => "\xC2\xAB",
        "\xAC" => "\xC2\xAC",
        "\xAD" => "\xC2\xAD",
        "\xAE" => "\xC2\xAE",
        "\xAF" => "\xC2\xAF",
        "\xB0" => "\xC2\xB0",
        "\xB1" => "\xC2\xB1",
        "\xB2" => "\xC2\xB2",
        "\xB3" => "\xC2\xB3",
        "\xB4" => "\xC2\xB4",
        "\xB5" => "\xC2\xB5",
        "\xB6" => "\xC2\xB6",
        "\xB7" => "\xC2\xB7",
        "\xB8" => "\xC2\xB8",
        "\xB9" => "\xC2\xB9",
        "\xBA" => "\xC2\xBA",
        "\xBB" => "\xC2\xBB",
        "\xBC" => "\xC2\xBC",
        "\xBD" => "\xC2\xBD",
        "\xBE" => "\xC2\xBE",
        "\xBF" => "\xC2\xBF",
        "\xC0" => "\xC3\x80",
        "\xC1" => "\xC3\x81",
        "\xC2" => "\xC3\x82",
        "\xC3" => "\xC3\x83",
        "\xC4" => "\xC3\x84",
        "\xC5" => "\xC3\x85",
        "\xC6" => "\xC3\x86",
        "\xC7" => "\xC3\x87",
        "\xC8" => "\xC3\x88",
        "\xC9" => "\xC3\x89",
        "\xCA" => "\xC3\x8A",
        "\xCB" => "\xC3\x8B",
        "\xCC" => "\xC3\x8C",
        "\xCD" => "\xC3\x8D",
        "\xCE" => "\xC3\x8E",
        "\xCF" => "\xC3\x8F",
        "\xD0" => "\xC3\x90",
        "\xD1" => "\xC3\x91",
        "\xD2" => "\xC3\x92",
        "\xD3" => "\xC3\x93",
        "\xD4" => "\xC3\x94",
        "\xD5" => "\xC3\x95",
        "\xD6" => "\xC3\x96",
        "\xD7" => "\xC3\x97",
        "\xD8" => "\xC3\x98",
        "\xD9" => "\xC3\x99",
        "\xDA" => "\xC3\x9A",
        "\xDB" => "\xC3\x9B",
        "\xDC" => "\xC3\x9C",
        "\xDD" => "\xC3\x9D",
        "\xDE" => "\xC3\x9E",
        "\xDF" => "\xC3\x9F",
        "\xE0" => "\xC3\xA0",
        "\xE1" => "\xC3\xA1",
        "\xE2" => "\xC3\xA2",
        "\xE3" => "\xC3\xA3",
        "\xE4" => "\xC3\xA4",
        "\xE5" => "\xC3\xA5",
        "\xE6" => "\xC3\xA6",
        "\xE7" => "\xC3\xA7",
        "\xE8" => "\xC3\xA8",
        "\xE9" => "\xC3\xA9",
        "\xEA" => "\xC3\xAA",
        "\xEB" => "\xC3\xAB",
        "\xEC" => "\xC3\xAC",
        "\xED" => "\xC3\xAD",
        "\xEE" => "\xC3\xAE",
        "\xEF" => "\xC3\xAF",
        "\xF0" => "\xC3\xB0",
        "\xF1" => "\xC3\xB1",
        "\xF2" => "\xC3\xB2",
        "\xF3" => "\xC3\xB3",
        "\xF4" => "\xC3\xB4",
        "\xF5" => "\xC3\xB5",
        "\xF6" => "\xC3\xB6",
        "\xF7" => "\xC3\xB7",
        "\xF8" => "\xC3\xB8",
        "\xF9" => "\xC3\xB9",
        "\xFA" => "\xC3\xBA",
        // The archived attachment is cut off after \xFA. The remaining
        // entries and the strtr() call are reconstructed from the pattern
        // above and from the message text, which names strtr().
        "\xFB" => "\xC3\xBB",
        "\xFC" => "\xC3\xBC",
        "\xFD" => "\xC3\xBD",
        "\xFE" => "\xC3\xBE",
        "\xFF" => "\xC3\xBF",
    );

    // Bytes below 0x80 are ASCII and pass through strtr() unchanged.
    return strtr($string, $convert_table);
}


Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Philip Taylor
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson i...@hixie.ch wrote:
> On Sat, 20 Dec 2008, Edward Z. Yang wrote:
>>
>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

I think there are some cases where it still should work but you might
have to be a little careful - e.g. "<table>foo" notionally results in
three parse errors according to the spec (one for each character token
which gets foster-parented), so "<table>☹" results in one if you work
with Unicode characters but three if you treat each UTF-8 byte as a
separate character token.

But in practice, tokenisers emit sequence-of-many-characters tokens
instead of single-character tokens, so they only emit one parse error
for "<table>foo", and the html5lib test cases assume that behaviour,
and it should work identically if you have sequence-of-many-bytes
tokens instead.
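
The counts discussed above can be checked with a rough sketch. It assumes one parse error per emitted character token and looks only at the text that follows the table start tag; it is not a tokeniser, and all variable names are made up for the illustration.

```php
<?php
// The character data that follows the table start tag in the two examples.
$ascii = "foo";
$frown = "\xE2\x98\xB9"; // U+2639, three bytes in UTF-8

// One token per unit: bytes versus Unicode characters.
$bytes_foo   = strlen($ascii);                      // byte-wise
$chars_foo   = preg_match_all('/./us', $ascii, $m); // character-wise
$bytes_frown = strlen($frown);
$chars_frown = preg_match_all('/./us', $frown, $m);

// One token per run of character data: the unit no longer matters.
$runs_frown_bytes = preg_match_all('/[^<]+/', $frown, $m);
$runs_frown_chars = preg_match_all('/[^<]+/u', $frown, $m);

echo "foo: $bytes_foo byte tokens, $chars_foo character tokens\n";
echo "frown: $bytes_frown byte tokens, $chars_frown character tokens\n";
echo "frown as runs: $runs_frown_bytes (bytes), $runs_frown_chars (characters)\n";
```

Per-unit counting disagrees only for the non-ASCII input, while per-run counting gives the same number either way, which is why grouped tokens keep byte-wise and character-wise implementations in step with shared test cases.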

(Apparently only the distinction between 0 and more-than-0 parse
errors is important as far as the spec is concerned, since that has an
effect on whether the document is conforming; but it seems useful for
implementors to share test cases that are precise about exactly where
all the parse errors are emitted, since that helps find bugs, and so
the parse error count is relevant.)

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon


On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:


> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace "character" globally with "byte", and any U+ designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given
algorithm.

>> But an HTML5 implementation,
>> according to the spec, must at a minimum support the UTF-8 and
>> Windows-1252 encodings, so the overall implementation might not be,
>> depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with a
> reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.


I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system dependent), as well as with PHP6's Unicode support.
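
One way the mbstring route could look, as a sketch only: it assumes the mbstring extension is loaded, and scrub_utf8() is a hypothetical name. How many U+FFFD characters mbstring emits for a given ill-formed sequence may differ between PHP versions and may not match the spec exactly, so the result should be checked against the spec's test cases.

```php
<?php
// Hypothetical helper: turn an arbitrary byte string into valid UTF-8 by
// substituting U+FFFD for anything mbstring considers invalid.
// Assumes the mbstring extension is available.
function scrub_utf8($bytes)
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character(0xFFFD);       // substitute U+FFFD instead of "?"
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);    // restore the setting
    return $clean;
}

// 0xFF can never appear in UTF-8, so it has to be substituted.
$clean = scrub_utf8("abc\xFFdef");
var_dump(mb_check_encoding($clean, 'UTF-8'));
var_dump(strpos($clean, "\xEF\xBF\xBD") !== false);
```

Once the input has been scrubbed like this, the "known to be valid UTF-8" precondition for byte-wise tokenization holds.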



--
Geoffrey Sneddon
http://gsnedders.com/



Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Ian Hickson
On Sun, 21 Dec 2008, Edward Z. Yang wrote:
 
> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace "character" globally with "byte", and any U+ designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

Right; conformance requirements phrased as algorithms or specific steps 
may be implemented in any manner, so long as the end result is equivalent.
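
Replacing a "U+" designation with its UTF-8 bytes can be sketched as follows. The function name codepoint_to_utf8() is hypothetical, and the sketch assumes its argument is a valid Unicode scalar value (0 to 0x10FFFF, not a surrogate).

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes, so that
// a "U+XXXX" in the spec can be written as a byte-string constant.
// Assumes $cp is a valid scalar value; no range checking is done.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp); // one byte: 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F)); // two bytes
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F)); // three bytes
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F)); // four bytes
}

// U+FFFD REPLACEMENT CHARACTER becomes the three bytes EF BF BD.
echo bin2hex(codepoint_to_utf8(0xFFFD)), "\n"; // efbfbd
```

A byte-wise implementation would compute such constants once and compare or emit them as plain strings, which keeps the end result equivalent to the character-based wording.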

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


[whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Edward Z. Yang
I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since all of the character matching parts of the
algorithm map to characters in ASCII space.

2. Would such an implementation be conforming?

Cheers,
Edward


Re: [whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Ian Hickson
On Sat, 20 Dec 2008, Edward Z. Yang wrote:

> I am currently working on a PHP5 implementation of the HTML5
> specification. PHP has abysmal Unicode support, and implementing Unicode
> streams in userspace may be unacceptably slow. Thus, my questions:
>
> 1. Given an input stream that is known to be valid UTF-8, is it possible
> to implement the tokenization algorithm with byte-wise operations only?
> I think it's possible, since all of the character matching parts of the
> algorithm map to characters in ASCII space.

Yes. (At least, that's the intent; if you find anything that contradicts 
that, please let me know.)


> 2. Would such an implementation be conforming?

Looking just at parsing, yes, probably... But an HTML5 implementation,
according to the spec, must at a minimum support the UTF-8 and
Windows-1252 encodings, so the overall implementation might not be,
depending on exactly how this is done.

HTH,
-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'