Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon
On 21 Dec 2008, at 05:41, Ian Hickson wrote: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Philip Taylor
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson i...@hixie.ch wrote: On Sat, 20 Dec 2008, Edward Z. Yang wrote: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Geoffrey Sneddon
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote: I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation,

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-21 Thread Ian Hickson
On Sun, 21 Dec 2008, Edward Z. Yang wrote: I suppose the big pivot point is "as if". A byte-wise implementation would replace "character" globally with "byte", and any U+ designation with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not the actual algorithm implementation,

Re: [whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Ian Hickson
On Sat, 20 Dec 2008, Edward Z. Yang wrote: I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions: 1. Given an input stream that is known to be
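
For readers following the thread, here is a minimal sketch of the idea under discussion: on input that is already known to be valid UTF-8, an ASCII delimiter such as U+003C can be matched as the single byte 0x3C, because bytes below 0x80 never occur inside a multi-byte sequence. It is written in Python rather than PHP for brevity, the function names are hypothetical, and it is not taken from any message in the thread or from the HTML 5 tokenizer itself. It only compares a byte-wise scan against a character-wise one.

```python
# Illustrative sketch only; helper names are hypothetical.
# Assumption: the input has already been validated as UTF-8.

def encode_code_point(cp):
    """Return the UTF-8 byte sequence for a 'U+XXXX' designation."""
    return chr(cp).encode("utf-8")

def tag_open_offsets_bytewise(data):
    """Find byte offsets of '<' by comparing raw bytes, with no decoding."""
    return [i for i, b in enumerate(data) if b == 0x3C]

def tag_open_offsets_charwise(data):
    """Find the same offsets by decoding to characters first."""
    offsets = []
    pos = 0  # running byte offset of the current character
    for ch in data.decode("utf-8"):
        if ch == "<":
            offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets

# Sample input mixing one-, two- and three-byte characters.
sample = "é<p>日本語</p>".encode("utf-8")
```

Both scans should report the same offsets for `sample`, which is the "as if" equivalence the thread is asking about. Matching a non-ASCII code point byte-wise would mean comparing against the whole sequence returned by `encode_code_point`, for example the three bytes of U+FFFD.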