On 21 Dec 2008, at 05:41, Ian Hickson wrote:
1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since all of the character matching parts of the
algorithm map to
On Sun, Dec 21, 2008 at 5:41 AM, Ian Hickson i...@hixie.ch wrote:
On Sat, 20 Dec 2008, Edward Z. Yang wrote:
1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since all of the
On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:
I suppose the big pivot point is "as if". A byte-wise implementation
would replace "character" globally with "byte", and any U+ designation
with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
the actual algorithm implementation,
On Sun, 21 Dec 2008, Edward Z. Yang wrote:
I suppose the big pivot point is "as if". A byte-wise implementation
would replace "character" globally with "byte", and any U+ designation
with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
the actual algorithm implementation,
On Sat, 20 Dec 2008, Edward Z. Yang wrote:
I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:
1. Given an input stream that is known to be