> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msab...@apple.com> wrote:
>> The various tokenizers / lexers handle LChar versus UChar input streams
>> in various ways. Most of the other tokenizers are templatized on input
>> character type. In the case of HTML, the tokenizer handles a UChar
>> character at a time. For 8 bit input streams, the zero extension of a
>> LChar to a UChar is zero cost. There may be additional performance to be
>> gained by doing all other possible handling in 8 bits, but an 8 bit stream
>> can still contain escapes that need a UChar representation, as you point
>> out. Using a character type template approach was deemed to be too
>> unwieldy for the HTML tokenizer. The HTML tokenizer uses SegmentedStrings
>> that can consist of sub strings of either LChar or UChar. That is where
>> the LChar to UChar zero extension happens for an 8 bit sub string.
>>
>> My research at the time showed that there were very few UTF-16 only
>> resources (<<5% IIRC), although I expect the number to grow.
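
For readers following along, the zero extension described in the quoted text can be sketched roughly as below. This is only an illustration under stated assumptions, not WebKit's actual code: the `LChar` and `UChar` aliases are stand-ins, and `Substring`, `SegmentedInput`, `at`, and `decode` are hypothetical names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins for the character types (assumed definitions for this sketch).
using LChar = unsigned char;  // 8-bit code unit
using UChar = char16_t;       // 16-bit code unit

// Hypothetical substring that holds either 8-bit or 16-bit data, never both.
struct Substring {
    bool is8Bit;
    std::vector<LChar> data8;
    std::vector<UChar> data16;

    std::size_t length() const { return is8Bit ? data8.size() : data16.size(); }

    // Always hands back a UChar. For 8-bit data this is a plain zero
    // extension: the high byte of the result is 0.
    UChar at(std::size_t i) const {
        return is8Bit ? static_cast<UChar>(data8[i]) : data16[i];
    }
};

// Hypothetical segmented input: a sequence of substrings of mixed width.
struct SegmentedInput {
    std::vector<Substring> segments;

    // Walk every segment one UChar at a time, the way a consumer that only
    // understands UChar would see the stream.
    std::u16string decode() const {
        std::u16string out;
        for (const Substring& s : segments)
            for (std::size_t i = 0; i < s.length(); ++i)
                out.push_back(s.at(i));
        return out;
    }
};
```

The consumer only ever sees `UChar`, so it needs no template parameter for the character type; the width check lives in one place, at the substring boundary.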
On Mar 7, 2013, at 2:16 PM, Adam Barth <aba...@webkit.org> wrote:
> Yes, I understand how the HTML tokenizer works. :)

I didn't understand these details, and I really appreciate Michael describing them. I'm also glad others on the mailing list had an opportunity to get something out of this.

~Brady
_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev