> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msab...@apple.com> wrote:
>> The various tokenizers / lexers handle LChar versus UChar input streams in 
>> different ways.  Most of the other tokenizers are templatized on the input 
>> character type.  The HTML tokenizer instead handles one UChar at a time.  
>> For 8-bit input streams, the zero extension of an LChar to a UChar is zero 
>> cost.  There may be additional performance to be gained by doing all other 
>> possible handling in 8 bits, but an 8-bit stream can still contain escapes 
>> that need a UChar representation, as you point out.  A character-type 
>> template approach was deemed too unwieldy for the HTML tokenizer.  The HTML 
>> tokenizer uses SegmentedStrings, which can consist of substrings of either 
>> LChar or UChar.  That is where the LChar to UChar zero extension happens 
>> for an 8-bit substring.
>> 
>> My research at the time showed that there were very few UTF-16-only 
>> resources (<<5% IIRC), although I expect the number to grow.
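
For readers following along, the zero extension described above can be 
sketched roughly as follows.  This is a minimal illustration, not WebKit's 
actual code: SubstringSketch, SegmentedStringSketch, consume() and drain() 
are hypothetical names invented here, and the LChar / UChar aliases are 
assumed to be an 8-bit and a 16-bit unsigned code unit respectively.  The 
point is only that the consumer always sees a UChar, and the widening from 
8 bits happens where a character is read out of an 8-bit substring.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using LChar = unsigned char;  // assumed: 8-bit code unit (Latin-1)
using UChar = char16_t;       // assumed: 16-bit code unit (UTF-16)

// Hypothetical stand-in for one piece of a segmented string.  A piece
// holds either 8-bit or 16-bit data, never both.
class SubstringSketch {
public:
    explicit SubstringSketch(std::vector<LChar> data)
        : m_is8Bit(true), m_data8(std::move(data)) {}
    explicit SubstringSketch(std::u16string data)
        : m_is8Bit(false), m_data16(std::move(data)) {}

    std::size_t length() const
    {
        return m_is8Bit ? m_data8.size() : m_data16.size();
    }

    // Always hands back a UChar.  For 8-bit data this is a plain zero
    // extension: the upper 8 bits of the result are zero.
    UChar at(std::size_t i) const
    {
        return m_is8Bit ? static_cast<UChar>(m_data8[i]) : m_data16[i];
    }

private:
    bool m_is8Bit;
    std::vector<LChar> m_data8;
    std::u16string m_data16;
};

// Hypothetical segmented string: a queue of pieces that may mix widths.
class SegmentedStringSketch {
public:
    void append(SubstringSketch piece)
    {
        if (piece.length())
            m_pieces.push_back(std::move(piece));
    }

    bool isEmpty() const { return m_pieces.empty(); }

    // Returns the next character as a UChar and advances past it.
    // Must not be called when isEmpty() is true.
    UChar consume()
    {
        UChar c = m_pieces.front().at(m_offset);
        if (++m_offset == m_pieces.front().length()) {
            m_pieces.pop_front();
            m_offset = 0;
        }
        return c;
    }

private:
    std::deque<SubstringSketch> m_pieces;
    std::size_t m_offset = 0;
};

// A consumer in the style of a one-UChar-at-a-time tokenizer loop.
std::u16string drain(SegmentedStringSketch& input)
{
    std::u16string out;
    while (!input.isEmpty())
        out.push_back(input.consume());
    return out;
}
```

Appending an 8-bit piece followed by a 16-bit piece and calling drain() 
yields a single UChar sequence, so the consumer never needs to know which 
width each piece was stored in.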

On Mar 7, 2013, at 2:16 PM, Adam Barth <aba...@webkit.org> wrote:
> Yes, I understand how the HTML tokenizer works.  :)

I didn't understand these details, and I really appreciate Michael describing 
them.  I'm also glad others on the mailing list had an opportunity to get 
something out of this.

~Brady

_______________________________________________
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev
