Yes, I understand how the HTML tokenizer works.  :)

Adam


On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msab...@apple.com> wrote:
> The various tokenizers / lexers handle LChar versus UChar input streams in
> different ways. Most of the other tokenizers are templatized on the input
> character type. In the case of HTML, the tokenizer handles a UChar character
> at a time. For 8-bit input streams, the zero extension of an LChar to a UChar
> is zero cost. There may be additional performance to be gained by doing all
> other possible handling in 8 bits, but, as you point out, an 8-bit stream can
> still contain escapes that need a UChar representation. A character-type
> template approach was deemed too unwieldy for the HTML tokenizer.
> The HTML tokenizer uses SegmentedStrings that can consist of substrings of
> either LChar or UChar. That is where the LChar-to-UChar zero extension
> happens for an 8-bit substring.
>
> My research at the time showed that there were very few UTF-16-only
> resources (<<5% IIRC), although I expect the number to grow.
>
> - Michael
>
>
> On Mar 7, 2013, at 11:11 AM, Adam Barth <aba...@webkit.org> wrote:
>
>> The HTMLTokenizer still works in UChars.  There's likely some
>> performance to be gained by moving it to an 8-bit character type.
>> There's some trickiness involved because HTML entities can expand to
>> characters outside of Latin-1. Also, it's unclear if we want two
>> tokenizers (one that's 8 bits wide and another that's 16 bits wide) or
>> if we should find a way for the 8-bit tokenizer to handle, for
>> example, UTF-16 encoded network responses.
>>
>> Adam
>>
>>
>> On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler <da...@apple.com> wrote:
>>> No. I retract my question. Sounds like we already have it right! Thanks
>>> for setting me straight.
>>>
>>> Maybe some day we could make a non-copying code path that points directly
>>> at the data in the SharedBuffer, but I have no idea if that'd be beneficial.
>>>
>>> -- Darin
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 7, 2013, at 10:01 AM, Michael Saboff <msab...@apple.com> wrote:
>>>
>>>> There is an all-ASCII case in TextCodecUTF8::decode(). It should be
>>>> keeping all ASCII data as 8-bit. TextCodecWindowsLatin1::decode() not
>>>> only has an all-ASCII case, it up-converts to 16 bits in only a couple of
>>>> rare cases. Is there some other case you don't think we are handling?
>>>>
>>>> - Michael
>>>>
>>>> On Mar 7, 2013, at 9:29 AM, Darin Adler <da...@apple.com> wrote:
>>>>
>>>>> Hi folks.
>>>>>
>>>>> Today, bytes that come in from the network get turned into UTF-16 by the 
>>>>> decoding process. We then turn some of them back into Latin-1 during the 
>>>>> parsing process. Should we make changes so there’s an 8-bit path? It 
>>>>> might be as simple as writing code that has more of an all-ASCII special 
>>>>> case in TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
>>>>>
>>>>> Is there something significant to be gained here? I’ve been wondering 
>>>>> this for a while, so I thought I’d ask the rest of the WebKit 
>>>>> contributors.
>>>>>
>>>>> -- Darin
>>>>> _______________________________________________
>>>>> webkit-dev mailing list
>>>>> webkit-dev@lists.webkit.org
>>>>> https://lists.webkit.org/mailman/listinfo/webkit-dev
>>>>
>