Yes, I understand how the HTML tokenizer works. :)

Adam
On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff <msab...@apple.com> wrote:
> The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar character at a time. For 8 bit input streams, the zero extension of an LChar to a UChar is zero cost. There may be additional performance to be gained by doing all other possible handling in 8 bits, but an 8 bit stream can still contain escapes that need a UChar representation, as you point out. Using a character type template approach was deemed too unwieldy for the HTML tokenizer. The HTML tokenizer uses SegmentedStrings that can consist of substrings of either LChar or UChar. That is where the LChar to UChar zero extension happens for an 8 bit substring.
>
> My research at the time showed that there were very few UTF-16-only resources (<<5% IIRC), although I expect the number to grow.
>
> - Michael
>
>
> On Mar 7, 2013, at 11:11 AM, Adam Barth <aba...@webkit.org> wrote:
>
>> The HTMLTokenizer still works in UChars. There's likely some performance to be gained by moving it to an 8-bit character type. There's some trickiness involved because HTML entities can expand to characters outside of Latin-1. Also, it's unclear whether we want two tokenizers (one that's 8 bits wide and another that's 16 bits wide) or whether we should find a way for the 8-bit tokenizer to handle, for example, UTF-16 encoded network responses.
>>
>> Adam
>>
>>
>> On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler <da...@apple.com> wrote:
>>> No. I retract my question. Sounds like we already have it right! Thanks for setting me straight.
>>>
>>> Maybe some day we could make a non-copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial.
>>>
>>> -- Darin
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 7, 2013, at 10:01 AM, Michael Saboff <msab...@apple.com> wrote:
>>>
>>>> There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping all ASCII data as 8 bit. TextCodecWindowsLatin1::decode() not only has an all-ASCII case, it also only up-converts to 16 bit in a couple of rare cases. Is there some other case you don't think we are handling?
>>>>
>>>> - Michael
>>>>
>>>> On Mar 7, 2013, at 9:29 AM, Darin Adler <da...@apple.com> wrote:
>>>>
>>>>> Hi folks.
>>>>>
>>>>> Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there’s an 8-bit path? It might be as simple as writing code that has more of an all-ASCII special case in TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
>>>>>
>>>>> Is there something significant to be gained here? I’ve been wondering this for a while, so I thought I’d ask the rest of the WebKit contributors.
>>>>>
>>>>> -- Darin
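
To make Michael's first point concrete: "templatized on input character type" means the same scanning logic is compiled once for 8-bit and once for 16-bit input. A minimal sketch, not the actual WebKit source (the helper name is made up, and WebKit's real UChar typedef comes from ICU):

    #include <cstddef>

    typedef unsigned char LChar; // 8-bit code unit, Latin-1 range
    typedef char16_t UChar;      // 16-bit code unit, UTF-16

    // The hot loop is instantiated per character type, so the 8-bit path
    // never pays for widening characters it doesn't need to widen.
    template <typename CharacterType>
    static size_t countLeadingSpaces(const CharacterType* characters, size_t length)
    {
        size_t i = 0;
        while (i < length && characters[i] == ' ')
            ++i;
        return i;
    }

    // Callers dispatch on the string's storage:
    //   is8Bit() ? countLeadingSpaces(characters8(), length)
    //            : countLeadingSpaces(characters16(), length)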
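The "zero cost" zero extension he describes for the HTML tokenizer looks roughly like this: each substring of the stream is stored as either 8-bit or 16-bit data, and the tokenizer always reads a UChar. Again a sketch, assuming a much-simplified stand-in for SegmentedString:

    #include <cstddef>

    typedef unsigned char LChar;
    typedef char16_t UChar;

    // One substring of the stream; storage is 8-bit or 16-bit, but the
    // tokenizer only ever sees UChar.
    struct Substring {
        bool is8Bit;
        union {
            const LChar* characters8;
            const UChar* characters16;
        };
    };

    static UChar characterAt(const Substring& substring, size_t index)
    {
        if (substring.is8Bit)
            return substring.characters8[index]; // implicit zero extension to 16 bits
        return substring.characters16[index];
    }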
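The entity trickiness Adam raises is easy to see with an example: a byte stream that is pure ASCII on the wire can still decode to characters that need 16 bits once entities are expanded, so an 8-bit tokenizer needs an escape hatch. Illustrative only, with a hypothetical helper:

    // "price: &euro;100" is all ASCII as bytes, but expanding &euro;
    // yields U+20AC EURO SIGN, which does not fit in an 8-bit LChar.
    static bool entityFitsInLatin1(char32_t codePoint) // hypothetical helper
    {
        return codePoint <= 0xFF;
    }
    // entityFitsInLatin1(0x20AC) == false, so at least this string
    // has to go 16-bit after entity expansion.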
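And the all-ASCII case Michael points to in TextCodecUTF8::decode() boils down to a cheap scan: ASCII bytes are valid Latin-1 as-is, so the decoder can keep the data 8-bit and skip sequence decoding entirely. A sketch of the idea only; the real codec also has to handle partial sequences, decode errors, and non-ASCII appearing mid-buffer:

    #include <cstddef>

    // Pure-ASCII input (no byte with the high bit set) can be stored
    // directly as an 8-bit string with no UTF-8 decoding at all.
    static bool isAllASCII(const unsigned char* bytes, size_t length)
    {
        for (size_t i = 0; i < length; ++i) {
            if (bytes[i] & 0x80)
                return false;
        }
        return true;
    }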