Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
Maciej, *I* deemed using a character type template for the HTMLTokenizer to be unwieldy. Given the existing SegmentedString input abstraction, it made logical sense to put the 8/16-bit handling there. Had I moved the 8/16-bit logic into the tokenizer itself, we might have needed to do 8-to-16-bit up-conversions whenever a SegmentedString had mixed bit-ness in its contained substrings. Even if that weren't the case, the patch would have been far larger and would likely have included tricky code for escapes.

As I got into the middle of the 8-bit strings work, I realized that not only could I keep performance parity, but some of the techniques I came up with offered good performance improvements. The HTMLTokenizer ended up being one of those cases. This patch required a couple of reworks for performance reasons and garnered a lot of discussion from various parts of the WebKit community. See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail. Ryosuke noted that this patch was responsible for a 24% improvement in the url-parser test on their bots (comment 47). My final performance results are in comment 43 and show between 1% and 9% progression on the various HTML parser tests.

Adam, if you believe there is more work to be done in the HTMLTokenizer, file a bug and cc me. I'm interested in hearing your thoughts.

- Michael

On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak m...@apple.com wrote:

> It would be good to find out who it was that said that (or more specifically: "Using a character type template approach was deemed to be too unwieldy for the HTML tokenizer.") so you can talk to them about it. Michael?
> Regards, Maciej

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
https://lists.webkit.org/mailman/listinfo/webkit-dev
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
Oh, OK. I misunderstood your original message to say that the project as a whole had reached this conclusion, which certainly isn't the case, rather than that you personally had reached that conclusion.

As for the long-term direction of the HTML parser, my guess is that the optimum design will be to deliver the network bytes to the parser directly on the parser thread. On the parser thread, we can merge charset decoding, input stream pre-processing, and tokenization to move directly from network bytes to CompactHTMLTokens. That approach removes a number of copies, 8-bit-to-16-bit conversions, and 16-bit-to-8-bit conversions. Parsing directly into CompactHTMLTokens also means we won't have to do any copies or conversions at all for well-known strings (e.g., div and friends from HTMLNames).

If you're about to reply complaining about the above, please save your complaints for another time. I realize that some parts of that design will be difficult or impossible to implement on some ports due to limitations on how they interact with their networking stack. In any case, I don't plan to implement that design anytime soon, and I'm sure we'll have plenty of time to discuss its merits in the future.

Adam

On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff msab...@apple.com wrote:

> Maciej, *I* deemed using a character type template for the HTMLTokenizer as being unwieldy. Given there was the existing SegmentedString input abstraction, it made logical sense to put the 8/16 bit coding there. [...]
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:

> As for the long-term direction of the HTML parser, my guess is that the optimum design will be to deliver the network bytes to the parser directly on the parser thread.

Sounds right to me.

> If you're about to reply complaining about the above, please save your complaints for another time.

Huh?

-- Darin
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
No complaints with the long-term direction. I agree that it is a tall order to implement.

- Michael

On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:

> As for the long-term direction of the HTML parser, my guess is that the optimum design will be to deliver the network bytes to the parser directly on the parser thread. [...] In any case, I don't plan to implement that design anytime soon, and I'm sure we'll have plenty of time to discuss its merits in the future.
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Mon, Mar 11, 2013 at 9:56 AM, Darin Adler da...@apple.com wrote:

> On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:
>> If you're about to reply complaining about the above, please save your complaints for another time.
>
> Huh?

The last time we tried to talk about changing the design of the HTML parser on this mailing list, I got the third degree: https://lists.webkit.org/pipermail/webkit-dev/2013-January/023271.html

I just wanted to be clear that I'm not proposing making those changes now and that we'll have a chance to discuss the various pros and cons of each step as we consider making them.

Adam
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Mar 7, 2013 10:37 PM, Brady Eidson beid...@apple.com wrote:

> On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:
>> The various tokenizers / lexers work various ways to handle LChar versus UChar input streams. [...]
>
> On Mar 7, 2013, at 2:16 PM, Adam Barth aba...@webkit.org wrote:
>> Yes, I understand how the HTML tokenizer works. :)
>
> I didn't understand these details, and I really appreciate Michael describing them. I'm also glad others on the mailing list had an opportunity to get something out of this.
>
> ~Brady

I agree with Brady. I got some interesting learning out of this thread. Always nice to read explanations and documentation about how things work. Valuable content.

Luis
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt l...@debethencourt.com wrote:

> I agree with Brady. I got some interesting learning out of this thread. Always nice to read explanations and documentation about how things work. Valuable content.

In retrospect, I think what I was reacting to was msaboff's statement that an unnamed group of people had decided that the HTML tokenizer was too unwieldy to have a dedicated 8-bit path. In particular, it's unclear to me who made that decision. I certainly do not consider the matter decided.

Adam
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote:

> In retrospect, I think what I was reacting to was msaboff's statement that an unnamed group of people had decided that the HTML tokenizer was too unwieldy to have a dedicated 8-bit path. In particular, it's unclear to me who made that decision. I certainly do not consider the matter decided.

It would be good to find out who it was that said that (or more specifically: "Using a character type template approach was deemed to be too unwieldy for the HTML tokenizer.") so you can talk to them about it. Michael?

Regards, Maciej
[webkit-dev] Should we create an 8-bit path from the network stack to the parser?
Hi folks.

Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there's an 8-bit path? It might be as simple as writing code that has more of an all-ASCII special case in TextCodecUTF8 and something similar in TextCodecWindowsLatin1.

Is there something significant to be gained here? I've been wondering this for a while, so I thought I'd ask the rest of the WebKit contributors.

-- Darin
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping all-ASCII data as 8-bit. TextCodecWindowsLatin1::decode() not only has an all-ASCII case, it also only up-converts to 16-bit in a couple of rare cases. Is there some other case you don't think we are handling?

- Michael

On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:

> Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there's an 8-bit path? [...]
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
No, I retract my question. Sounds like we already have it right! Thanks for setting me straight.

Maybe some day we could make a non-copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial.

-- Darin

Sent from my iPhone

On Mar 7, 2013, at 10:01 AM, Michael Saboff msab...@apple.com wrote:

> There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping all ASCII data as 8 bit. TextCodecWindowsLatin1::decode() has not only an all-ASCII case, but it only up converts to 16 bit in a couple of rare cases. [...]
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
The HTMLTokenizer still works in UChars. There's likely some performance to be gained by moving it to an 8-bit character type. There's some trickiness involved because HTML entities can expand to characters outside of Latin-1. Also, it's unclear if we want two tokenizers (one that's 8 bits wide and another that's 16 bits wide) or if we should find a way for the 8-bit tokenizer to handle, for example, UTF-16 encoded network responses.

Adam

On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler da...@apple.com wrote:

> No. I retract my question. Sounds like we already have it right! Thanks for setting me straight. Maybe some day we could make a non-copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial.
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
The various tokenizers / lexers work various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar character at a time. For 8-bit input streams, the zero extension of an LChar to a UChar is zero cost. There may be additional performance to be gained by doing all other possible handling in 8 bits, but an 8-bit stream can still contain escapes that need a UChar representation, as you point out. Using a character type template approach was deemed to be too unwieldy for the HTML tokenizer.

The HTML tokenizer uses SegmentedStrings that can consist of substrings with either LChar or UChar. That is where the LChar-to-UChar zero extension happens for an 8-bit substring. My research at the time showed that there were very few UTF-16-only resources (5% IIRC), although I expect the number to grow.

- Michael

On Mar 7, 2013, at 11:11 AM, Adam Barth aba...@webkit.org wrote:

> The HTMLTokenizer still works in UChars. There's likely some performance to be gained by moving it to an 8-bit character type. There's some trickiness involved because HTML entities can expand to characters outside of Latin-1. [...]
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
Yes, I understand how the HTML tokenizer works. :)

Adam

On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:

> The various tokenizers / lexers work various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar character at a time. [...]
Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?
On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:

> The various tokenizers / lexers work various ways to handle LChar versus UChar input streams. [...]

On Mar 7, 2013, at 2:16 PM, Adam Barth aba...@webkit.org wrote:

> Yes, I understand how the HTML tokenizer works. :)

I didn't understand these details, and I really appreciate Michael describing them. I'm also glad others on the mailing list had an opportunity to get something out of this.

~Brady