Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
Maciej, *I* deemed a character-type template for the HTMLTokenizer unwieldy. Given the existing SegmentedString input abstraction, it made logical sense to put the 8/16-bit coding there. If I had moved the 8/16-bit logic into the tokenizer itself, we might have

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
Oh, Ok. I misunderstood your original message to say that the project as a whole had reached this conclusion, which certainly isn't the case, rather than that you personally had reached that conclusion. As for the long-term direction of the HTML parser, my guess is that the optimum design will

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Darin Adler
On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote: As for the long-term direction of the HTML parser, my guess is that the optimum design will be to deliver the network bytes to the parser directly on the parser thread. Sounds right to me. If you're about to reply

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
No complaints with the long term direction. I agree that it is a tall order to implement. - Michael On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote: Oh, Ok. I misunderstood your original message to say that the project as a whole had reached this conclusion, which certainly

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
On Mon, Mar 11, 2013 at 9:56 AM, Darin Adler da...@apple.com wrote: On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote: If you're about to reply complaining about the above, please save your complaints for another time. Huh? The last time we tried to talk about changing the

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Luis de Bethencourt
On Mar 7, 2013 10:37 PM, Brady Eidson beid...@apple.com wrote: On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote: The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Adam Barth
On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt l...@debethencourt.com wrote: On Mar 7, 2013 10:37 PM, Brady Eidson beid...@apple.com wrote: On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote: The various tokenizers / lexers work in various ways to handle LChar

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Maciej Stachowiak
On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote: In retrospect, I think what I was reacting to was msaboff's statement that an unnamed group of people had decided that the HTML tokenizer was too unwieldy to have a dedicated 8-bit path. In particular, it's unclear to me who

[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
Hi folks. Today, bytes that come in from the network get turned into UTF-16 by the decoding process. We then turn some of them back into Latin-1 during the parsing process. Should we make changes so there’s an 8-bit path? It might be as simple as writing code that has more of an all-ASCII
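The round trip described here (bytes widened to UTF-16, then some of them narrowed back to Latin-1) can be illustrated with a minimal sketch. The function name narrowToLatin1 is hypothetical and is not a WebKit API; it only shows the check that makes narrowing possible, assuming the text is held as UTF-16 code units.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not WebKit code): tries to narrow UTF-16 text back to
// 8-bit Latin-1. This succeeds only when every code unit is <= 0xFF.
inline std::optional<std::string> narrowToLatin1(std::u16string_view text)
{
    std::string out;
    out.reserve(text.size());
    for (char16_t c : text) {
        if (c > 0xFF)
            return std::nullopt; // Needs 16 bits; cannot be stored as Latin-1.
        out.push_back(static_cast<char>(c));
    }
    return out;
}
```

An 8-bit path from the network would avoid paying for both the widening and this narrowing pass when the input never leaves the 8-bit range.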

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
There is an all-ASCII case in TextCodecUTF8::decode(). It should be keeping all ASCII data as 8-bit. TextCodecWindowsLatin1::decode() not only has an all-ASCII case, but it also upconverts to 16-bit only in a couple of rare cases. Is there some other case you don't think we are handling? -
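As a rough illustration of the "stay 8-bit unless forced to widen" idea for a Windows-1252 style decoder, here is a small sketch. It is not the TextCodecWindowsLatin1 implementation; decodeWindows1252, Decoded, and kC1Table are hypothetical names, and the table for bytes 0x80..0x9F follows the WHATWG windows-1252 index.

```cpp
#include <array>
#include <string>
#include <string_view>
#include <variant>

// Code points for bytes 0x80..0x9F in windows-1252 (WHATWG index).
// Entries <= 0xFF map a byte to itself.
constexpr std::array<char16_t, 32> kC1Table = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

inline char16_t toCodePoint(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return kC1Table[byte - 0x80];
    return static_cast<char16_t>(byte);
}

// Hypothetical result type: 8-bit storage when possible, 16-bit otherwise.
using Decoded = std::variant<std::string, std::u16string>;

inline Decoded decodeWindows1252(std::string_view bytes)
{
    // Scan for any byte whose code point does not fit in 8 bits.
    bool needs16Bit = false;
    for (unsigned char b : bytes) {
        if (toCodePoint(b) > 0xFF) {
            needs16Bit = true;
            break;
        }
    }

    // Common case: every byte maps to itself, so the input is kept as 8-bit.
    if (!needs16Bit)
        return std::string(bytes);

    // Rare case: widen the whole string to 16-bit code units.
    std::u16string wide;
    wide.reserve(bytes.size());
    for (unsigned char b : bytes)
        wide.push_back(toCodePoint(b));
    return wide;
}
```

Only bytes such as 0x80 (the euro sign, U+20AC) force the 16-bit branch; ASCII and the 0xA0..0xFF range stay in 8-bit storage.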

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
No. I retract my question. Sounds like we already have it right! Thanks for setting me straight. Maybe some day we could make a non-copying code path that points directly at the data in the SharedBuffer, but I have no idea if that'd be beneficial. -- Darin Sent from my iPhone On Mar 7,

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
The HTMLTokenizer still works in UChars. There's likely some performance to be gained by moving it to an 8-bit character type. There's some trickiness involved because HTML entities can expand to characters outside of Latin-1. Also, it's unclear if we want two tokenizers (one that's 8 bits wide
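The entity problem can be sketched as follows: an 8-bit token buffer works until a character reference such as &mdash; (U+2014) produces a code unit above 0xFF, at which point the buffer has to widen. TokenBuffer is a hypothetical class written for this illustration, not the HTMLTokenizer's actual data structure.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical buffer: stores 8-bit data until a character outside Latin-1
// is appended, then switches to 16-bit storage once and stays there.
class TokenBuffer {
public:
    void append(char16_t c)
    {
        if (m_is8Bit && c > 0xFF)
            widen();
        if (m_is8Bit)
            m_narrow.push_back(static_cast<std::uint8_t>(c));
        else
            m_wide.push_back(c);
    }

    bool is8Bit() const { return m_is8Bit; }

    std::u16string toUTF16() const
    {
        if (!m_is8Bit)
            return m_wide;
        std::u16string result;
        result.reserve(m_narrow.size());
        for (std::uint8_t b : m_narrow)
            result.push_back(static_cast<char16_t>(b)); // Zero extension.
        return result;
    }

private:
    void widen()
    {
        m_wide.reserve(m_narrow.size() + 1);
        for (std::uint8_t b : m_narrow)
            m_wide.push_back(static_cast<char16_t>(b));
        m_narrow.clear();
        m_is8Bit = false;
    }

    bool m_is8Bit = true;
    std::vector<std::uint8_t> m_narrow;
    std::u16string m_wide;
};
```

The cost of this design is a branch on every append, which is part of the trade-off against maintaining two separate tokenizers.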

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar character at a time. For 8-bit input streams, the zero extension of an LChar to
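The two approaches mentioned here can be contrasted in a short sketch. The aliases below assume LChar is an 8-bit unsigned type and UChar a 16-bit code unit; countTagOpens and zeroExtend are hypothetical names used only for illustration, not functions from WebKit.

```cpp
#include <cstddef>

using LChar = unsigned char; // Assumed 8-bit character type.
using UChar = char16_t;      // Assumed 16-bit character type.

// Approach 1: a scanner templatized on the input character type. The compiler
// generates one specialization per width, so the 8-bit path never widens.
template <typename CharType>
std::size_t countTagOpens(const CharType* data, std::size_t length)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == '<')
            ++count;
    }
    return count;
}

// Approach 2: a single 16-bit code path. Each 8-bit input character is
// zero-extended to 16 bits as it is read, keeping its numeric value.
inline UChar zeroExtend(LChar c)
{
    return static_cast<UChar>(c);
}
```

The template avoids per-character widening but duplicates generated code; the zero-extension approach keeps one code path at the price of a widening step per character.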

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
Yes, I understand how the HTML tokenizer works. :) Adam On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote: The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character

Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Brady Eidson
On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote: The various tokenizers / lexers work in various ways to handle LChar versus UChar input streams. Most of the other tokenizers are templatized on input character type. In the case of HTML, the tokenizer handles a UChar