Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
Maciej,

*I* deemed using a character type template for the HTMLTokenizer as being 
unwieldy.  Given there was the existing SegmentedString input abstraction, it 
made logical sense to put the 8/16-bit coding there.  If I had moved the 
8/16 logic into the tokenizer itself, we might have needed to do 8-to-16-bit 
up-conversions when a SegmentedString had mixed bit-ness in the contained 
substrings.  Even if that wasn't the case, the patch would have been far larger 
and would likely have included tricky code for escapes.
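
To make the trade-off concrete, here is a minimal sketch of the two approaches, 
using hypothetical stand-in names rather than WebKit's actual HTMLTokenizer or 
SegmentedString interfaces:

    // Sketch only; these classes are illustrative, not WebKit's.
    #include <cstddef>

    typedef unsigned char LChar; // 8-bit (Latin-1) code unit
    typedef char16_t UChar;      // 16-bit (UTF-16) code unit

    // Option A: templatize the tokenizer on character type.  Every state
    // function is instantiated for both widths, and input that mixes 8-bit
    // and 16-bit substrings forces a switch between instantiations (or an
    // up-conversion) in the middle of the stream.
    template <typename CharType>
    class TemplatedTokenizer {
    public:
        void tokenize(const CharType* chars, size_t length)
        {
            for (size_t i = 0; i < length; ++i)
                processCharacter(chars[i]);
        }
    private:
        void processCharacter(CharType) { /* state machine */ }
    };

    // Option B (the approach taken): keep one UChar-at-a-time tokenizer and
    // let the input abstraction widen 8-bit characters as they are read.
    class SegmentedInput {
    public:
        UChar currentChar() const
        {
            return m_is8Bit ? static_cast<UChar>(m_chars8[m_index]) // zero extension
                            : m_chars16[m_index];
        }
    private:
        bool m_is8Bit { true };
        size_t m_index { 0 };
        const LChar* m_chars8 { nullptr };
        const UChar* m_chars16 { nullptr };
    };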

As I got into the middle of the 8-bit string work, I realized that not only 
could I keep performance parity, but that some of the techniques I came up with 
offered good performance improvements.  The HTMLTokenizer ended up being one of 
those cases.  This patch required a couple of reworks for performance reasons 
and garnered a lot of discussion from various parts of the WebKit community.  
See https://bugs.webkit.org/show_bug.cgi?id=90321 for the trail.  Ryosuke noted 
that this patch was responsible for a 24% improvement in the url-parser test on 
their bots (comment 47).  My final performance results are in comment 43 and 
show between 1% and 9% progression on the various HTML parser tests.

Adam, if you believe there is more work to be done in the HTMLTokenizer, file a 
bug and cc me.  I'm interested in hearing your thoughts.

- Michael

On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak m...@apple.com wrote:

 
 On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote:
 
 In retrospect, I think what I was reacting to was msaboff's statement
 that an unnamed group of people had decided that the HTML tokenizer
 was too unwieldy to have a dedicated 8-bit path.  In particular, it's
 unclear to me who made that decision.  I certainly do not consider the
 matter decided.
 
 It would be good to find out who it was that said that (or more specifically: 
 "Using a character type template approach was deemed to be too unwieldy for 
 the HTML tokenizer.") so you can talk to them about it.
 
 Michael?
 
 Regards,
 Maciej
 



Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
Oh, OK.  I misunderstood your original message to say that the project
as a whole had reached this conclusion, which certainly isn't the
case, rather than that you personally had reached that conclusion.

As for the long-term direction of the HTML parser, my guess is that
the optimum design will be to deliver the network bytes to the parser
directly on the parser thread.  On the parser thread, we can merge
charset decoding, input stream pre-processing, and tokenization to
move directly from network bytes to CompactHTMLTokens.  That approach
removes a number of copies as well as 8-bit-to-16-bit and 16-bit-to-8-bit
conversions.  Parsing directly into CompactHTMLTokens also means we
won't have to do any copies or conversions at all for well-known
strings (e.g., "div" and friends from HTMLNames).
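
A rough sketch of that data flow, where every name is an illustrative stand-in 
(CompactToken for the real CompactHTMLToken, and a toy table in place of 
HTMLNames), not a proposed implementation:

    // Illustrative only: fused decode/pre-process/tokenize on the parser
    // thread, with zero-copy tokens for well-known names.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <utility>
    #include <vector>

    struct CompactToken {
        const char* wellKnownName = nullptr; // points into a static table
        std::vector<uint8_t> characters;     // only filled for other names
    };

    // Toy stand-in for HTMLNames: well-known tag names in static storage.
    static const char* lookupWellKnownTag(const uint8_t* name, size_t length)
    {
        static const char* const table[] = { "div", "span", "p" };
        for (const char* entry : table) {
            if (std::strlen(entry) == length && !std::memcmp(entry, name, length))
                return entry;
        }
        return nullptr;
    }

    // Emitting a tag token straight from raw network bytes: a well-known
    // name needs no copy and no 8-bit-to-16-bit conversion at all.
    void emitTagToken(const uint8_t* name, size_t length, std::vector<CompactToken>& out)
    {
        CompactToken token;
        if (const char* known = lookupWellKnownTag(name, length))
            token.wellKnownName = known; // zero-copy path
        else
            token.characters.assign(name, name + length);
        out.push_back(std::move(token));
    }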

If you're about to reply complaining about the above, please save your
complaints for another time.  I realize that some parts of that design
will be difficult or impossible to implement on some ports due to
limitations on how they interact with their networking stack.  In any
case, I don't plan to implement that design anytime soon, and I'm sure
we'll have plenty of time to discuss its merits in the future.

Adam


On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff msab...@apple.com wrote:
 Maciej,

 *I* deemed using a character type template for the HTMLTokenizer as being
 unwieldy.  Given there was the existing SegmentedString input abstraction,
 it made logical sense to put the 8/16-bit coding there.  If I had moved the
 8/16 logic into the tokenizer itself, we might have needed to do
 8-to-16-bit up-conversions when a SegmentedString had mixed bit-ness in the
 contained substrings.  Even if that wasn't the case, the patch would have
 been far larger and would likely have included tricky code for escapes.

 As I got into the middle of the 8-bit string work, I realized that not only
 could I keep performance parity, but that some of the techniques I came up
 with offered good performance improvements.  The HTMLTokenizer ended up
 being one of those cases.  This patch required a couple of reworks for
 performance reasons and garnered a lot of discussion from various parts of
 the WebKit community.  See https://bugs.webkit.org/show_bug.cgi?id=90321
 for the trail.  Ryosuke noted that this patch was responsible for a 24%
 improvement in the url-parser test on their bots (comment 47).  My final
 performance results are in comment 43 and show between 1% and 9%
 progression on the various HTML parser tests.

 Adam, if you believe there is more work to be done in the HTMLTokenizer,
 file a bug and cc me.  I'm interested in hearing your thoughts.

 - Michael

 On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak m...@apple.com wrote:


 On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote:


 In retrospect, I think what I was reacting to was msaboff's statement
 that an unnamed group of people had decided that the HTML tokenizer
 was too unwieldy to have a dedicated 8-bit path.  In particular, it's
 unclear to me who made that decision.  I certainly do not consider the
 matter decided.


 It would be good to find out who it was that said that (or more
 specifically: "Using a character type template approach was deemed to be too
 unwieldy for the HTML tokenizer.") so you can talk to them about it.

 Michael?

 Regards,
 Maciej




Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Darin Adler
On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:

 As for the long-term direction of the HTML parser, my guess is that the 
 optimum design will be to deliver the network bytes to the parser directly on 
 the parser thread.

Sounds right to me.

 If you're about to reply complaining about the above, please save your 
 complaints for another time.

Huh?

-- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Michael Saboff
No complaints with the long-term direction.  I agree that it is a tall order to 
implement.

- Michael

On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:

 Oh, OK.  I misunderstood your original message to say that the project
 as a whole had reached this conclusion, which certainly isn't the
 case, rather than that you personally had reached that conclusion.
 
 As for the long-term direction of the HTML parser, my guess is that
 the optimum design will be to deliver the network bytes to the parser
 directly on the parser thread.  On the parser thread, we can merge
 charset decoding, input stream pre-processing, and tokenization to
 move directly from network bytes to CompactHTMLTokens.  That approach
 removes a number of copies as well as 8-bit-to-16-bit and 16-bit-to-8-bit
 conversions.  Parsing directly into CompactHTMLTokens also means we
 won't have to do any copies or conversions at all for well-known
 strings (e.g., "div" and friends from HTMLNames).
 
 If you're about to reply complaining about the above, please save your
 complaints for another time.  I realize that some parts of that design
 will be difficult or impossible to implement on some ports due to
 limitations on how they interact with their networking stack.  In any
 case, I don't plan to implement that design anytime soon, and I'm sure
 we'll have plenty of time to discuss its merits in the future.
 
 Adam
 
 
 On Mon, Mar 11, 2013 at 8:56 AM, Michael Saboff msab...@apple.com wrote:
 Maciej,
 
 *I* deemed using a character type template for the HTMLTokenizer as being
 unwieldy.  Given there was the existing SegmentedString input abstraction,
 it made logical sense to put the 8/16-bit coding there.  If I had moved the
 8/16 logic into the tokenizer itself, we might have needed to do
 8-to-16-bit up-conversions when a SegmentedString had mixed bit-ness in the
 contained substrings.  Even if that wasn't the case, the patch would have
 been far larger and would likely have included tricky code for escapes.

 As I got into the middle of the 8-bit string work, I realized that not only
 could I keep performance parity, but that some of the techniques I came up
 with offered good performance improvements.  The HTMLTokenizer ended up
 being one of those cases.  This patch required a couple of reworks for
 performance reasons and garnered a lot of discussion from various parts of
 the WebKit community.  See https://bugs.webkit.org/show_bug.cgi?id=90321
 for the trail.  Ryosuke noted that this patch was responsible for a 24%
 improvement in the url-parser test on their bots (comment 47).  My final
 performance results are in comment 43 and show between 1% and 9%
 progression on the various HTML parser tests.

 Adam, if you believe there is more work to be done in the HTMLTokenizer,
 file a bug and cc me.  I'm interested in hearing your thoughts.
 
 - Michael
 
 On Mar 9, 2013, at 4:24 PM, Maciej Stachowiak m...@apple.com wrote:
 
 
 On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote:
 
 
 In retrospect, I think what I was reacting to was msaboff's statement
 that an unnamed group of people had decided that the HTML tokenizer
 was too unwieldy to have a dedicated 8-bit path.  In particular, it's
 unclear to me who made that decision.  I certainly do not consider the
 matter decided.
 
 
 It would be good to find out who it was that said that (or more
 specifically: "Using a character type template approach was deemed to be too
 unwieldy for the HTML tokenizer.") so you can talk to them about it.
 
 Michael?
 
 Regards,
 Maciej
 
 



Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-11 Thread Adam Barth
On Mon, Mar 11, 2013 at 9:56 AM, Darin Adler da...@apple.com wrote:
 On Mar 11, 2013, at 9:54 AM, Adam Barth aba...@webkit.org wrote:
 If you're about to reply complaining about the above, please save your 
 complaints for another time.

 Huh?

The last time we tried to talk about changing the design of the HTML
parser on this mailing list, I got the third degree:

https://lists.webkit.org/pipermail/webkit-dev/2013-January/023271.html

I just wanted to be clear that I'm not proposing making those changes
now, and that we'll have a chance to discuss the various pros and cons of
each step as we consider making them.

Adam


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Luis de Bethencourt
On Mar 7, 2013 10:37 PM, Brady Eidson beid...@apple.com wrote:

  On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:
  The various tokenizers/lexers handle LChar versus UChar input streams in
various ways.  Most of the other tokenizers are templatized on input
character type.  In the case of HTML, the tokenizer handles a UChar character
at a time.  For 8-bit input streams, the zero extension of an LChar to a
UChar is zero cost.  There may be additional performance to be gained by
doing all other possible handling in 8 bits, but an 8-bit stream can still
contain escapes that need a UChar representation, as you point out.  Using a
character type template approach was deemed to be too unwieldy for the HTML
tokenizer.  The HTML tokenizer uses SegmentedStrings that can consist of
substrings of either LChar or UChar.  That is where the LChar-to-UChar zero
extension happens for an 8-bit substring.
 
  My research at the time showed that there were very few UTF-16-only
resources (5% IIRC), although I expect the number to grow.

 On Mar 7, 2013, at 2:16 PM, Adam Barth aba...@webkit.org wrote:
  Yes, I understand how the HTML tokenizer works.  :)

 I didn't understand these details, and I really appreciate Michael
describing them.  I'm also glad others on the mailing list had an
opportunity to get something out of this.

 ~Brady

I agree with Brady. I learned some interesting things from this thread.
Always nice to read explanations and documentation about how things work.
Valuable content.

Luis




Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Adam Barth
On Sat, Mar 9, 2013 at 12:48 PM, Luis de Bethencourt
l...@debethencourt.com wrote:
 On Mar 7, 2013 10:37 PM, Brady Eidson beid...@apple.com wrote:
  On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:
  The various tokenizers/lexers handle LChar versus UChar input streams in
  various ways.  Most of the other tokenizers are templatized on input
  character type.  In the case of HTML, the tokenizer handles a UChar
  character at a time.  For 8-bit input streams, the zero extension of an
  LChar to a UChar is zero cost.  There may be additional performance to be
  gained by doing all other possible handling in 8 bits, but an 8-bit stream
  can still contain escapes that need a UChar representation, as you point
  out.  Using a character type template approach was deemed to be too
  unwieldy for the HTML tokenizer.  The HTML tokenizer uses SegmentedStrings
  that can consist of substrings of either LChar or UChar.  That is where
  the LChar-to-UChar zero extension happens for an 8-bit substring.

  My research at the time showed that there were very few UTF-16-only
  resources (5% IIRC), although I expect the number to grow.

 On Mar 7, 2013, at 2:16 PM, Adam Barth aba...@webkit.org wrote:
  Yes, I understand how the HTML tokenizer works.  :)

 I didn't understand these details, and I really appreciate Michael
 describing them.  I'm also glad others on the mailing list had an
 opportunity to get something out of this.

 I agree with Brady. I learned some interesting things from this thread.
 Always nice to read explanations and documentation about how things work.
 Valuable content.

In retrospect, I think what I was reacting to was msaboff's statement
that an unnamed group of people had decided that the HTML tokenizer
was too unwieldy to have a dedicated 8-bit path.  In particular, it's
unclear to me who made that decision.  I certainly do not consider the
matter decided.

Adam


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-09 Thread Maciej Stachowiak

On Mar 9, 2013, at 3:05 PM, Adam Barth aba...@webkit.org wrote:
 
 In retrospect, I think what I was reacting to was msaboff's statement
 that an unnamed group of people had decided that the HTML tokenizer
 was too unwieldy to have a dedicated 8-bit path.  In particular, it's
 unclear to me who made that decision.  I certainly do not consider the
 matter decided.

It would be good to find out who it was that said that (or more specifically: 
"Using a character type template approach was deemed to be too unwieldy for the 
HTML tokenizer.") so you can talk to them about it.

Michael?

Regards,
Maciej



[webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
Hi folks.

Today, bytes that come in from the network get turned into UTF-16 by the 
decoding process. We then turn some of them back into Latin-1 during the 
parsing process. Should we make changes so there’s an 8-bit path? It might be 
as simple as writing code that has more of an all-ASCII special case in 
TextCodecUTF8 and something similar in TextCodecWindowsLatin1.

Is there something significant to be gained here? I’ve been wondering this for 
a while, so I thought I’d ask the rest of the WebKit contributors.

-- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
There is an all-ASCII case in TextCodecUTF8::decode().  It should be keeping 
all ASCII data as 8-bit.  TextCodecWindowsLatin1::decode() not only has an 
all-ASCII case, it also up-converts to 16-bit in only a couple of rare cases.  
Is there some other case you don't think we are handling?
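
For readers unfamiliar with the pattern, a simplified sketch of what such an 
all-ASCII fast path looks like in principle (this is not the actual 
TextCodecUTF8::decode() code, which handles more cases):

    // Sketch, not WebKit's implementation: scan for any non-ASCII byte;
    // if none is found, the decoded result can stay an 8-bit string.
    #include <cstddef>
    #include <cstdint>
    #include <string>

    bool tryDecodeAllASCII(const uint8_t* bytes, size_t length, std::string& out8Bit)
    {
        for (size_t i = 0; i < length; ++i) {
            if (bytes[i] & 0x80)
                return false; // multi-byte UTF-8 sequence: use the full decoder
        }
        out8Bit.assign(reinterpret_cast<const char*>(bytes), length);
        return true; // all ASCII; no 16-bit up-conversion needed
    }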

- Michael

On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:

 Hi folks.
 
 Today, bytes that come in from the network get turned into UTF-16 by the 
 decoding process. We then turn some of them back into Latin-1 during the 
 parsing process. Should we make changes so there’s an 8-bit path? It might be 
 as simple as writing code that has more of an all-ASCII special case in 
 TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
 
 Is there something significant to be gained here? I’ve been wondering this 
 for a while, so I thought I’d ask the rest of the WebKit contributors.
 
 -- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Darin Adler
No. I retract my question. Sounds like we already have it right! Thanks for 
setting me straight. 

Maybe some day we could make a non-copying code path that points directly at 
the data in the SharedBuffer, but I have no idea if that'd be beneficial. 
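
A speculative sketch of that non-copying idea, with hypothetical names (the 
lifetime question in the comment is exactly why it may not pan out):

    // Purely illustrative: a non-owning view into the shared network
    // buffer.  The buffer must outlive every view, which is the hard part.
    #include <cstddef>

    class DecodedView {
    public:
        DecodedView(const unsigned char* data, size_t length)
            : m_data(data), m_length(length) { }
        const unsigned char* data() const { return m_data; }
        size_t length() const { return m_length; }
    private:
        const unsigned char* m_data; // borrowed from the shared buffer
        size_t m_length;
    };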

-- Darin

Sent from my iPhone

On Mar 7, 2013, at 10:01 AM, Michael Saboff msab...@apple.com wrote:

 There is an all-ASCII case in TextCodecUTF8::decode().  It should be keeping 
 all ASCII data as 8-bit.  TextCodecWindowsLatin1::decode() not only has an 
 all-ASCII case, it also up-converts to 16-bit in only a couple of rare cases.  
 Is there some other case you don't think we are handling?
 
 - Michael
 
 On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:
 
 Hi folks.
 
 Today, bytes that come in from the network get turned into UTF-16 by the 
 decoding process. We then turn some of them back into Latin-1 during the 
 parsing process. Should we make changes so there’s an 8-bit path? It might 
 be as simple as writing code that has more of an all-ASCII special case in 
 TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
 
 Is there something significant to be gained here? I’ve been wondering this 
 for a while, so I thought I’d ask the rest of the WebKit contributors.
 
 -- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
The HTMLTokenizer still works in UChars.  There's likely some
performance to be gained by moving it to an 8-bit character type.
There's some trickiness involved because HTML entities can expand to
characters outside of Latin-1. Also, it's unclear if we want two
tokenizers (one that's 8 bits wide and another that's 16 bits wide) or
if we should find a way for the 8-bit tokenizer to handle, for
example, UTF-16 encoded network responses.
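
A concrete example of that trickiness: the entity below arrives as pure ASCII 
bytes but denotes a character outside Latin-1 (the helper name is 
hypothetical, used only for illustration):

    // "&#x20AC;" is eight ASCII bytes, but it expands to U+20AC (the euro
    // sign), which cannot be stored in an 8-bit LChar.  An 8-bit tokenizer
    // therefore needs an escape hatch to 16-bit output.
    #include <cstdint>

    bool entityFitsIn8Bit(uint32_t codePoint)
    {
        return codePoint <= 0xFF; // Latin-1 range
    }
    // entityFitsIn8Bit(0x20AC) == false, so the token buffer must widen.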

Adam


On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler da...@apple.com wrote:
 No. I retract my question. Sounds like we already have it right! Thanks for 
 setting me straight.

 Maybe some day we could make a non-copying code path that points directly at 
 the data in the SharedBuffer, but I have no idea if that'd be beneficial.

 -- Darin

 Sent from my iPhone

 On Mar 7, 2013, at 10:01 AM, Michael Saboff msab...@apple.com wrote:

 There is an all-ASCII case in TextCodecUTF8::decode().  It should be keeping 
 all ASCII data as 8-bit.  TextCodecWindowsLatin1::decode() not only has an 
 all-ASCII case, it also up-converts to 16-bit in only a couple of rare cases. 
  Is there some other case you don't think we are handling?

 - Michael

 On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:

 Hi folks.

 Today, bytes that come in from the network get turned into UTF-16 by the 
 decoding process. We then turn some of them back into Latin-1 during the 
 parsing process. Should we make changes so there’s an 8-bit path? It might 
 be as simple as writing code that has more of an all-ASCII special case in 
 TextCodecUTF8 and something similar in TextCodecWindowsLatin1.

 Is there something significant to be gained here? I’ve been wondering this 
 for a while, so I thought I’d ask the rest of the WebKit contributors.

 -- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Michael Saboff
The various tokenizers/lexers handle LChar versus UChar input streams in 
various ways.  Most of the other tokenizers are templatized on input character 
type.  In the case of HTML, the tokenizer handles a UChar character at a time.  
For 8-bit input streams, the zero extension of an LChar to a UChar is zero 
cost.  There may be additional performance to be gained by doing all other 
possible handling in 8 bits, but an 8-bit stream can still contain escapes 
that need a UChar representation, as you point out.  Using a character type 
template approach was deemed to be too unwieldy for the HTML tokenizer.  The 
HTML tokenizer uses SegmentedStrings that can consist of substrings of either 
LChar or UChar.  That is where the LChar-to-UChar zero extension happens for 
an 8-bit substring.
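
As a sketch of why that widening is "zero cost" (illustrative, not a WebKit 
function): promoting an unsigned 8-bit value to 16 bits compiles to a single 
zero-extending move, with no allocation or lookup.

    typedef unsigned char LChar;
    typedef char16_t UChar;

    // Zero extension: one instruction (e.g., movzx on x86), no branches.
    inline UChar widen(LChar character)
    {
        return static_cast<UChar>(character);
    }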

My research at the time showed that there were very few UTF-16-only 
resources (5% IIRC), although I expect the number to grow.

- Michael


On Mar 7, 2013, at 11:11 AM, Adam Barth aba...@webkit.org wrote:

 The HTMLTokenizer still works in UChars.  There's likely some
 performance to be gained by moving it to an 8-bit character type.
 There's some trickiness involved because HTML entities can expand to
 characters outside of Latin-1. Also, it's unclear if we want two
 tokenizers (one that's 8 bits wide and another that's 16 bits wide) or
 if we should find a way for the 8-bit tokenizer to handle, for
 example, UTF-16 encoded network responses.
 
 Adam
 
 
 On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler da...@apple.com wrote:
 No. I retract my question. Sounds like we already have it right! Thanks for 
 setting me straight.
 
 Maybe some day we could make a non-copying code path that points directly at 
 the data in the SharedBuffer, but I have no idea if that'd be beneficial.
 
 -- Darin
 
 Sent from my iPhone
 
 On Mar 7, 2013, at 10:01 AM, Michael Saboff msab...@apple.com wrote:
 
 There is an all-ASCII case in TextCodecUTF8::decode().  It should be 
 keeping all ASCII data as 8-bit.  TextCodecWindowsLatin1::decode() not 
 only has an all-ASCII case, it also up-converts to 16-bit in only a 
 couple of rare cases.  Is there some other case you don't think we are handling?
 
 - Michael
 
 On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:
 
 Hi folks.
 
 Today, bytes that come in from the network get turned into UTF-16 by the 
 decoding process. We then turn some of them back into Latin-1 during the 
 parsing process. Should we make changes so there’s an 8-bit path? It might 
 be as simple as writing code that has more of an all-ASCII special case in 
 TextCodecUTF8 and something similar in TextCodecWindowsLatin1.
 
 Is there something significant to be gained here? I’ve been wondering this 
 for a while, so I thought I’d ask the rest of the WebKit contributors.
 
 -- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Adam Barth
Yes, I understand how the HTML tokenizer works.  :)

Adam


On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:
 The various tokenizers/lexers handle LChar versus UChar input streams in 
 various ways.  Most of the other tokenizers are templatized on input 
 character type.  In the case of HTML, the tokenizer handles a UChar character 
 at a time.  For 8-bit input streams, the zero extension of an LChar to a 
 UChar is zero cost.  There may be additional performance to be gained by 
 doing all other possible handling in 8 bits, but an 8-bit stream can still 
 contain escapes that need a UChar representation, as you point out.  Using a 
 character type template approach was deemed to be too unwieldy for the HTML 
 tokenizer.  The HTML tokenizer uses SegmentedStrings that can consist of 
 substrings of either LChar or UChar.  That is where the LChar-to-UChar zero 
 extension happens for an 8-bit substring.

 My research at the time showed that there were very few UTF-16-only 
 resources (5% IIRC), although I expect the number to grow.

 - Michael


 On Mar 7, 2013, at 11:11 AM, Adam Barth aba...@webkit.org wrote:

 The HTMLTokenizer still works in UChars.  There's likely some
 performance to be gained by moving it to an 8-bit character type.
 There's some trickiness involved because HTML entities can expand to
 characters outside of Latin-1. Also, it's unclear if we want two
 tokenizers (one that's 8 bits wide and another that's 16 bits wide) or
 if we should find a way for the 8-bit tokenizer to handle, for
 example, UTF-16 encoded network responses.

 Adam


 On Thu, Mar 7, 2013 at 10:11 AM, Darin Adler da...@apple.com wrote:
 No. I retract my question. Sounds like we already have it right! Thanks for 
 setting me straight.
 
 Maybe some day we could make a non-copying code path that points directly 
 at the data in the SharedBuffer, but I have no idea if that'd be beneficial.

 -- Darin

 Sent from my iPhone

 On Mar 7, 2013, at 10:01 AM, Michael Saboff msab...@apple.com wrote:

 There is an all-ASCII case in TextCodecUTF8::decode().  It should be 
 keeping all ASCII data as 8-bit.  TextCodecWindowsLatin1::decode() not 
 only has an all-ASCII case, it also up-converts to 16-bit in only a 
 couple of rare cases.  Is there some other case you don't think we are handling?

 - Michael

 On Mar 7, 2013, at 9:29 AM, Darin Adler da...@apple.com wrote:

 Hi folks.

 Today, bytes that come in from the network get turned into UTF-16 by the 
 decoding process. We then turn some of them back into Latin-1 during the 
 parsing process. Should we make changes so there’s an 8-bit path? It 
 might be as simple as writing code that has more of an all-ASCII special 
 case in TextCodecUTF8 and something similar in TextCodecWindowsLatin1.

 Is there something significant to be gained here? I’ve been wondering 
 this for a while, so I thought I’d ask the rest of the WebKit 
 contributors.

 -- Darin


Re: [webkit-dev] Should we create an 8-bit path from the network stack to the parser?

2013-03-07 Thread Brady Eidson
 On Thu, Mar 7, 2013 at 2:14 PM, Michael Saboff msab...@apple.com wrote:
 The various tokenizers/lexers handle LChar versus UChar input streams in 
 various ways.  Most of the other tokenizers are templatized on input 
 character type.  In the case of HTML, the tokenizer handles a UChar character 
 at a time.  For 8-bit input streams, the zero extension of an LChar to a 
 UChar is zero cost.  There may be additional performance to be gained by 
 doing all other possible handling in 8 bits, but an 8-bit stream can still 
 contain escapes that need a UChar representation, as you point out.  Using a 
 character type template approach was deemed to be too unwieldy for the HTML 
 tokenizer.  The HTML tokenizer uses SegmentedStrings that can consist of 
 substrings of either LChar or UChar.  That is where the LChar-to-UChar zero 
 extension happens for an 8-bit substring.
 
 My research at the time showed that there were very few UTF-16-only 
 resources (5% IIRC), although I expect the number to grow.

On Mar 7, 2013, at 2:16 PM, Adam Barth aba...@webkit.org wrote:
 Yes, I understand how the HTML tokenizer works.  :)

I didn't understand these details, and I really appreciate Michael describing 
them.  I'm also glad others on the mailing list had an opportunity to get 
something out of this.

~Brady
