[webkit-changes] [WebKit/WebKit] 497d66: Batch plain-text characters in HTML tokenizer Data...

Chris Dumez Mon, 06 Apr 2026 22:42:14 -0700

  Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: 497d66f4b6748966084d33b847e333ff5b67c53f
      
https://github.com/WebKit/WebKit/commit/497d66f4b6748966084d33b847e333ff5b67c53f
  Author: Chris Dumez <[email protected]>
  Date:   2026-04-06 (Mon, 06 Apr 2026)


  Changed paths:
    M Source/WebCore/html/parser/HTMLTokenizer.cpp
    M Source/WebCore/html/parser/InputStreamPreprocessor.h
    M Source/WebCore/platform/text/SegmentedString.h

  Log Message:
  -----------
  Batch plain-text characters in HTML tokenizer DataState
https://bugs.webkit.org/show_bug.cgi?id=311554

Reviewed by Darin Adler.

The HTML tokenizer's DataState processes characters one at a time, paying
per-iteration overhead for each: advance() → peek() → 3 comparisons →
bufferCharacter() → goto. For text-heavy pages, this is the hottest path.

After buffering a character, scan ahead in the current 8-bit substring for
runs of plain text (stopping at '<', '&', '\r', '\n', '\0'). Buffer the
entire run at once via bufferCharacters() and advance the source with a
new SegmentedString::advancePastMultiple8() that skips multiple characters
without per-character function pointer dispatch.
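
As an illustration only, the scan-ahead step described above can be sketched
in standalone C++. This is not the WebKit code: ToySource, endsPlainTextRun(),
plainTextRunLength() and bufferPlainTextRun() are hypothetical names invented
for this sketch, and std::string_view stands in for the current 8-bit
substring of a SegmentedString. The only details taken from the log message
are the five stop characters and the idea of buffering and advancing once per
run instead of once per character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: true for the characters that, per the log message,
// end a plain-text run ('<', '&', '\r', '\n', '\0').
inline bool endsPlainTextRun(char c)
{
    return c == '<' || c == '&' || c == '\r' || c == '\n' || c == '\0';
}

// Returns the length of the leading run of plain-text characters in `input`.
inline std::size_t plainTextRunLength(std::string_view input)
{
    std::size_t length = 0;
    while (length < input.size() && !endsPlainTextRun(input[length]))
        ++length;
    return length;
}

// Toy stand-in for the tokenizer's input (an assumption, not SegmentedString).
struct ToySource {
    std::string_view remaining;

    // Skips `count` characters in one step, with no per-character dispatch.
    void advancePastMultiple(std::size_t count)
    {
        assert(count <= remaining.size());
        remaining.remove_prefix(count);
    }
};

// Appends the leading plain-text run to `buffer` in a single call, advances
// the source past it, and returns the number of characters consumed.
inline std::size_t bufferPlainTextRun(ToySource& source, std::string& buffer)
{
    std::size_t runLength = plainTextRunLength(source.remaining);
    buffer.append(source.remaining.data(), runLength);
    source.advancePastMultiple(runLength);
    return runLength;
}
```

In this sketch, a source holding "Hello, world<p>" has the 12 characters
before '<' buffered in one append, and the source is left positioned at "<p>"
for the regular per-character state machine to handle. The actual patch runs
the scan only after the first character has gone through the normal path; the
sketch leaves that detail out.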

Performance results (Parser/html-parser-text-heavy, 5000 paragraphs):
  Baseline: 54.0 ms median
  Patched:  38.0 ms median → ~30% less time

On the tag-heavy HTML spec (Parser/html-parser):
  Baseline: 209.0 ms median
  Patched:  207.0 ms median → ~1% faster

Real-world pages fall between these extremes depending on the ratio of
text to markup.

* Source/WebCore/html/parser/HTMLTokenizer.cpp:
(WebCore::HTMLTokenizer::processToken):
* Source/WebCore/html/parser/InputStreamPreprocessor.h:
(WebCore::InputStreamPreprocessor::skipNextNewLine const):
* Source/WebCore/platform/text/SegmentedString.h:
(WebCore::SegmentedString::currentSubstringSpan8 const):
(WebCore::SegmentedString::advancePastMultiple8):

Canonical link: https://commits.webkit.org/310687@main


