Branch: refs/heads/main
Home: https://github.com/WebKit/WebKit
Commit: 497d66f4b6748966084d33b847e333ff5b67c53f
https://github.com/WebKit/WebKit/commit/497d66f4b6748966084d33b847e333ff5b67c53f
Author: Chris Dumez <[email protected]>
Date: 2026-04-06 (Mon, 06 Apr 2026)
Changed paths:
M Source/WebCore/html/parser/HTMLTokenizer.cpp
M Source/WebCore/html/parser/InputStreamPreprocessor.h
M Source/WebCore/platform/text/SegmentedString.h
Log Message:
-----------
Batch plain-text characters in HTML tokenizer DataState
https://bugs.webkit.org/show_bug.cgi?id=311554
Reviewed by Darin Adler.
The HTML tokenizer's DataState processes characters one at a time, paying
per-iteration overhead for each: advance() → peek() → 3 comparisons →
bufferCharacter() → goto. For text-heavy pages, this is the hottest path.
After buffering a character, scan ahead in the current 8-bit substring for
runs of plain text (stopping at '<', '&', '\r', '\n', '\0'). Buffer the
entire run at once via bufferCharacters() and advance the source with a
new SegmentedString::advancePastMultiple8() that skips multiple characters
without per-character function pointer dispatch.
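The run scan described above can be sketched in standalone C++. This is only an illustration under stated assumptions, not the code from this patch: it uses std::string_view and std::string in place of WebKit's SegmentedString and token buffer, and the names isRunTerminator, plainTextRunLength and bufferPlainTextRun are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: true for the characters the commit message lists as
// ending a plain-text run ('<', '&', '\r', '\n', '\0').
inline bool isRunTerminator(unsigned char c)
{
    return c == '<' || c == '&' || c == '\r' || c == '\n' || c == '\0';
}

// Hypothetical helper: number of leading characters in `input` that can be
// buffered as one run without consulting the tokenizer state machine.
inline std::size_t plainTextRunLength(std::string_view input)
{
    std::size_t length = 0;
    while (length < input.size() && !isRunTerminator(static_cast<unsigned char>(input[length])))
        ++length;
    return length;
}

// Appends the whole run to `buffer` with a single append and returns how many
// characters the caller should skip in its source. A zero return means the
// next character needs the regular per-character path.
inline std::size_t bufferPlainTextRun(std::string_view input, std::string& buffer)
{
    std::size_t length = plainTextRunLength(input);
    assert(length <= input.size());
    buffer.append(input.data(), length);
    return length;
}
```

In this sketch the caller would advance its input by the returned count in one step, which stands in for the role the message gives to advancePastMultiple8(); the real function's signature is not shown in this email and is not assumed here.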
Performance results (Parser/html-parser-text-heavy, 5000 paragraphs):
Baseline: 54.0 ms median
Patched: 38.0 ms median → ~30% less time (about 1.42x throughput)
On the tag-heavy HTML spec (Parser/html-parser):
Baseline: 209.0 ms median
Patched: 207.0 ms median → ~1% faster
Real-world pages fall between these extremes depending on the ratio of
text to markup.
* Source/WebCore/html/parser/HTMLTokenizer.cpp:
(WebCore::HTMLTokenizer::processToken):
* Source/WebCore/html/parser/InputStreamPreprocessor.h:
(WebCore::InputStreamPreprocessor::skipNextNewLine const):
* Source/WebCore/platform/text/SegmentedString.h:
(WebCore::SegmentedString::currentSubstringSpan8 const):
(WebCore::SegmentedString::advancePastMultiple8):
Canonical link: https://commits.webkit.org/310687@main