Skip to site navigation (Press enter)

[webkit-changes] [280772] trunk/Source/WebCore

heycam Mon, 09 Aug 2021 04:05:40 -0700

Title: [280772] trunk/Source/WebCore

Revision: 280772
Author: [email protected]
Date: 2021-08-09 04:05:04 -0700 (Mon, 09 Aug 2021)

Log Message

Cache recently atomized all-whitespace strings for use by the HTML parser
https://bugs.webkit.org/show_bug.cgi?id=228858


Reviewed by Yusuke Suzuki.

We have a memory optimization where the HTML parser will atomize any
text node string that is all whitespace.  This can be expensive,
as we need to loop over the string's characters three times: to detect
that it is indeed all whitespace, to hash the string, and to compare
to any existing entry in the atom hash table.

Many whitespace strings encountered during parsing have a limited
form -- they have a few runs of consecutive equal whitespace
characters, e.g. it's common to see a newline followed by a number
of space characters.  We can take advantage of this by compressing
the whitespace string into a simple run-length encoded form while
we loop over the characters to check that the string is all
whitespace.  Unlike a hash, this encoded form perfectly identifies the
string content.

We add a WhitespaceCache that is owned by the Document, which stores
the most recently atomized all-whitespace string encountered by the
parser for a given length, and keys it with the encoded form.  The
parser can then look up the WhitespaceCache and find an AtomString
without the need to perform a lookup in the atom hash table.

The WhitespaceCache continues to hold on to the cached whitespace
atoms for the life of the document.  The WhitespaceCache itself
takes a bit over 1 KiB when empty, and if completely populated would
take a bit over 2 KiB plus 8 KiB of atom string data (a 1 byte string,
plus a 2 byte string, plus a 3 byte string, etc.).  This doesn't seem
high enough to be worth clearing this out just to avoid memory usage
from text nodes that have been removed from the document.

We have the Document own the WhitespaceCache so that it's not just the
initial document load, but any uses of innerHTML or other fragment
parsing APIs, that can re-use previously atomized whitespace strings.

The choice of WhitespaceCache::maximumCachedStringLength = 128 is
somewhat arbitrary.  The value must be <= 255 to avoid overflowing
the specific 64 bit code format used to encode the whitespace string.
Testing on Speedometer shows whitespace strings only up to length 26,
but it may be worth testing other content to see if it's worth
reducing maximumCachedStringLength.

When running Speedometer 2, no entries in the cache ever get replaced
by a different whitespace string of the same length, and 16 unique
whitespace string lengths are encountered.

We don't compute codes for 16 bit strings, since 16 bit strings passed
in to WhitespaceCache::lookup() are almost always the entirety of an
HTMLToken's data, and HTMLToken tracks whether any 16 bit characters are
present.  If there are any, then we know the string cannot be all
whitespace, and we use a WhitespaceMode value of NotAllWhitespace to
skip the atomization altogether.

This patch is an almost 1% improvement on Speedometer 2.

* dom/Document.cpp:
(WebCore::m_whitespaceCache):
* dom/Document.h:
(WebCore::Document::whitespaceCache):
* html/parser/HTMLConstructionSite.cpp:
(WebCore::HTMLConstructionSite::HTMLConstructionSite):
(WebCore::HTMLConstructionSite::insertTextNode):
(WebCore::WhitespaceCache::codeForString):
(WebCore::WhitespaceCache::lookup):
* html/parser/HTMLConstructionSite.h:
* html/parser/HTMLTreeBuilder.cpp:
(WebCore::HTMLTreeBuilder::insertPhoneNumberLink):
(WebCore::HTMLTreeBuilder::linkifyPhoneNumbers):
(WebCore::HTMLTreeBuilder::processCharacterBufferForInBody):
* html/parser/HTMLTreeBuilder.h:

Modified Paths

trunk/Source/WebCore/ChangeLog
trunk/Source/WebCore/dom/Document.cpp
trunk/Source/WebCore/dom/Document.h
trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp
trunk/Source/WebCore/html/parser/HTMLConstructionSite.h
trunk/Source/WebCore/html/parser/HTMLTreeBuilder.cpp
trunk/Source/WebCore/html/parser/HTMLTreeBuilder.h

Diff

Modified: trunk/Source/WebCore/ChangeLog (280771 => 280772)


--- trunk/Source/WebCore/ChangeLog	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/ChangeLog	2021-08-09 11:05:04 UTC (rev 280772)
@@ -1,5 +1,81 @@
 2021-08-09  Cameron McCormack  <[email protected]>
 
+        Cache recently atomized all-whitespace strings for use by the HTML parser
+        https://bugs.webkit.org/show_bug.cgi?id=228858
+
+        Reviewed by Yusuke Suzuki.
+
+        We have a memory optimization where the HTML parser will atomize any
+        text node string that is all whitespace.  This can be expensive,
+        as we need to loop over the string's characters three times: to detect
+        that it is indeed all whitespace, to hash the string, and to compare
+        to any existing entry in the atom hash table.
+
+        Many whitespace strings encountered during parsing have a limited
+        form -- they have a few runs of consecutive equal whitespace
+        characters, e.g. it's common to see a newline followed by a number
+        of space characters.  We can take advantage of this by compressing
+        the whitespace string into a simple run-length encoded form while
+        we loop over the characters to check that the string is all
+        whitespace.  Unlike a hash, this encoded form perfectly identifies the
+        string content.
+
+        We add a WhitespaceCache that is owned by the Document, which stores
+        the most recently atomized all-whitespace string encountered by the
+        parser for a given length, and keys it with the encoded form.  The
+        parser can then look up the WhitespaceCache and find an AtomString
+        without the need to perform a lookup in the atom hash table.
+
+        The WhitespaceCache continues to hold on to the cached whitespace
+        atoms for the life of the document.  The WhitespaceCache itself
+        takes a bit over 1 KiB when empty, and if completely populated would
+        take a bit over 2 KiB plus 8 KiB of atom string data (a 1 byte string,
+        plus a 2 byte string, plus a 3 byte string, etc.).  This doesn't seem
+        high enough to be worth clearing this out just to avoid memory usage
+        from text nodes that have been removed from the document.
+
+        We have the Document own the WhitespaceCache so that it's not just the
+        initial document load, but any uses of innerHTML or other fragment
+        parsing APIs, that can re-use previously atomized whitespace strings.
+
+        The choice of WhitespaceCache::maximumCachedStringLength = 128 is
+        somewhat arbitrary.  The value must be <= 255 to avoid overflowing
+        the specific 64 bit code format used to encode the whitespace string.
+        Testing on Speedometer shows whitespace strings only up to length 26,
+        but it may be worth testing other content to see if it's worth
+        reducing maximumCachedStringLength.
+
+        When running Speedometer 2, no entries in the cache ever get replaced
+        by a different whitespace string of the same length, and 16 unique
+        whitespace string lengths are encountered.
+
+        We don't compute codes for 16 bit strings, since 16 bit strings passed
+        in to WhitespaceCache::lookup() are almost always the entirety of an
+        HTMLToken's data, and HTMLToken tracks whether any 16 bit characters are
+        present.  If there are any, then we know the string cannot be all
+        whitespace, and we use a WhitespaceMode value of NotAllWhitespace to
+        skip the atomization altogether.
+
+        This patch is an almost 1% improvement on Speedometer 2.
+
+        * dom/Document.cpp:
+        (WebCore::m_whitespaceCache):
+        * dom/Document.h:
+        (WebCore::Document::whitespaceCache):
+        * html/parser/HTMLConstructionSite.cpp:
+        (WebCore::HTMLConstructionSite::HTMLConstructionSite):
+        (WebCore::HTMLConstructionSite::insertTextNode):
+        (WebCore::WhitespaceCache::codeForString):
+        (WebCore::WhitespaceCache::lookup):
+        * html/parser/HTMLConstructionSite.h:
+        * html/parser/HTMLTreeBuilder.cpp:
+        (WebCore::HTMLTreeBuilder::insertPhoneNumberLink):
+        (WebCore::HTMLTreeBuilder::linkifyPhoneNumbers):
+        (WebCore::HTMLTreeBuilder::processCharacterBufferForInBody):
+        * html/parser/HTMLTreeBuilder.h:
+
+2021-08-09  Cameron McCormack  <[email protected]>
+
         Increase inline size of HTMLToken::Attribute::value
         https://bugs.webkit.org/show_bug.cgi?id=228910
         <rdar://problem/81686150>

Modified: trunk/Source/WebCore/dom/Document.cpp (280771 => 280772)


--- trunk/Source/WebCore/dom/Document.cpp	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/dom/Document.cpp	2021-08-09 11:05:04 UTC (rev 280772)
@@ -92,6 +92,7 @@
 #include "HTMLBaseElement.h"
 #include "HTMLBodyElement.h"
 #include "HTMLCanvasElement.h"
+#include "HTMLConstructionSite.h"
 #include "HTMLDialogElement.h"
 #include "HTMLDocument.h"
 #include "HTMLElementFactory.h"
@@ -651,6 +652,7 @@
     , m_undoManager(UndoManager::create(*this))
     , m_editor(makeUniqueRef<Editor>(*this))
     , m_selection(makeUniqueRef<FrameSelection>(this))
+    , m_whitespaceCache(makeUniqueRef<WhitespaceCache>())
 {
     addToDocumentsMap();

Modified: trunk/Source/WebCore/dom/Document.h (280771 => 280772)


--- trunk/Source/WebCore/dom/Document.h	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/dom/Document.h	2021-08-09 11:05:04 UTC (rev 280772)
@@ -233,6 +233,7 @@
 class WebAnimation;
 class WebGL2RenderingContext;
 class WebGLRenderingContext;
+class WhitespaceCache;
 class WindowEventLoop;
 class WindowProxy;
 class XPathEvaluator;
@@ -1516,6 +1517,8 @@
 
     HTMLDialogElement* activeModalDialog() const;
 
+    WhitespaceCache& whitespaceCache() { return m_whitespaceCache; }
+
 #if ENABLE(ATTACHMENT_ELEMENT)
     void registerAttachmentIdentifier(const String&);
     void didInsertAttachmentElement(HTMLAttachmentElement&);
@@ -2207,6 +2210,7 @@
     UniqueRef<FrameSelection> m_selection;
 
     ListHashSet<Ref<Element>> m_topLayerElements;
+    UniqueRef<WhitespaceCache> m_whitespaceCache;
 };
 
 Element* eventTargetElementForDocument(Document*);

Modified: trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp (280771 => 280772)


--- trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp	2021-08-09 11:05:04 UTC (rev 280772)
@@ -234,6 +234,7 @@
     , m_redirectAttachToFosterParent(false)
     , m_maximumDOMTreeDepth(maximumDOMTreeDepth)
     , m_inQuirksMode(document.inQuirksMode())
+    , m_whitespaceCache(document.whitespaceCache())
 {
     ASSERT(m_document.isHTMLDocument() || m_document.isXHTMLDocument());
 }
@@ -246,6 +247,7 @@
     , m_redirectAttachToFosterParent(false)
     , m_maximumDOMTreeDepth(maximumDOMTreeDepth)
     , m_inQuirksMode(fragment.document().inQuirksMode())
+    , m_whitespaceCache(fragment.document().whitespaceCache())
 {
     ASSERT(m_document.isHTMLDocument() || m_document.isXHTMLDocument());
 }
@@ -574,10 +576,6 @@
     if (shouldFosterParent())
         findFosterSite(task);
 
-    // Strings composed entirely of whitespace are likely to be repeated.
-    // Turn them into AtomString so we share a single string for each.
-    bool shouldUseAtomString = whitespaceMode == AllWhitespace || (whitespaceMode == WhitespaceUnknown && isAllWhitespace(characters));
-
     unsigned currentPosition = 0;
     unsigned lengthLimit = shouldUseLengthLimit(*task.parent) ? Text::defaultLengthLimit : std::numeric_limits<unsigned>::max();
 
@@ -592,11 +590,13 @@
     }
 
     while (currentPosition < characters.length()) {
-        auto textNode = Text::createWithLengthLimit(task.parent->document(), shouldUseAtomString ? AtomString(characters).string() : characters, currentPosition, lengthLimit);
+        AtomString charactersAtom = m_whitespaceCache.lookup(characters, whitespaceMode);
+        auto textNode = Text::createWithLengthLimit(task.parent->document(), charactersAtom.isNull() ? characters : charactersAtom.string(), currentPosition, lengthLimit);
         // If we have a whole string of unbreakable characters the above could lead to an infinite loop. Exceeding the length limit is the lesser evil.
         if (!textNode->length()) {
             String substring = characters.substring(currentPosition);
-            textNode = Text::create(task.parent->document(), shouldUseAtomString ? AtomString(substring).string() : substring);
+            AtomString substringAtom = m_whitespaceCache.lookup(substring, whitespaceMode);
+            textNode = Text::create(task.parent->document(), substringAtom.isNull() ? substring : substringAtom.string());
         }
 
         currentPosition += textNode->length();
@@ -811,4 +811,111 @@
     m_taskQueue.append(WTFMove(task));
 }
 
+// Compute a 64 bit code that represents a whitespace-only string's contents.
+//
+// The code format is a sequence of four pairs of an 8 bit whitespace character
+// and an 8 bit count of that character. For example, 0x0A_02_20_08 represents
+// two newlines followed by eight space characters.
+//
+// Returns 0 if any non-whitespace characters are found.
+//
+// Returns -1 if the code would overflow due to finding more than four
+// whitespace character runs.
+template<WhitespaceMode whitespaceMode>
+uint64_t WhitespaceCache::codeForString(const String& string)
+{
+    ASSERT(whitespaceMode != NotAllWhitespace);
+    ASSERT(string.is8Bit());
+    ASSERT(!string.isEmpty());
+    ASSERT(string.length() <= maximumCachedStringLength);
+    static_assert(maximumCachedStringLength <= 0xFF, "Code format requires whitespace run length fit in one byte");
+
+    auto startOfRun = string.characters8();
+
+    if constexpr (whitespaceMode == WhitespaceUnknown) {
+        if (!isHTMLSpace(*startOfRun))
+            return 0;
+    }
+
+    LChar currentWhitespaceCharacter = *startOfRun;
+    auto character = startOfRun + 1;
+    auto end = startOfRun + string.length();
+
+    uint64_t code = 0;
+    int runsRemaining = 4;
+
+    for (;;) {
+        while (character != end && *character == currentWhitespaceCharacter)
+            ++character;
+
+        if constexpr (whitespaceMode == WhitespaceUnknown) {
+            if (character != end && !isHTMLSpace(*character))
+                return 0;
+        }
+
+        code <<= 16;
+        code |= (currentWhitespaceCharacter << 8);
+        code |= (character - startOfRun);
+
+        if (character == end)
+            return code;
+
+        if (!--runsRemaining)
+            return overflowWhitespaceCode;
+
+        startOfRun = character;
+        currentWhitespaceCharacter = *character;
+        ++character;
+
+        ASSERT(isHTMLSpace(currentWhitespaceCharacter));
+    }
+
+    return code;
 }
+
+AtomString WhitespaceCache::lookup(const String& string, WhitespaceMode whitespaceMode)
+{
+    if (whitespaceMode == NotAllWhitespace || !string.is8Bit() || string.isEmpty())
+        return AtomString();
+
+    size_t length = string.length();
+    if (length > maximumCachedStringLength)
+        return whitespaceMode == AllWhitespace || isAllWhitespace(string) ? AtomString(string) : AtomString();
+
+    uint64_t code;
+    if (whitespaceMode == AllWhitespace)
+        code = codeForString<AllWhitespace>(string);
+    else
+        code = codeForString<WhitespaceUnknown>(string);
+
+    if (!code)
+        return AtomString();
+
+    if (m_codes[length] == code) {
+        ASSERT(m_atoms[m_indexes[length]] == string);
+        WTFLogAlways("reuse code %llx", code);
+        return m_atoms[m_indexes[length]];
+    }
+
+    if (code == overflowWhitespaceCode) {
+        WTFLogAlways("override");
+        return AtomString(string);
+    }
+
+    if (m_codes[length]) {
+        WTFLogAlways("replace code %llx", code);
+        AtomString whitespaceAtom(string);
+        m_codes[length] = code;
+        m_atoms[m_indexes[length]] = whitespaceAtom;
+        return whitespaceAtom;
+    }
+
+    WTFLogAlways("new code %llx", code);
+    AtomString whitespaceAtom(string);
+    m_codes[length] = code;
+    m_indexes[length] = m_atoms.size();
+    m_atoms.append(whitespaceAtom);
+    return whitespaceAtom;
+}
+
+}

Modified: trunk/Source/WebCore/html/parser/HTMLConstructionSite.h (280771 => 280772)


--- trunk/Source/WebCore/html/parser/HTMLConstructionSite.h	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/html/parser/HTMLConstructionSite.h	2021-08-09 11:05:04 UTC (rev 280772)
@@ -85,6 +85,7 @@
 class Element;
 class HTMLFormElement;
 class JSCustomElementInterface;
+class WhitespaceCache;
 
 class HTMLConstructionSite {
     WTF_MAKE_NONCOPYABLE(HTMLConstructionSite);
@@ -219,6 +220,29 @@
     unsigned m_maximumDOMTreeDepth;
 
     bool m_inQuirksMode;
+
+    WhitespaceCache& m_whitespaceCache;
 };
 
+class WhitespaceCache {
+    WTF_MAKE_FAST_ALLOCATED;
+public:
+    WhitespaceCache() = default;
+
+    AtomString lookup(const String&, WhitespaceMode);
+
+private:
+    template<WhitespaceMode> uint64_t codeForString(const String&);
+
+    constexpr static uint64_t overflowWhitespaceCode = static_cast<uint64_t>(-1);
+    constexpr static size_t maximumCachedStringLength = 128;
+
+    // Parallel arrays storing a 64 bit code and an index into m_atoms for the
+    // most recently atomized whitespace-only string of a given length.
+    uint64_t m_codes[maximumCachedStringLength] { 0 };
+    uint8_t m_indexes[maximumCachedStringLength] { 0 };
+
+    Vector<AtomString, maximumCachedStringLength> m_atoms;
+};
+
 } // namespace WebCore

Modified: trunk/Source/WebCore/html/parser/HTMLTreeBuilder.cpp (280771 => 280772)


--- trunk/Source/WebCore/html/parser/HTMLTreeBuilder.cpp	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/html/parser/HTMLTreeBuilder.cpp	2021-08-09 11:05:04 UTC (rev 280772)
@@ -2206,7 +2206,7 @@
 
     processStartTag(WTFMove(aStartToken));
     m_tree.executeQueuedTasks();
-    m_tree.insertTextNode(string);
+    m_tree.insertTextNode(string, NotAllWhitespace);
     processEndTag(WTFMove(aEndToken));
 }
 
@@ -2215,7 +2215,7 @@
 // 2. Wraps the phone number in a tel: link.
 // 3. Goes back to step 1 if a phone number is found in the rest of the string.
 // 4. Appends the rest of the string as a text node.
-void HTMLTreeBuilder::linkifyPhoneNumbers(const String& string)
+void HTMLTreeBuilder::linkifyPhoneNumbers(const String& string, WhitespaceMode whitespaceMode)
 {
     ASSERT(TelephoneNumberDetector::isSupported());
 
@@ -2238,7 +2238,7 @@
 
         ASSERT(scannerPosition + relativeEndPosition < length);
 
-        m_tree.insertTextNode(string.substring(scannerPosition, relativeStartPosition));
+        m_tree.insertTextNode(string.substring(scannerPosition, relativeStartPosition), whitespaceMode);
         insertPhoneNumberLink(string.substring(scannerPosition + relativeStartPosition, relativeEndPosition - relativeStartPosition + 1));
 
         scannerPosition += relativeEndPosition + 1;
@@ -2248,10 +2248,10 @@
     if (scannerPosition > 0) {
         if (scannerPosition < length) {
             String after = string.substring(scannerPosition, length - scannerPosition);
-            m_tree.insertTextNode(after);
+            m_tree.insertTextNode(after, whitespaceMode);
         }
     } else
-        m_tree.insertTextNode(string);
+        m_tree.insertTextNode(string, whitespaceMode);
 }
 
 // Looks at the ancestors of the element to determine whether we're inside an element which disallows parsing phone numbers.
@@ -2430,14 +2430,15 @@
 void HTMLTreeBuilder::processCharacterBufferForInBody(ExternalCharacterTokenBuffer& buffer)
 {
     m_tree.reconstructTheActiveFormattingElements();
+    auto whitespaceMode = buffer.isAll8BitData() ? WhitespaceUnknown : NotAllWhitespace;
     String characters = buffer.takeRemaining();
 #if ENABLE(TELEPHONE_NUMBER_DETECTION) && PLATFORM(IOS_FAMILY)
     if (!isParsingFragment() && m_tree.isTelephoneNumberParsingEnabled() && shouldParseTelephoneNumbersInNode(m_tree.currentNode()) && TelephoneNumberDetector::isSupported())
-        linkifyPhoneNumbers(characters);
+        linkifyPhoneNumbers(characters, whitespaceMode);
     else
-        m_tree.insertTextNode(characters);
+        m_tree.insertTextNode(characters, whitespaceMode);
 #else
-    m_tree.insertTextNode(characters);
+    m_tree.insertTextNode(characters, whitespaceMode);
 #endif
     if (m_framesetOk && !isAllWhitespaceOrReplacementCharacters(characters))
         m_framesetOk = false;

Modified: trunk/Source/WebCore/html/parser/HTMLTreeBuilder.h (280771 => 280772)


--- trunk/Source/WebCore/html/parser/HTMLTreeBuilder.h	2021-08-09 10:53:05 UTC (rev 280771)
+++ trunk/Source/WebCore/html/parser/HTMLTreeBuilder.h	2021-08-09 11:05:04 UTC (rev 280772)
@@ -109,7 +109,7 @@
 
 #if ENABLE(TELEPHONE_NUMBER_DETECTION) && PLATFORM(IOS_FAMILY)
     void insertPhoneNumberLink(const String&);
-    void linkifyPhoneNumbers(const String&);
+    void linkifyPhoneNumbers(const String&, WhitespaceMode);
 #endif
 
     void processToken(AtomHTMLToken&&);

_______________________________________________
webkit-changes mailing list
[email protected]
https://lists.webkit.org/mailman/listinfo/webkit-changes