Branch: refs/heads/main
Home: https://github.com/WebKit/WebKit
Commit: a551f89f60b6372833ff3c139223bb6341b6f256
https://github.com/WebKit/WebKit/commit/a551f89f60b6372833ff3c139223bb6341b6f256
Author: Wenson Hsieh <[email protected]>
Date: 2025-12-24 (Wed, 24 Dec 2025)
Changed paths:
A LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls-expected.txt
A LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls.html
M Source/WebCore/Headers.cmake
M Source/WebCore/Sources.txt
M Source/WebCore/WebCore.xcodeproj/project.pbxproj
M Source/WebCore/page/text-extraction/TextExtraction.cpp
M Source/WebCore/page/text-extraction/TextExtractionTypes.h
A Source/WebCore/platform/StringEntropyHelpers.cpp
A Source/WebCore/platform/StringEntropyHelpers.h
M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
M Source/WebKit/Shared/TextExtractionToStringConversion.h
M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
M Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm
M Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.h
M Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.mm
M Tools/TestRunnerShared/UIScriptContext/Bindings/UIScriptController.idl
M Tools/TestRunnerShared/UIScriptContext/UIScriptController.h
M Tools/TestRunnerShared/UIScriptContext/UIScriptControllerShared.cpp
M Tools/WebKitTestRunner/cocoa/UIScriptControllerCocoa.mm
Log Message:
-----------
[AutoFill Debugging] Part 1/2: Add an option to heuristically shorten/redact
high-entropy URLs in extracted text
https://bugs.webkit.org/show_bug.cgi?id=304653
rdar://165847831
Reviewed by Richard Robinson.
Add support for a flag, `-shortenURLs`, that clients can use to opt into an
aggressive policy for shortening link `href` and image `src` values when
performing text extraction. For links, we discard all query parameters and
fragments, and any path components that are not "low-entropy" (based on the
results of a fast, very lightweight binary classifier; see below). For images,
we use the last path component only if it's "low-entropy", and otherwise fall
back to "image" (preserving any existing file extension).
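
As a rough illustration of the policy described above (not the code in this commit), the two rules could be sketched as follows. Every name here (`looksLowEntropy`, `shortenLinkHref`, `shortenImageSrc`) is hypothetical, the digit-based predicate is only a stand-in for the real classifier, and classifying the file name without its extension is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Stand-in for the real classifier (assumption): a component counts as
// "low-entropy" when it is non-empty, at most 24 characters, and digit-free.
bool looksLowEntropy(std::string_view component)
{
    if (component.empty() || component.size() > 24)
        return false;
    for (unsigned char c : component) {
        if (std::isdigit(c))
            return false;
    }
    return true;
}

// Drops everything from the first '?' or '#' onwards.
std::string_view stripQueryAndFragment(std::string_view url)
{
    return url.substr(0, url.find_first_of("?#"));
}

// Offset of the first '/' after "scheme://authority", or url.size() if none.
// Assumes an absolute URL.
std::size_t pathStart(std::string_view url)
{
    std::size_t schemeEnd = url.find("://");
    std::size_t authorityStart = schemeEnd == std::string_view::npos ? 0 : schemeEnd + 3;
    std::size_t slash = url.find('/', authorityStart);
    return slash == std::string_view::npos ? url.size() : slash;
}

// Links: keep scheme and host, keep only low-entropy path components.
std::string shortenLinkHref(std::string_view href)
{
    std::string_view url = stripQueryAndFragment(href);
    std::size_t start = pathStart(url);
    std::string result(url.substr(0, start));
    std::string_view path = url.substr(start);
    while (!path.empty()) {
        assert(path.front() == '/'); // Each iteration starts at a separator.
        path.remove_prefix(1);
        std::size_t end = path.find('/');
        std::string_view component = path.substr(0, end);
        if (looksLowEntropy(component)) {
            result += '/';
            result.append(component);
        }
        path = end == std::string_view::npos ? std::string_view() : path.substr(end);
    }
    return result;
}

// Images: keep the last path component if it is low-entropy; otherwise fall
// back to "image" plus the original file extension, if any.
std::string shortenImageSrc(std::string_view src)
{
    std::string_view url = stripQueryAndFragment(src);
    std::size_t slash = url.rfind('/');
    std::string_view name = slash == std::string_view::npos ? url : url.substr(slash + 1);
    std::size_t dot = name.rfind('.');
    std::string_view stem = name.substr(0, dot);
    std::string_view extension = dot == std::string_view::npos ? std::string_view() : name.substr(dot);
    if (looksLowEntropy(stem))
        return std::string(name);
    return "image" + std::string(extension);
}
```

Under these assumptions, `https://example.com/products/8f3a9c2e71b04d5e/reviews?session=abc#top` shortens to `https://example.com/products/reviews`, and an image named `9a8b7c6d5e4f.jpg` becomes `image.jpg`.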
Test: fast/text-extraction/debug-text-extraction-shorten-urls.html
* LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls-expected.txt: Added.
* LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls.html: Added.
Add a layout test to exercise this new option.
* Source/WebCore/Headers.cmake:
* Source/WebCore/Sources.txt:
* Source/WebCore/WebCore.xcodeproj/project.pbxproj:
* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::extractItemData):
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
Use the helpers below to strip out high-entropy path components from extracted
URLs, along with any query parameters and fragment.
* Source/WebCore/platform/StringEntropyHelpers.cpp: Added.
(WebCore::StringEntropyHelpers::symbol):
(WebCore::StringEntropyHelpers::dequantize):
(WebCore::StringEntropyHelpers::bigramWeight):
(WebCore::StringEntropyHelpers::entropyScore):
(WebCore::StringEntropyHelpers::isProbablyHumanReadable):
(WebCore::StringEntropyHelpers::lowEntropyLastPathComponent):
(WebCore::StringEntropyHelpers::removeHighEntropyComponents):
Add the fast path component classifier; see above for more details. Each
character is mapped to one of 10 character symbol types (e.g. uppercase hex,
lowercase hex, uppercase non-hex, lowercase non-hex, digits, etc.); the
classifier is a very simple single-layer perceptron that takes (as inputs)
bigrams, where each bigram consists of two adjacent symbol types. The 100
weights (one per ordered pair of symbol types) are encoded in a tiny lookup
table, where each weight is quantized to a single byte (`uint8_t`).
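
To make the description above more concrete, here is a minimal sketch of a bigram-over-symbol-types scorer with byte-quantized weights. It is not the code added in this commit. The five symbol types beyond those named above, the dequantization formula, the toy weight values, the zero threshold, and all function names are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <string_view>

// Ten symbol types; the last five are guesses made for this sketch.
enum Symbol : std::uint8_t {
    UpperHex, LowerHex, UpperNonHex, LowerNonHex, Digit,
    Hyphen, Underscore, Dot, Percent, Other, SymbolCount
};

Symbol symbolFor(char c)
{
    if (c >= '0' && c <= '9') return Digit;
    if (c >= 'a' && c <= 'f') return LowerHex;
    if (c >= 'A' && c <= 'F') return UpperHex;
    if (c >= 'a' && c <= 'z') return LowerNonHex;
    if (c >= 'A' && c <= 'Z') return UpperNonHex;
    switch (c) {
    case '-': return Hyphen;
    case '_': return Underscore;
    case '.': return Dot;
    case '%': return Percent;
    default: return Other;
    }
}

// Assumed mapping from a stored byte to a weight in [-1, 1): 128 means 0.
float dequantize(std::uint8_t q)
{
    return (static_cast<float>(q) - 128.0f) / 128.0f;
}

// Toy 10 x 10 table (made-up values): lowercase runs look readable,
// letter/digit alternation and digit runs do not.
std::array<std::uint8_t, 100> makeToyWeights()
{
    std::array<std::uint8_t, 100> table;
    table.fill(128);
    auto set = [&](Symbol a, Symbol b, std::uint8_t q) { table[a * SymbolCount + b] = q; };
    for (Symbol a : { LowerHex, LowerNonHex }) {
        for (Symbol b : { LowerHex, LowerNonHex })
            set(a, b, 192); // +0.5
    }
    for (Symbol letter : { UpperHex, LowerHex, UpperNonHex, LowerNonHex }) {
        set(letter, Digit, 32); // -0.75
        set(Digit, letter, 32); // -0.75
    }
    set(Digit, Digit, 96); // -0.25
    return table;
}

float bigramWeight(Symbol first, Symbol second)
{
    assert(first < SymbolCount && second < SymbolCount);
    static const std::array<std::uint8_t, 100> table = makeToyWeights();
    return dequantize(table[first * SymbolCount + second]);
}

// Mean weight over all adjacent symbol pairs; 0 when there are no bigrams.
float toyScore(std::string_view text)
{
    if (text.size() < 2)
        return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i + 1 < text.size(); ++i)
        sum += bigramWeight(symbolFor(text[i]), symbolFor(text[i + 1]));
    return sum / static_cast<float>(text.size() - 1);
}

bool looksReadable(std::string_view text)
{
    return toyScore(text) >= 0.0f;
}
```

With these toy weights, `products` scores above zero while `8f3a9c2e71b04d5e` scores below it. The appeal of this shape is that classification needs only one table lookup per character pair and a 100-byte table.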
* Source/WebCore/platform/StringEntropyHelpers.h: Added.
* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::centerEllipsize):
(WebKit::TextExtractionAggregator::shortenURLs const):
(WebKit::addPartsForItem):
(WebKit::addTextRepresentationRecursive):
(WebKit::normalizedURLString): Deleted.
Honor the `shortenURLs` flag by using the shortened versions of link hrefs and
image sources.
* Source/WebKit/Shared/TextExtractionToStringConversion.h:
* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
* Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm:
(-[WKWebView _extractDebugTextWithConfigurationWithoutUpdatingFilterRules:completionHandler:]):
* Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.h:
* Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.mm:
(-[_WKTextExtractionConfiguration setShortenURLs:]):
* Tools/TestRunnerShared/UIScriptContext/Bindings/UIScriptController.idl:
* Tools/TestRunnerShared/UIScriptContext/UIScriptController.h:
* Tools/TestRunnerShared/UIScriptContext/UIScriptControllerShared.cpp:
(WTR::toTextExtractionTestOptions):
Add plumbing from `UIHelper` -> `WebKitTestRunner` for the new `shortenURLs` flag.
* Tools/WebKitTestRunner/cocoa/UIScriptControllerCocoa.mm:
(WTR::createTextExtractionConfiguration):
Canonical link: https://commits.webkit.org/304927@main