Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: a551f89f60b6372833ff3c139223bb6341b6f256
      
https://github.com/WebKit/WebKit/commit/a551f89f60b6372833ff3c139223bb6341b6f256
  Author: Wenson Hsieh <[email protected]>
  Date:   2025-12-24 (Wed, 24 Dec 2025)

  Changed paths:
    A LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls-expected.txt
    A LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls.html
    M Source/WebCore/Headers.cmake
    M Source/WebCore/Sources.txt
    M Source/WebCore/WebCore.xcodeproj/project.pbxproj
    M Source/WebCore/page/text-extraction/TextExtraction.cpp
    M Source/WebCore/page/text-extraction/TextExtractionTypes.h
    A Source/WebCore/platform/StringEntropyHelpers.cpp
    A Source/WebCore/platform/StringEntropyHelpers.h
    M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
    M Source/WebKit/Shared/TextExtractionToStringConversion.h
    M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
    M Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm
    M Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.h
    M Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.mm
    M Tools/TestRunnerShared/UIScriptContext/Bindings/UIScriptController.idl
    M Tools/TestRunnerShared/UIScriptContext/UIScriptController.h
    M Tools/TestRunnerShared/UIScriptContext/UIScriptControllerShared.cpp
    M Tools/WebKitTestRunner/cocoa/UIScriptControllerCocoa.mm

  Log Message:
  -----------
  [AutoFill Debugging] Part 1/2: Add an option to heuristically shorten/redact high-entropy URLs in extracted text
https://bugs.webkit.org/show_bug.cgi?id=304653
rdar://165847831

Reviewed by Richard Robinson.

Add support for a flag, `-shortenURLs`, that clients can use to opt into an aggressive policy for
shortening link `href` and image `src` values when performing text extraction. For links, we discard
all query parameters and fragments, and any path components that are not "low-entropy" (based on the
results of a fast, very lightweight binary classifier; see below). For images, we use the last path
component only if it's "low-entropy", and otherwise fall back to "image" (preserving any existing
file extension).
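
The policy above can be sketched roughly as follows. This is a minimal illustration using
`std::string`, not the code in this commit: `shortenLinkURL`, `shortenImageSource` and
`looksLowEntropy` are hypothetical names, `looksLowEntropy` is a crude stand-in for the real
classifier, and applying the check to the file name without its extension is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the real classifier: a component counts as
// "low-entropy" here when it is non-empty, short, and contains no digits.
static bool looksLowEntropy(const std::string& component)
{
    if (component.empty() || component.size() > 24)
        return false;
    for (char c : component) {
        if (c >= '0' && c <= '9')
            return false;
    }
    return true;
}

// Links: drop the query and fragment, then keep only the path components
// accepted by the predicate. Assumes a "scheme://host/path" shaped URL.
static std::string shortenLinkURL(const std::string& url)
{
    std::string trimmed = url.substr(0, url.find_first_of("?#"));
    size_t schemeEnd = trimmed.find("://");
    size_t pathStart = trimmed.find('/', schemeEnd == std::string::npos ? 0 : schemeEnd + 3);
    if (pathStart == std::string::npos)
        return trimmed;

    std::string result = trimmed.substr(0, pathStart);
    size_t pos = pathStart;
    while (pos < trimmed.size()) {
        size_t next = trimmed.find('/', pos + 1);
        size_t length = next == std::string::npos ? std::string::npos : next - pos - 1;
        std::string component = trimmed.substr(pos + 1, length);
        if (looksLowEntropy(component))
            result += "/" + component;
        pos = next; // npos terminates the loop.
    }
    return result;
}

// Images: keep the last path component if it passes the predicate, otherwise
// fall back to "image" plus the original file extension (if any).
static std::string shortenImageSource(const std::string& url)
{
    std::string trimmed = url.substr(0, url.find_first_of("?#"));
    size_t slash = trimmed.rfind('/');
    std::string last = slash == std::string::npos ? trimmed : trimmed.substr(slash + 1);
    size_t dot = last.rfind('.');
    std::string stem = last.substr(0, dot);
    std::string extension = dot == std::string::npos ? std::string() : last.substr(dot);
    return looksLowEntropy(stem) ? last : "image" + extension;
}
```

With this stand-in predicate, a link such as
`https://example.com/products/9f8a7c6b5d4e3f2a1b0c/reviews?session=abc123#top` is reduced to
`https://example.com/products/reviews`, and an image source ending in `a1b2c3d4e5f6.png` becomes
`image.png`.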

Test: fast/text-extraction/debug-text-extraction-shorten-urls.html

* LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls-expected.txt: Added.
* LayoutTests/fast/text-extraction/debug-text-extraction-shorten-urls.html: Added.

Add a layout test to exercise this new option.

* Source/WebCore/Headers.cmake:
* Source/WebCore/Sources.txt:
* Source/WebCore/WebCore.xcodeproj/project.pbxproj:
* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::extractItemData):
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:

Use the helpers below to strip out high-entropy path components from extracted URLs, along with any
query parameters and fragment.

* Source/WebCore/platform/StringEntropyHelpers.cpp: Added.
(WebCore::StringEntropyHelpers::symbol):
(WebCore::StringEntropyHelpers::dequantize):
(WebCore::StringEntropyHelpers::bigramWeight):
(WebCore::StringEntropyHelpers::entropyScore):
(WebCore::StringEntropyHelpers::isProbablyHumanReadable):
(WebCore::StringEntropyHelpers::lowEntropyLastPathComponent):
(WebCore::StringEntropyHelpers::removeHighEntropyComponents):

Add the fast classifier for path components; see above for more details. Each character is mapped to
one of 10 character symbol types (e.g. uppercase hex, lowercase hex, uppercase non-hex, lowercase
non-hex, digits, etc.); the classifier is a very simple single-layer perceptron that takes (as
inputs) bigrams, where each bigram consists of two adjacent symbol types. The 100 weights (one per
possible bigram) are encoded in a tiny lookup table, where each weight is quantized to a single byte
(`uint8_t`).
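
As a rough illustration of the shape of such a classifier (not the implementation added here), the
sketch below maps characters to 10 symbol types, looks up one quantized weight per adjacent pair,
and thresholds the averaged sum. Everything in it is an assumption made for the example: the symbol
types beyond the five listed above, the dequantization formula, the averaging, the threshold, and
above all the weights, which are hand-picked placeholders rather than the trained values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Ten symbol types. The first five follow the description above; the rest are guesses.
enum Symbol : uint8_t {
    UpperHex, LowerHex, UpperNonHex, LowerNonHex, Digit,
    Hyphen, Underscore, Dot, Percent, Other, SymbolCount
};

static Symbol symbolFor(char c)
{
    if (c >= '0' && c <= '9') return Digit;
    if (c >= 'a' && c <= 'f') return LowerHex;
    if (c >= 'g' && c <= 'z') return LowerNonHex;
    if (c >= 'A' && c <= 'F') return UpperHex;
    if (c >= 'G' && c <= 'Z') return UpperNonHex;
    if (c == '-') return Hyphen;
    if (c == '_') return Underscore;
    if (c == '.') return Dot;
    if (c == '%') return Percent;
    return Other;
}

// Assumed quantization scheme: 128 maps to 0, and each step is 1/32.
static float dequantizeWeight(uint8_t quantized)
{
    return (static_cast<int>(quantized) - 128) / 32.0f;
}

// Placeholder 10 x 10 table: NOT the trained weights from the commit.
static std::array<uint8_t, 100> makePlaceholderWeights()
{
    std::array<uint8_t, 100> table;
    table.fill(128); // Neutral by default.
    auto set = [&](Symbol a, Symbol b, uint8_t quantized) {
        table[a * SymbolCount + b] = quantized;
        table[b * SymbolCount + a] = quantized;
    };
    set(LowerNonHex, LowerNonHex, 176); // Word-like pairs score positively.
    set(LowerNonHex, LowerHex, 176);
    set(LowerHex, LowerHex, 144);
    set(Digit, LowerHex, 64);           // Hex-digit soup scores negatively.
    set(Digit, UpperHex, 64);
    set(Digit, Digit, 96);
    return table;
}

// Single-layer perceptron over symbol bigrams: average the weights of all
// adjacent pairs and accept the string when the average is positive.
static bool isLikelyReadable(const std::string& text)
{
    static const std::array<uint8_t, 100> weights = makePlaceholderWeights();
    if (text.size() < 2)
        return false;

    float sum = 0;
    for (size_t i = 0; i + 1 < text.size(); ++i)
        sum += dequantizeWeight(weights[symbolFor(text[i]) * SymbolCount + symbolFor(text[i + 1])]);
    return sum / static_cast<float>(text.size() - 1) > 0;
}
```

With these placeholder weights, `isLikelyReadable("products")` is true and
`isLikelyReadable("9f8a7c6b5d4e")` is false. Storing one byte per bigram keeps the whole table at
100 bytes, which is the main appeal of quantizing the weights.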

* Source/WebCore/platform/StringEntropyHelpers.h: Added.
* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::centerEllipsize):
(WebKit::TextExtractionAggregator::shortenURLs const):
(WebKit::addPartsForItem):
(WebKit::addTextRepresentationRecursive):
(WebKit::normalizedURLString): Deleted.

Honor the `shortenURLs` flag by using the shortened versions of link hrefs and image sources.

* Source/WebKit/Shared/TextExtractionToStringConversion.h:
* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
* Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm:
(-[WKWebView _extractDebugTextWithConfigurationWithoutUpdatingFilterRules:completionHandler:]):
* Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.h:
* Source/WebKit/UIProcess/API/Cocoa/_WKTextExtraction.mm:
(-[_WKTextExtractionConfiguration setShortenURLs:]):
* Tools/TestRunnerShared/UIScriptContext/Bindings/UIScriptController.idl:
* Tools/TestRunnerShared/UIScriptContext/UIScriptController.h:
* Tools/TestRunnerShared/UIScriptContext/UIScriptControllerShared.cpp:
(WTR::toTextExtractionTestOptions):

Add plumbing from `UIHelper` -> `WebKitTestRunner` for the new `shortenURLs` flag.

* Tools/WebKitTestRunner/cocoa/UIScriptControllerCocoa.mm:
(WTR::createTextExtractionConfiguration):

Canonical link: https://commits.webkit.org/304927@main


