Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: a6cf92c2ad27fd663e7de0ec58dfd2288aecc322
      
https://github.com/WebKit/WebKit/commit/a6cf92c2ad27fd663e7de0ec58dfd2288aecc322
  Author: Wenson Hsieh <[email protected]>
  Date:   2025-09-29 (Mon, 29 Sep 2025)

  Changed paths:
    M Source/WebCore/PAL/pal/cocoa/VisionSoftLink.mm
    M Source/WebCore/page/text-extraction/TextExtraction.cpp
    M Source/WebCore/page/text-extraction/TextExtraction.h
    M Source/WebCore/page/text-extraction/TextExtractionTypes.h
    M Source/WebKit/Platform/cocoa/ImageAnalysisUtilities.h
    M Source/WebKit/Platform/cocoa/ImageAnalysisUtilities.mm
    M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
    M Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm
    M Source/WebKit/UIProcess/API/Cocoa/WKWebViewInternal.h
    M Source/WebKit/UIProcess/API/ios/WKWebViewIOS.mm
    M Source/WebKit/UIProcess/Cocoa/TextExtraction/WKTextExtractionUtilities.h
    M Source/WebKit/UIProcess/Cocoa/TextExtraction/WKTextExtractionUtilities.mm
    M Source/WebKit/UIProcess/WebPageProxy.cpp
    M Source/WebKit/UIProcess/WebPageProxy.h
    M Source/WebKit/UIProcess/ios/PageClientImplIOS.mm
    M Source/WebKit/UIProcess/mac/PageClientImplMac.mm
    M Source/WebKit/WebProcess/WebPage/WebPage.cpp
    M Source/WebKit/WebProcess/WebPage/WebPage.h
    M Source/WebKit/WebProcess/WebPage/WebPage.messages.in
    M Tools/TestWebKitAPI/Tests/WebKitCocoa/TextExtractionTests.mm
    M Tools/TestWebKitAPI/Tests/WebKitCocoa/debug-text-extraction.html

  Log Message:
  -----------
  [AutoFill Debugging] Filter out hidden text when sanitizing text extraction results
https://bugs.webkit.org/show_bug.cgi?id=299736
rdar://161567763

Reviewed by Aditya Keerthi and Abrar Rahman Protyasha.

Augment the text extraction filtering mechanism, such that it additionally snapshots long paragraphs
of text and verifies that the rendered text in the DOM is similar to what's visually legible.

Tests: TextExtractionTests.InteractionDebugDescription

* Source/WebCore/PAL/pal/cocoa/VisionSoftLink.mm:
* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::rangeForExtractedText):

Add a helper method to map (visible text, optional node ID) to a `SimpleRange`. If the node ID is
unspecified, this just searches the entire body for the text.
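
The scoped-search-with-fallback idea can be illustrated with a minimal sketch. This is not the
WebCore implementation: it works on a flat string with character offsets instead of DOM nodes and
`SimpleRange`, and `TextRangeSketch` and `findTextSketch` are hypothetical names. It also assumes
that a miss inside the given scope yields no result, which the log does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a DOM range: a character offset and a length in the body text.
struct TextRangeSketch {
    std::size_t start;
    std::size_t length;
};

// Searches for `needle` inside `nodeScope` when one is given, or in the whole body otherwise.
// Returns the absolute range of the first match, or std::nullopt when there is none.
std::optional<TextRangeSketch> findTextSketch(const std::string& bodyText, const std::string& needle,
    std::optional<TextRangeSketch> nodeScope = std::nullopt)
{
    if (needle.empty())
        return std::nullopt;

    std::size_t begin = 0;
    std::size_t end = bodyText.size();
    if (nodeScope) {
        // Clamp the scope to the body so an out-of-range scope cannot overrun the text.
        begin = std::min(nodeScope->start, bodyText.size());
        end = std::min(begin + nodeScope->length, bodyText.size());
    }

    std::string_view haystack = std::string_view(bodyText).substr(begin, end - begin);
    std::size_t position = haystack.find(std::string_view(needle));
    if (position == std::string_view::npos)
        return std::nullopt;
    return TextRangeSketch { begin + position, needle.size() };
}
```

Restricting the search to a known scope avoids matching an identical string elsewhere in the
document, which is presumably the accuracy benefit a node ID provides.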

* Source/WebCore/page/text-extraction/TextExtraction.h:
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
* Source/WebKit/Platform/cocoa/ImageAnalysisUtilities.h:
* Source/WebKit/Platform/cocoa/ImageAnalysisUtilities.mm:
(WebKit::textRecognitionQueueSingleton):
(WebKit::recognizeText):

Add a helper method that takes a `CGImageRef`, and uses the Vision framework to scan the image for
all visible text. Most of the heavy lifting is done on a background queue, and the results are
dispatched back to the main runloop, where we invoke the completion handler.

* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
* Source/WebKit/UIProcess/API/Cocoa/WKWebView.mm:
(-[WKWebView _requestTextExtraction:completionHandler:]):
(-[WKWebView _validateText:inNode:completionHandler:]):

Implement the main internal helper method here. The results are cached until main frame navigation,
such that repeated requests to validate the same strings don't incur any snapshotting or OCR cost.
The node identifier here is optional, and we fall back to searching the entire document if
it's `nil`. Otherwise, it can make searching for the visible text a bit more efficient and
accurate.
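
As a rough illustration of the caching behavior described above, here is a minimal synchronous
sketch. The real method is asynchronous and takes a completion handler; `ValidationCacheSketch`,
its key layout, and the injected validator callback are all hypothetical and only model the
"validate once per (text, node ID), clear on navigation" idea.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical cache keyed by (text, optional node ID); not the WKWebView implementation.
class ValidationCacheSketch {
public:
    using NodeID = std::optional<std::uint64_t>;
    using Key = std::pair<std::string, NodeID>;
    // Stand-in for the expensive snapshot + OCR step.
    using Validator = std::function<std::string(const std::string&, NodeID)>;

    explicit ValidationCacheSketch(Validator validator)
        : m_validator(std::move(validator))
    {
    }

    // Returns the cached result when present; otherwise runs the validator once and stores it.
    const std::string& validate(const std::string& text, NodeID nodeID = std::nullopt)
    {
        Key key { text, nodeID };
        auto it = m_results.find(key);
        if (it == m_results.end())
            it = m_results.emplace(std::move(key), m_validator(text, nodeID)).first;
        return it->second;
    }

    // Models the reset on process swap or main frame navigation.
    void clear() { m_results.clear(); }

private:
    Validator m_validator;
    std::map<Key, std::string> m_results;
};
```

Including the node ID in the key keeps a result obtained for one node from being reused for the
same string found elsewhere; whether the actual cache does this is an assumption of the sketch.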

(-[WKWebView _clearTextExtractionFilterCache]):

Pull out common logic to reset caches related to text extraction filtering after process swap or
main frame navigation.

* Source/WebKit/UIProcess/API/Cocoa/WKWebViewInternal.h:
* Source/WebKit/UIProcess/API/ios/WKWebViewIOS.mm:
(-[WKWebView _processWillSwapOrDidExit]):

Call `-_clearTextExtractionFilterCache` when process swapping.

* Source/WebKit/UIProcess/Cocoa/TextExtraction/WKTextExtractionUtilities.h:
* Source/WebKit/UIProcess/Cocoa/TextExtraction/WKTextExtractionUtilities.mm:
(WebKit::filterTextRecursive):
(WebKit::filterText):

Extend `filterText` to additionally OCR any long piece of text that has passed first-level
validation in `TextExtractionFilter`; for any piece of text that's too dissimilar (with an arbitrary
threshold of 0.5 edit distance) from its OCR results, we replace that text with its OCR text. This
ensures that in cases where OCR is a tiny bit off (mistaking a `0` for an `O`, or a `1` for an `l`) we
just allow the original string through, but in cases where most of the text is missing or different,
we instead prefer OCR results. This leads to no change in behavior for most non-hidden text.
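
The selection rule above can be sketched as follows. This is an illustration only:
`chooseTextSketch` and `kSimilarityThreshold` are hypothetical names, the sketch reads the 0.5
threshold as a similarity score in [0, 1] below which OCR wins, and it assumes the original text is
kept when no similarity could be computed.

```cpp
#include <optional>
#include <string>

// Assumed interpretation of the log's "0.5 edit distance" threshold, as a similarity in [0, 1].
constexpr double kSimilarityThreshold = 0.5;

// Picks between the DOM-extracted text and its OCR result.
// `similarity` is std::nullopt when the strings were too short to compare.
std::string chooseTextSketch(const std::string& extractedText, const std::string& recognizedText,
    std::optional<double> similarity)
{
    if (!similarity)
        return extractedText; // Assumption: no comparison possible, so keep the original.
    // Small OCR slips keep similarity high, so the original passes through unchanged;
    // mostly missing or different text drops below the threshold and OCR is preferred.
    return *similarity < kSimilarityThreshold ? recognizedText : extractedText;
}
```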

(WebKit::computeSimilarity):

Add a helper method to compute edit distance similarity between the two given strings, or
`std::nullopt` if the strings are too short (below the `minimumLength`) for a sensible comparison.
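
One plausible shape for such a helper is sketched below, using Levenshtein distance normalized by
the longer string's length. The log does not say which edit distance or normalization the commit
uses, nor the value of `minimumLength`, so `kMinimumLength`, `levenshteinDistance`, and
`computeSimilaritySketch` are hypothetical, and the sketch operates on `std::string` bytes rather
than WebKit string types.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical default; the actual `minimumLength` is not given in the log.
constexpr std::size_t kMinimumLength = 20;

// Two-row Levenshtein distance: insertions, deletions, and substitutions each cost 1.
std::size_t levenshteinDistance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> previous(b.size() + 1);
    std::vector<std::size_t> current(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        previous[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        current[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            current[j] = std::min({ previous[j] + 1, current[j - 1] + 1, substitution });
        }
        std::swap(previous, current);
    }
    return previous[b.size()];
}

// Returns 1.0 for identical strings and 0.0 for entirely different ones, or std::nullopt
// when either string is shorter than `minimumLength` (an assumption about the cutoff rule).
std::optional<double> computeSimilaritySketch(const std::string& a, const std::string& b,
    std::size_t minimumLength = kMinimumLength)
{
    if (a.size() < minimumLength || b.size() < minimumLength)
        return std::nullopt;
    std::size_t longest = std::max(a.size(), b.size());
    if (!longest)
        return 1.0; // Two empty strings are trivially identical.
    return 1.0 - static_cast<double>(levenshteinDistance(a, b)) / static_cast<double>(longest);
}
```

Under this normalization, a single misread character in a long paragraph barely moves the score,
while text that OCR cannot see at all pushes it toward zero.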

* Source/WebKit/UIProcess/WebPageProxy.cpp:
(WebKit::WebPageProxy::takeSnapshotOfExtractedText):

Add an IPC message to snapshot an extracted text range, given text and node ID. This currently only
applies to main frame content, since text extraction only works with the main frame right now; when
we extend this extraction to subframes, we'll want to instead move this to `WebFrame(Proxy)` and
have `-_validateText:` above take a frame ID.

* Source/WebKit/UIProcess/WebPageProxy.h:
* Source/WebKit/UIProcess/ios/PageClientImplIOS.mm:
(WebKit::PageClientImpl::processDidExit):
(WebKit::PageClientImpl::processWillSwap):
(WebKit::PageClientImpl::didCommitLoadForMainFrame):
* Source/WebKit/UIProcess/mac/PageClientImplMac.mm:
(WebKit::PageClientImpl::processWillSwap):
(WebKit::PageClientImpl::didCommitLoadForMainFrame):

Make these all call into `-_clearTextExtractionFilterCache`.

* Source/WebKit/WebProcess/WebPage/WebPage.cpp:
(WebKit::WebPage::takeSnapshotOfExtractedText):

Use `rangeForExtractedText` above to map the text to a DOM range, and then use `TextIndicator` to
produce a snapshot of the range and send it back to the UI process.

* Source/WebKit/WebProcess/WebPage/WebPage.h:
* Source/WebKit/WebProcess/WebPage/WebPage.messages.in:
* Tools/TestWebKitAPI/Tests/WebKitCocoa/TextExtractionTests.mm:
(TestWebKitAPI::TEST(TextExtractionTests, InteractionDebugDescription)):
* Tools/TestWebKitAPI/Tests/WebKitCocoa/debug-text-extraction.html:

Augment an existing API test, to confirm that long runs of hidden text don't show up in the debug
text extraction output (in this case, white text on a white background).

Canonical link: https://commits.webkit.org/300716@main


