Branch: refs/heads/main
Home: https://github.com/WebKit/WebKit
Commit: e6ee2eb472c1b527eef8a439a7c6d227955271ac
https://github.com/WebKit/WebKit/commit/e6ee2eb472c1b527eef8a439a7c6d227955271ac
Author: Michael Saboff <[email protected]>
Date: 2024-11-12 (Tue, 12 Nov 2024)
Changed paths:
A JSTests/microbenchmarks/regexp-anychar-character-classes.js
A JSTests/stress/regexp-character-class-coalescing.js
M Source/JavaScriptCore/yarr/YarrJIT.cpp
M Source/JavaScriptCore/yarr/YarrPattern.cpp
M Source/JavaScriptCore/yarr/YarrPattern.h
M Source/JavaScriptCore/yarr/create_regex_tables
Log Message:
-----------
[Yarr] Improve processing of [\s\S] character classes
https://bugs.webkit.org/show_bug.cgi?id=283003
rdar://135409524
Reviewed by Yusuke Suzuki.
The character class [\s\S], which matches all white space and non-white space
characters, is often used in lieu of ‘.’ (any character). Many developers use
[\s\S] instead of ‘.’ because ‘.’ does not match line terminators unless the
dotAll ‘s’ flag is added to the regex. The character class [\s\S] matches any
character regardless of the flags added to the expression.
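As an illustration of the difference described above, here is a minimal
sketch. The sample string and variable names are made up for this example and
are not taken from the patch or its tests.

```javascript
// Hypothetical sample input containing a line terminator.
const text = "first\nsecond";

// Without the 's' flag, '.' cannot cross the "\n", so the whole-string match fails.
const dotDefault = /^.+$/.test(text);

// With the dotAll 's' flag, '.' also matches line terminators.
const dotAll = /^.+$/s.test(text);

// [\s\S] matches every character, whichever flags are present.
const anyCharClass = /^[\s\S]+$/.test(text);

console.log(dotDefault, dotAll, anyCharClass); // false true true
```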
The processing of [\s\S] was suboptimal due to several issues.
1. The code was not coalescing the combination of \s and \S into a single
   range covering 0 through the maximum character.
2. The JIT code generated for regexes matching 8-bit strings still included
   comparisons against 16-bit code points when a character class contained
   code points greater than 255.
3. The code point U+180E, the Mongolian Vowel Separator, was changed from
   white space to non-white space in ECMAScript 2016. Our code generator for
   \S was not changed to include that code point. When \S was used by itself,
   matching was done with the _spacesData table instead of using individual
   code points and ranges. When \S was combined with \s, or any other
   character class, we weren’t matching U+180E.
4. When using the ‘v’ flag, the character class processing for an “any
   character” character class that also contained strings did not process
   those strings.
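The U+180E case in item 3 can be observed from script. The sketch below shows
the results expected from an engine that follows ECMAScript 2016 or later; the
variable names are illustrative only, and the [\Sa] class is just one
hypothetical example of \S combined with another class member.

```javascript
const mvs = "\u180E"; // MONGOLIAN VOWEL SEPARATOR

const results = {
  whiteSpace: /\s/.test(mvs),        // expected false: no longer white space
  nonWhiteSpace: /\S/.test(mvs),     // expected true: \S used by itself
  combined: /[\s\S]/.test(mvs),      // expected true: \S combined with \s
  withOtherClass: /[\Sa]/.test(mvs), // expected true: \S combined with another member
};

console.log(results);
```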
Addressed these issues by fixing the coalescing code so that it reduces the
character classes it produces to the minimum set of individual code points and
code point ranges to check. Added code to extract the 8-bit-only part of a
character class before emitting the JIT code. Added U+180E to the non-spaces
built-in character class. Changed Yarr::matchCharacterClassTermInner() to
handle strings with all character class sets, including those that match all
characters.
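The idea behind the coalescing and the 8-bit extraction can be sketched in a
few lines. This is not Yarr's C++ implementation; the function names
(invertRanges, coalesceRanges, only8Bit) and the range-pair representation are
assumptions made for this illustration only.

```javascript
const MAX_CODE_POINT = 0x10FFFF;

// Code point ranges matched by \s in ECMAScript (WhiteSpace plus LineTerminator).
const whiteSpace = [
  [0x0009, 0x000D], [0x0020, 0x0020], [0x00A0, 0x00A0], [0x1680, 0x1680],
  [0x2000, 0x200A], [0x2028, 0x2029], [0x202F, 0x202F], [0x205F, 0x205F],
  [0x3000, 0x3000], [0xFEFF, 0xFEFF],
];

// Complement of a sorted, non-overlapping range list over 0..MAX_CODE_POINT.
function invertRanges(ranges) {
  const result = [];
  let next = 0;
  for (const [lo, hi] of ranges) {
    if (lo > next) result.push([next, lo - 1]);
    next = hi + 1;
  }
  if (next <= MAX_CODE_POINT) result.push([next, MAX_CODE_POINT]);
  return result;
}

// Sort the ranges, then merge any that overlap or are directly adjacent.
function coalesceRanges(ranges) {
  const sorted = ranges.map(r => [...r]).sort((a, b) => a[0] - b[0]);
  const merged = [];
  for (const [lo, hi] of sorted) {
    const last = merged[merged.length - 1];
    if (last && lo <= last[1] + 1) last[1] = Math.max(last[1], hi);
    else merged.push([lo, hi]);
  }
  return merged;
}

// Keep only the part of each range that an 8-bit subject string can contain.
function only8Bit(ranges) {
  return ranges
    .filter(([lo]) => lo <= 0xFF)
    .map(([lo, hi]) => [lo, Math.min(hi, 0xFF)]);
}

const nonWhiteSpace = invertRanges(whiteSpace); // includes U+180E
const anyChar = coalesceRanges([...whiteSpace, ...nonWhiteSpace]);
const whiteSpace8Bit = only8Bit(whiteSpace);

console.log(anyChar);        // [[0, 1114111]]
console.log(whiteSpace8Bit); // [[9, 13], [32, 32], [160, 160]]
```

With the union of \s and \S collapsed to a single 0..max range, a matcher no
longer needs to test each sub-range in turn, and for 8-bit subjects only the
ranges at or below 255 remain to be compared.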
After these changes, the ARMv8 code size for the test regexp
/([\s\S]+?)Abc123([\s\S]+)EOL/ went from 1460 bytes down to 356 bytes.
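For reference, a short sketch of what that test regexp captures. The input
string here is hypothetical and chosen only to show the lazy and greedy groups
both spanning line terminators.

```javascript
const re = /([\s\S]+?)Abc123([\s\S]+)EOL/;

// Hypothetical multi-line input, not taken from the patch.
const input = "line one\nline two Abc123 tail\nmore tail EOL";
const match = re.exec(input);

// Group 1 is lazy: it grows only until "Abc123" can match.
console.log(JSON.stringify(match[1])); // "line one\nline two "
// Group 2 is greedy: it takes the rest, then backtracks until "EOL" can match.
console.log(JSON.stringify(match[2])); // " tail\nmore tail "
```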
Added tests for these changes. Also added a new microbenchmark to measure the
performance improvement of [\s\S] in four similar regular expressions. These
regular expressions differ in greediness and in the minimum match size. On an
M3-equipped MacBook Pro, that benchmark shows more than a 2x improvement.
                                        Baseline         FixCharacterClasses
regexp-anychar-character-classes:   112.3561+-2.4038       52.8811+-3.0929      ^ definitely 2.1247x faster
* JSTests/microbenchmarks/regexp-anychar-character-classes.js: Added.
* JSTests/stress/regexp-character-class-coalescing.js: Added.
(arrayToString):
(objectToString):
(dumpValue):
(compareArray):
(compareGroups):
(testRegExp):
(testRegExpSyntaxError):
* Source/JavaScriptCore/yarr/YarrJIT.cpp:
* Source/JavaScriptCore/yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::unicodeOpSorted):
(JSC::Yarr::CharacterClassConstructor::coalesceTables):
(JSC::Yarr::YarrPatternConstructor::atomCharacterClassEnd):
(JSC::Yarr::CharacterClass::copyOnly8BitCharacterData):
* Source/JavaScriptCore/yarr/YarrPattern.h:
* Source/JavaScriptCore/yarr/create_regex_tables:
Canonical link: https://commits.webkit.org/286509@main