Title: [221052] trunk
Revision
221052
Author
[email protected]
Date
2017-08-22 15:43:08 -0700 (Tue, 22 Aug 2017)

Log Message

Implement Unicode RegExp support in the YARR JIT
https://bugs.webkit.org/show_bug.cgi?id=174646

Reviewed by Filip Pizlo.

Source/JavaScriptCore:

This support is implemented only for 64-bit platforms.  It wouldn't be too hard to add support
for 32-bit platforms that have a reasonable number of spare registers.  This code slightly refactors
register usage to reduce the number of callee-save registers used for non-Unicode expressions.
For Unicode expressions, several more registers are used to hold constant values for
processing surrogate pairs and for discerning whether a character belongs to the Basic
Multilingual Plane (BMP) or one of the Supplementary Planes.

This implements JIT support for Unicode expressions in much the same way as the interpreter.
Just like in the interpreter, backtracking code uses more space on the stack to save positions.
Moved the BackTrackInfo* structs to YarrPattern as standalone structs.  Added xxxIndex()
functions to each of these to simplify how the JIT code reads and writes the structure fields.
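
As an illustration of the xxxIndex() idea, here is a minimal sketch for one of the structs.
The field layout and the names beginIndex() and matchAmountIndex() come from this change; the
function bodies and the frameByteOffset() helper are assumptions made for the example, not
necessarily the committed code.

```cpp
#include <stddef.h>
#include <stdint.h>

// Two-slot backtrack record, laid out as in the struct moved out of the interpreter.
struct BackTrackInfoPatternCharacter {
    uintptr_t begin; // Only needed for Unicode patterns.
    uintptr_t matchAmount;

    // Assumed implementation: slot index = byte offset of the field / slot size.
    static unsigned beginIndex()
    {
        return offsetof(BackTrackInfoPatternCharacter, begin) / sizeof(uintptr_t);
    }

    static unsigned matchAmountIndex()
    {
        return offsetof(BackTrackInfoPatternCharacter, matchAmount) / sizeof(uintptr_t);
    }
};

// Hypothetical helper: byte offset of a field within the stack frame, given the
// first frame slot assigned to the term and the field's slot index.
size_t frameByteOffset(unsigned frameLocation, unsigned slotIndex)
{
    return (frameLocation + slotIndex) * sizeof(uintptr_t);
}
```

With helpers like these, JIT code can address a field as "frame location plus index" rather
than hard-coding slot numbers.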

Because reading a surrogate pair and transforming it into a single code point takes a
little processing, the code that reads a Unicode character is emitted as a
leaf function at the end of the JIT'ed code.  The calling convention for
"tryReadUnicodeCharacterHelper()" is non-standard, given that the rest of the code assumes
that argument values stay in argument registers for most of the generated code.
That helper takes the starting character address in one register, regUnicodeInputAndTrail,
and uses another dedicated temporary register, regUnicodeTemp.  The result is typically
returned in regT0.  If another return register is requested, we'll create an inline copy of
that function.
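
For reference, the work the helper does can be sketched in plain C++.  This is not the
generated code: readUnicodeCharacter() and its signature are hypothetical, and the constants
are named after the dedicated registers with the standard UTF-16 values assumed for them.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// Standard UTF-16 constants; the names mirror the JIT's constant registers.
constexpr uint32_t surrogateTagMask = 0xFC00;        // Top six bits identify a surrogate.
constexpr uint32_t leadingSurrogateTag = 0xD800;     // 0xD800..0xDBFF
constexpr uint32_t trailingSurrogateTag = 0xDC00;    // 0xDC00..0xDFFF
constexpr uint32_t supplementaryPlanesBase = 0x10000;

// Reads one code point starting at input[index] and advances index by one or
// two code units.  An unpaired surrogate is returned unchanged.
uint32_t readUnicodeCharacter(const char16_t* input, size_t length, size_t& index)
{
    assert(index < length);
    uint32_t lead = input[index++];
    if ((lead & surrogateTagMask) != leadingSurrogateTag || index == length)
        return lead;

    uint32_t trail = input[index];
    if ((trail & surrogateTagMask) != trailingSurrogateTag)
        return lead;

    ++index;
    return supplementaryPlanesBase
        + ((lead - leadingSurrogateTag) << 10)
        + (trail - trailingSurrogateTag);
}
```

Keeping the mask, the two tags and the plane base in registers means each of these tests
costs a single instruction in the generated code, which is presumably why they are pinned.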

Added a new flag to CharacterClass to signify whether a class has non-BMP characters.  This flag
is used in optimizeAlternative(), where we swap the order of a fixed character class term with
a fixed character term that immediately follows it.  Since a non-BMP character class may
increment "index" when matching, such a class must be matched first, before trying to match a
fixed character term later in the string.
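
A rough sketch of the flag and the check it enables, under simplifying assumptions:
SimpleCharacterClass, codeUnitsConsumed() and canSwapWithFollowingCharacter() are hypothetical
stand-ins, not the Yarr types.  Only the 0x10000 threshold is taken from this change
(see create_regex_tables).

```cpp
#include <stdint.h>
#include <vector>

struct CharacterRange {
    uint32_t begin;
    uint32_t end;
};

// Simplified stand-in for a character class: single matches plus ranges.
struct SimpleCharacterClass {
    std::vector<uint32_t> matches;
    std::vector<CharacterRange> ranges;

    // True if any member lies outside the Basic Multilingual Plane.
    bool hasNonBMPCharacters() const
    {
        for (uint32_t ch : matches) {
            if (ch >= 0x10000)
                return true;
        }
        for (const CharacterRange& range : ranges) {
            if (range.end >= 0x10000)
                return true;
        }
        return false;
    }
};

// UTF-16 code units that "index" advances by when ch is consumed.
unsigned codeUnitsConsumed(uint32_t ch)
{
    return ch >= 0x10000 ? 2 : 1;
}

// The swap is only safe when the class always consumes exactly one code unit,
// because the following character is then at a known fixed offset.
bool canSwapWithFollowingCharacter(const SimpleCharacterClass& characterClass)
{
    return !characterClass.hasNonBMPCharacters();
}
```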

Given the usefulness of the LEA instruction on X86 for creating a single pointer value from a
base with index and offset, which the YARR JIT uses heavily, I added a new MacroAssembler
function, getEffectiveAddress64(), with an ARM64 implementation.  It just calls x86Lea64()
on X86-64.  Also added an ImplicitAddress version of load16Unaligned().
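
The value getEffectiveAddress64() produces can be modeled in ordinary C++ as follows.  The
BaseIndex struct here is a simplified stand-in holding plain integers rather than register
IDs, and effectiveAddress64() is a hypothetical name; the two-step shape follows the ARM64
implementation (shifted add, then an optional offset add).

```cpp
#include <stdint.h>

// Simplified stand-in: register contents instead of register IDs.
struct BaseIndex {
    uint64_t base;
    uint64_t index;
    unsigned scale;  // Log2 of the element size, used as a left-shift amount.
    int32_t offset;
};

// Computes base + (index << scale) + offset without touching memory.
uint64_t effectiveAddress64(const BaseIndex& address)
{
    // Corresponds to a single ADD with an LSL-shifted index on ARM64.
    uint64_t dest = address.base + (address.index << address.scale);
    // The offset needs a second add, which is skipped when it is zero.
    if (address.offset)
        dest += static_cast<int64_t>(address.offset);
    return dest;
}
```

On X86-64 the same result comes from one LEA instruction, which is why that port can simply
forward to x86Lea64().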

* assembler/MacroAssemblerARM64.h:
(JSC::MacroAssemblerARM64::load16Unaligned):
(JSC::MacroAssemblerARM64::getEffectiveAddress64):
* assembler/MacroAssemblerX86Common.h:
(JSC::MacroAssemblerX86Common::load16Unaligned):
(JSC::MacroAssemblerX86Common::load16):
* assembler/MacroAssemblerX86_64.h:
(JSC::MacroAssemblerX86_64::getEffectiveAddress64):
* create_regex_tables:
* runtime/RegExp.cpp:
(JSC::RegExp::compile):
* yarr/YarrInterpreter.cpp:
* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
(JSC::Yarr::YarrGenerator::readCharacter):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::matchAssertionWordchar):
(JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generate):
(JSC::Yarr::YarrGenerator::backtrack):
(JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::generateReturn):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):
* yarr/YarrJIT.h:
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
(JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
(JSC::Yarr::BackTrackInfoBackReference::beginIndex):
(JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
(JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
(JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):

LayoutTests:

Updated tests.

* js/regexp-unicode-expected.txt:
* js/script-tests/regexp-unicode.js:

Diff

Modified: trunk/LayoutTests/ChangeLog (221051 => 221052)


--- trunk/LayoutTests/ChangeLog	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/LayoutTests/ChangeLog	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,3 +1,15 @@
+2017-08-22  Michael Saboff  <[email protected]>
+
+        Implement Unicode RegExp support in the YARR JIT
+        https://bugs.webkit.org/show_bug.cgi?id=174646
+
+        Reviewed by Filip Pizlo.
+
+        Updated tests.
+
+        * js/regexp-unicode-expected.txt:
+        * js/script-tests/regexp-unicode.js:
+
 2017-08-22  Brent Fulgham  <[email protected]>
 
         Unreviewed test fix after r221017.

Modified: trunk/LayoutTests/js/regexp-unicode-expected.txt (221051 => 221052)


--- trunk/LayoutTests/js/regexp-unicode-expected.txt	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/LayoutTests/js/regexp-unicode-expected.txt	2017-08-22 22:43:08 UTC (rev 221052)
@@ -88,6 +88,8 @@
 PASS re2.test("𒍅") is true
 PASS /πŒ†{2}/u.test("πŒ†πŒ†") is true
 PASS /πŒ†{2}/u.test("πŒ†πŒ†") is true
+PASS "𐐅𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
+PASS "𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
 PASS "𐐁𐐁𐐀".match(/𐐁{1,3}/u)[0] is "𐐁𐐁"
 PASS "𐐁𐐩".match(/𐐁{1,3}/iu)[0] is "𐐁𐐩"
 PASS "𐐁𐐩πͺ𐐩".match(/𐐁{1,}/iu)[0] is "𐐁𐐩"
@@ -116,6 +118,7 @@
 PASS "𐐀".match(/\d*/u)[0].length is 0
 PASS "123𐐀".match(/\d*/u)[0] is "123"
 PASS "12X3𐐀4".match(/\d{0,1}/ug) is ["1", "2", "", "3", "", "4", ""]
+PASS "𐐂𐐅𐐅𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
 PASS match3[0] is "a𐐐𐐐b"
 PASS match3[1] is undefined.
 PASS match3[2] is "a𐐐𐐐b"
@@ -136,9 +139,28 @@
 PASS /\u{4}/.test("u") is false
 PASS /\u{4}/.test("uuuu") is true
 PASS "800-555-1212".match(/[0-9\-]*/u)[0].length is 12
+PASS "🂡🃑🂸🃉🃚".match(re7)[0] is "🂡🃑"
+PASS "🂡🃑🂱🃉🃚".match(re7)[0] is "🂡🃑🂱"
+PASS "🂡🃑🂱🃁🃚".match(re7)[0] is "🂡🃑🂱🃁"
+PASS "🂣🃑🂱🃁🃚".match(re7)[0] is "🃑🂱🃁"
+PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]*a|[𐌐𐌑]*./iu)[0] is "𐌑𐌐𐌑"
+PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]*?a|[𐌐𐌑]*?./iu)[0] is "𐌑"
+PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]+a|[𐌐𐌑]+./iu)[0] is "𐌑𐌐𐌑"
+PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]+?a|[𐌐𐌑]+?./iu)[0] is "𐌑𐌐"
+PASS "C83|НАЧАТЬ".match(re8)[0] is "C83|НАЧАТЬ"
+PASS "This.Is.16.Chars|НАЧАТЬ".match(re8)[0] is "This.Is.16.Chars|НАЧАТЬ"
+PASS "Testing\nሴ 1 2 3".match(/^[က-𐃿] 1 2 3/um)[0] is "ሴ 1 2 3"
+PASS "Testing\n𐃰 1 2 3".match(/^[က-𐃿] 1 2 3/um)[0] is "𐃰 1 2 3"
+PASS "g\nሴ 1 2 3".match(/g\n^[က-𐃿] 1 2 3/um)[0] is "g\nሴ 1 2 3"
+PASS "g\n𐃰 1 2 3".match(/g\n^[က-𐃿] 1 2 3/um)[0] is "g\n𐃰 1 2 3"
+PASS "Testing ሴ\n1 2 3".match(/Testing [က-𐃿]$/um)[0] is "Testing ሴ"
+PASS "Testing 𐃰\n1 2 3".match(/Testing [က-𐃿]$/um)[0] is "Testing 𐃰"
+PASS "Testing ሴ\n1 2 3".match(/g [က-𐃿]$\n1/um)[0] is "g ሴ\n1"
+PASS "Testing 𐃰\n1 2 3".match(/g [က-𐃿]$\n1/um)[0] is "g 𐃰\n1"
 PASS "this is ba test".match(/is b\cha test/u)[0].length is 11
 PASS new RegExp("\\/", "u").source is "\\/"
 PASS r = new RegExp("\\u{110000}", "u") threw exception SyntaxError: Invalid regular expression: invalid unicode {} escape.
+PASS r = new RegExp("𐐅{2147483648}", "u") threw exception SyntaxError: Invalid regular expression: pattern exceeds string length limits.
 PASS r = new RegExp("\\-", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.
 PASS r = new RegExp("\\a", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.
 PASS r = new RegExp("[\\a]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern.

Modified: trunk/LayoutTests/js/script-tests/regexp-unicode.js (221051 => 221052)


--- trunk/LayoutTests/js/script-tests/regexp-unicode.js	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/LayoutTests/js/script-tests/regexp-unicode.js	2017-08-22 22:43:08 UTC (rev 221052)
@@ -124,6 +124,8 @@
 // Check quantified matches
 shouldBeTrue('/\u{1d306}{2}/u.test("\u{1d306}\u{1d306}")');
 shouldBeTrue('/\uD834\uDF06{2}/u.test("\uD834\uDF06\uD834\uDF06")');
+shouldBe('"\u{10405}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
+shouldBe('"\u{10402}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
 shouldBe('"\u{10401}\u{10401}\u{10400}".match(/\u{10401}{1,3}/u)[0]', '"\u{10401}\u{10401}"');
 shouldBe('"\u{10401}\u{10429}".match(/\u{10401}{1,3}/iu)[0]', '"\u{10401}\u{10429}"');
 shouldBe('"\u{10401}\u{10429}\u{1042a}\u{10429}".match(/\u{10401}{1,}/iu)[0]', '"\u{10401}\u{10429}"');
@@ -154,6 +156,7 @@
 shouldBe('"\u{10400}".match(/\\d*/u)[0].length', '0');
 shouldBe('"123\u{10400}".match(/\\d*/u)[0]', '"123"');
 shouldBe('"12X3\u{10400}4".match(/\\d{0,1}/ug)', '["1", "2", "", "3", "", "4", ""]');
+shouldBe('"\u{10402}\u{10405}\u{10405}\u{10402}\u{10405}\u{10405}\u{10405}".match(/\u{10405}{3}/u)[0]', '"\u{10405}\u{10405}\u{10405}"');
 
 var re3 = new RegExp("(a\u{10410}*bc)|(a\u{10410}*b)", "u");
 var match3 = "a\u{10410}\u{10410}b".match(re3);
@@ -191,6 +194,31 @@
 // Check that \- escape works in a character class for a unicode pattern
 shouldBe('"800-555-1212".match(/[0-9\\-]*/u)[0].length', '12');
 
+// Check that counted Unicode character classes work.
+var re7 = new RegExp("(?:[\u{1f0a1}\u{1f0b1}\u{1f0d1}\u{1f0c1}]{2,4})", "u");
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b8}\u{1f0c9}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}"');
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c9}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}\u{1f0b1}"');
+shouldBe('"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c1}\u{1f0da}".match(re7)[0]', '"\u{1f0a1}\u{1f0d1}\u{1f0b1}\u{1f0c1}"');
+shouldBe('"\u{1f0a3}\u{1f0d1}\u{1f0b1}\u{1f0c1}\u{1f0da}".match(re7)[0]', '"\u{1f0d1}\u{1f0b1}\u{1f0c1}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]*a|[\u{10310}\u{10311}]*./iu)[0]', '"\u{10311}\u{10310}\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]*?a|[\u{10310}\u{10311}]*?./iu)[0]', '"\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]+a|[\u{10310}\u{10311}]+./iu)[0]', '"\u{10311}\u{10310}\u{10311}"');
+shouldBe('"\u{10311}\u{10310}\u{10311}".match(/[\u{10301}\u{10311}]+?a|[\u{10310}\u{10311}]+?./iu)[0]', '"\u{10311}\u{10310}"');
+
+var re8 = new  RegExp("^([0-9a-z\.]{3,16})\\|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}", "ui");
+shouldBe('"C83|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}".match(re8)[0]', '"C83|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}"');
+shouldBe('"This.Is.16.Chars|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}".match(re8)[0]', '"This.Is.16.Chars|\u{041d}\u{0410}\u{0427}\u{0410}\u{0422}\u{042c}"');
+
+// Check that unicode characters work with ^ and $ for multiline patterns
+shouldBe('"Testing\\n\u{1234} 1 2 3".match(/^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"\u{1234} 1 2 3"');
+shouldBe('"Testing\\n\u{100f0} 1 2 3".match(/^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"\u{100f0} 1 2 3"');
+shouldBe('"g\\n\u{1234} 1 2 3".match(/g\\n^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"g\\n\u{1234} 1 2 3"');
+shouldBe('"g\\n\u{100f0} 1 2 3".match(/g\\n^[\u{1000}-\u{100ff}] 1 2 3/um)[0]', '"g\\n\u{100f0} 1 2 3"');
+shouldBe('"Testing \u{1234}\\n1 2 3".match(/Testing [\u{1000}-\u{100ff}]$/um)[0]', '"Testing \u{1234}"');
+shouldBe('"Testing \u{100f0}\\n1 2 3".match(/Testing [\u{1000}-\u{100ff}]$/um)[0]', '"Testing \u{100f0}"');
+shouldBe('"Testing \u{1234}\\n1 2 3".match(/g [\u{1000}-\u{100ff}]$\\n1/um)[0]', '"g \u{1234}\\n1"');
+shouldBe('"Testing \u{100f0}\\n1 2 3".match(/g [\u{1000}-\u{100ff}]$\\n1/um)[0]', '"g \u{100f0}\\n1"');
+
 // Check that control letter escapes work with unicode flag
 shouldBe('"this is b\ba test".match(/is b\\cha test/u)[0].length', '11');
 
@@ -197,6 +225,7 @@
 // Check that invalid unicode patterns throw exceptions
 shouldBe('new RegExp("\\\\/", "u").source', '"\\\\/"');
 shouldThrow('r = new RegExp("\\\\u{110000}", "u")', '"SyntaxError: Invalid regular expression: invalid unicode {} escape"');
+shouldThrow('r = new RegExp("\u{10405}{2147483648}", "u")', '"SyntaxError: Invalid regular expression: pattern exceeds string length limits"');
 
 var invalidEscapeException = "SyntaxError: Invalid regular expression: invalid escaped character for unicode pattern";
 var newRegExp;

Modified: trunk/Source/JavaScriptCore/ChangeLog (221051 => 221052)


--- trunk/Source/JavaScriptCore/ChangeLog	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/ChangeLog	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,3 +1,105 @@
+2017-08-22  Michael Saboff  <[email protected]>
+
+        Implement Unicode RegExp support in the YARR JIT
+        https://bugs.webkit.org/show_bug.cgi?id=174646
+
+        Reviewed by Filip Pizlo.
+
+        This support is implemented only for 64-bit platforms.  It wouldn't be too hard to add support
+        for 32-bit platforms that have a reasonable number of spare registers.  This code slightly refactors
+        register usage to reduce the number of callee-save registers used for non-Unicode expressions.
+        For Unicode expressions, several more registers are used to hold constant values for
+        processing surrogate pairs and for discerning whether a character belongs to the Basic
+        Multilingual Plane (BMP) or one of the Supplementary Planes.
+
+        This implements JIT support for Unicode expressions in much the same way as the interpreter.
+        Just like in the interpreter, backtracking code uses more space on the stack to save positions.
+        Moved the BackTrackInfo* structs to YarrPattern as standalone structs.  Added xxxIndex()
+        functions to each of these to simplify how the JIT code reads and writes the structure fields.
+
+        Because reading a surrogate pair and transforming it into a single code point takes a
+        little processing, the code that reads a Unicode character is emitted as a
+        leaf function at the end of the JIT'ed code.  The calling convention for
+        "tryReadUnicodeCharacterHelper()" is non-standard, given that the rest of the code assumes
+        that argument values stay in argument registers for most of the generated code.
+        That helper takes the starting character address in one register, regUnicodeInputAndTrail,
+        and uses another dedicated temporary register, regUnicodeTemp.  The result is typically
+        returned in regT0.  If another return register is requested, we'll create an inline copy of
+        that function.
+
+        Added a new flag to CharacterClass to signify whether a class has non-BMP characters.  This flag
+        is used in optimizeAlternative(), where we swap the order of a fixed character class term with
+        a fixed character term that immediately follows it.  Since a non-BMP character class may
+        increment "index" when matching, such a class must be matched first, before trying to match a
+        fixed character term later in the string.
+
+        Given the usefulness of the LEA instruction on X86 for creating a single pointer value from a
+        base with index and offset, which the YARR JIT uses heavily, I added a new MacroAssembler
+        function, getEffectiveAddress64(), with an ARM64 implementation.  It just calls x86Lea64()
+        on X86-64.  Also added an ImplicitAddress version of load16Unaligned().
+
+        (JSC::MacroAssemblerARM64::load16Unaligned):
+        (JSC::MacroAssemblerARM64::getEffectiveAddress64):
+        * assembler/MacroAssemblerX86Common.h:
+        (JSC::MacroAssemblerX86Common::load16Unaligned):
+        (JSC::MacroAssemblerX86Common::load16):
+        * assembler/MacroAssemblerX86_64.h:
+        (JSC::MacroAssemblerX86_64::getEffectiveAddress64):
+        * create_regex_tables:
+        * runtime/RegExp.cpp:
+        (JSC::RegExp::compile):
+        * yarr/YarrInterpreter.cpp:
+        * yarr/YarrJIT.cpp:
+        (JSC::Yarr::YarrGenerator::optimizeAlternative):
+        (JSC::Yarr::YarrGenerator::matchCharacterClass):
+        (JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
+        (JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
+        (JSC::Yarr::YarrGenerator::readCharacter):
+        (JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
+        (JSC::Yarr::YarrGenerator::matchAssertionWordchar):
+        (JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
+        (JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
+        (JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
+        (JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
+        (JSC::Yarr::YarrGenerator::generate):
+        (JSC::Yarr::YarrGenerator::backtrack):
+        (JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
+        (JSC::Yarr::YarrGenerator::generateEnter):
+        (JSC::Yarr::YarrGenerator::generateReturn):
+        (JSC::Yarr::YarrGenerator::YarrGenerator):
+        (JSC::Yarr::YarrGenerator::compile):
+        * yarr/YarrJIT.h:
+        * yarr/YarrPattern.cpp:
+        (JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
+        (JSC::Yarr::CharacterClassConstructor::reset):
+        (JSC::Yarr::CharacterClassConstructor::charClass):
+        (JSC::Yarr::CharacterClassConstructor::addSorted):
+        (JSC::Yarr::CharacterClassConstructor::addSortedRange):
+        (JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
+        (JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
+        * yarr/YarrPattern.h:
+        (JSC::Yarr::CharacterClass::CharacterClass):
+        (JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
+        (JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
+        (JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoBackReference::beginIndex):
+        (JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
+        (JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
+        (JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
+        (JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
+        (JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):
+
 2017-08-22  Per Arne Vollan  <[email protected]>
 
         Implement 64-bit MacroAssembler::probe support for Windows.

Modified: trunk/Source/JavaScriptCore/assembler/MacroAssemblerARM64.h (221051 => 221052)


--- trunk/Source/JavaScriptCore/assembler/MacroAssemblerARM64.h	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/assembler/MacroAssemblerARM64.h	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1176,6 +1176,11 @@
         m_assembler.ldrh(dest, address.base, memoryTempRegister);
     }
     
+    void load16Unaligned(ImplicitAddress address, RegisterID dest)
+    {
+        load16(address, dest);
+    }
+
     void load16Unaligned(BaseIndex address, RegisterID dest)
     {
         load16(address, dest);
@@ -1535,6 +1540,13 @@
         m_assembler.strb(src, dest, simm);
     }
 
+    void getEffectiveAddress64(BaseIndex address, RegisterID dest)
+    {
+        m_assembler.add<64>(dest, address.base, address.index, ARM64Assembler::LSL, address.scale);
+        if (address.offset)
+            add64(TrustedImm32(address.offset), dest);
+    }
+
     // Floating-point operations:
 
     static bool supportsFloatingPoint() { return true; }

Modified: trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h (221051 => 221052)


--- trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86Common.h	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1163,6 +1163,11 @@
         load32(address, dest);
     }
 
+    void load16Unaligned(ImplicitAddress address, RegisterID dest)
+    {
+        load16(address, dest);
+    }
+
     void load16Unaligned(BaseIndex address, RegisterID dest)
     {
         load16(address, dest);
@@ -1225,6 +1230,11 @@
         m_assembler.movsbl_rr(src, dest);
     }
     
+    void load16(ImplicitAddress address, RegisterID dest)
+    {
+        m_assembler.movzwl_mr(address.offset, address.base, dest);
+    }
+
     void load16(BaseIndex address, RegisterID dest)
     {
         m_assembler.movzwl_mr(address.offset, address.base, address.index, address.scale, dest);

Modified: trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86_64.h (221051 => 221052)


--- trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86_64.h	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/assembler/MacroAssemblerX86_64.h	2017-08-22 22:43:08 UTC (rev 221052)
@@ -364,6 +364,11 @@
         m_assembler.leaq_mr(index.offset, index.base, index.index, index.scale, dest);
     }
 
+    void getEffectiveAddress64(BaseIndex address, RegisterID dest)
+    {
+        return x86Lea64(address, dest);
+    }
+
     void addPtrNoFlags(TrustedImm32 imm, RegisterID srcDest)
     {
         m_assembler.leaq_mr(imm.m_value, srcDest, srcDest);

Modified: trunk/Source/JavaScriptCore/create_regex_tables (221051 => 221052)


--- trunk/Source/JavaScriptCore/create_regex_tables	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/create_regex_tables	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,4 +1,4 @@
-# Copyright (C) 2010, 2013 Apple Inc. All rights reserved.
+# Copyright (C) 2010, 2013-2017 Apple Inc. All rights reserved.
 # 
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -97,6 +97,7 @@
             function += ("    auto characterClass = std::make_unique<CharacterClass>(_%sData, false);\n" % (name))
     else:
         function += ("    auto characterClass = std::make_unique<CharacterClass>();\n")
+    hasNonBMPCharacters = False
     for (min, max) in ranges:
         if (min == max):
             if (min > 127):
@@ -108,6 +109,9 @@
             function += ("    characterClass->m_rangesUnicode.append(CharacterRange(0x%04x, 0x%04x));\n" % (min, max))
         else:
             function += ("    characterClass->m_ranges.append(CharacterRange(0x%02x, 0x%02x));\n" % (min, max))
+        if max >= 0x10000:
+            hasNonBMPCharacters = True
+    function += ("    characterClass->m_hasNonBMPCharacters = %s;\n" % ("true" if hasNonBMPCharacters else "false"))
     function += ("    return characterClass;\n")
     function += ("}\n\n")
     functions += function

Modified: trunk/Source/JavaScriptCore/runtime/RegExp.cpp (221051 => 221052)


--- trunk/Source/JavaScriptCore/runtime/RegExp.cpp	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/runtime/RegExp.cpp	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,6 +1,6 @@
 /*
  *  Copyright (C) 1999-2001, 2004 Harri Porten ([email protected])
- *  Copyright (c) 2007, 2008, 2016 Apple Inc. All rights reserved.
+ *  Copyright (c) 2007, 2008, 2016-2017 Apple Inc. All rights reserved.
  *  Copyright (C) 2009 Torch Mobile, Inc.
  *  Copyright (C) 2010 Peter Varga ([email protected]), University of Szeged
  *
@@ -281,7 +281,7 @@
     }
 
 #if ENABLE(YARR_JIT)
-    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && !unicode() && vm->canUseRegExpJIT()) {
+    if (!pattern.m_containsBackreferences && !pattern.containsUnsignedLengthPattern() && vm->canUseRegExpJIT()) {
         Yarr::jitCompile(pattern, charSize, vm, m_regExpJITCode);
         if (!m_regExpJITCode.isFallBack()) {
             m_state = JITCode;

Modified: trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp (221051 => 221052)


--- trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/yarr/YarrInterpreter.cpp	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013, 2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2017 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga ([email protected]), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -44,30 +44,6 @@
 public:
     struct ParenthesesDisjunctionContext;
 
-    struct BackTrackInfoPatternCharacter {
-        uintptr_t begin; // Only needed for unicode patterns
-        uintptr_t matchAmount;
-    };
-    struct BackTrackInfoCharacterClass {
-        uintptr_t begin; // Only needed for unicode patterns
-        uintptr_t matchAmount;
-    };
-    struct BackTrackInfoBackReference {
-        uintptr_t begin; // Not really needed for greedy quantifiers.
-        uintptr_t matchAmount; // Not really needed for fixed quantifiers.
-    };
-    struct BackTrackInfoAlternative {
-        uintptr_t offset;
-    };
-    struct BackTrackInfoParentheticalAssertion {
-        uintptr_t begin;
-    };
-    struct BackTrackInfoParenthesesOnce {
-        uintptr_t begin;
-    };
-    struct BackTrackInfoParenthesesTerminal {
-        uintptr_t begin;
-    };
     struct BackTrackInfoParentheses {
         uintptr_t matchAmount;
         ParenthesesDisjunctionContext* lastContext;
@@ -2082,12 +2058,12 @@
 }
 
 // These should be the same for both UChar & LChar.
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoPatternCharacter) == (YarrStackSpaceForBackTrackInfoPatternCharacter * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoPatternCharacter);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoCharacterClass) == (YarrStackSpaceForBackTrackInfoCharacterClass * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoCharacterClass);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoBackReference) == (YarrStackSpaceForBackTrackInfoBackReference * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoBackReference);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoAlternative) == (YarrStackSpaceForBackTrackInfoAlternative * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoAlternative);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParentheticalAssertion) == (YarrStackSpaceForBackTrackInfoParentheticalAssertion * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheticalAssertion);
-COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParenthesesOnce) == (YarrStackSpaceForBackTrackInfoParenthesesOnce * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParenthesesOnce);
+COMPILE_ASSERT(sizeof(BackTrackInfoPatternCharacter) == (YarrStackSpaceForBackTrackInfoPatternCharacter * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoPatternCharacter);
+COMPILE_ASSERT(sizeof(BackTrackInfoCharacterClass) == (YarrStackSpaceForBackTrackInfoCharacterClass * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoCharacterClass);
+COMPILE_ASSERT(sizeof(BackTrackInfoBackReference) == (YarrStackSpaceForBackTrackInfoBackReference * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoBackReference);
+COMPILE_ASSERT(sizeof(BackTrackInfoAlternative) == (YarrStackSpaceForBackTrackInfoAlternative * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoAlternative);
+COMPILE_ASSERT(sizeof(BackTrackInfoParentheticalAssertion) == (YarrStackSpaceForBackTrackInfoParentheticalAssertion * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheticalAssertion);
+COMPILE_ASSERT(sizeof(BackTrackInfoParenthesesOnce) == (YarrStackSpaceForBackTrackInfoParenthesesOnce * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParenthesesOnce);
 COMPILE_ASSERT(sizeof(Interpreter<UChar>::BackTrackInfoParentheses) == (YarrStackSpaceForBackTrackInfoParentheses * sizeof(uintptr_t)), CheckYarrStackSpaceForBackTrackInfoParentheses);
 
 

Modified: trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp (221051 => 221052)


--- trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/yarr/YarrJIT.cpp	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013, 2015-2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009-2017 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -51,12 +51,12 @@
 
     static const RegisterID regT0 = ARMRegisters::r4;
     static const RegisterID regT1 = ARMRegisters::r5;
-
     static const RegisterID initialStart = ARMRegisters::r6;
-#define HAVE_INITIAL_START_REG
 
     static const RegisterID returnRegister = ARMRegisters::r0;
     static const RegisterID returnRegister2 = ARMRegisters::r1;
+
+#define HAVE_INITIAL_START_REG
 #elif CPU(ARM64)
     static const RegisterID input = ARM64Registers::x0;
     static const RegisterID index = ARM64Registers::x1;
@@ -65,12 +65,19 @@
 
     static const RegisterID regT0 = ARM64Registers::x4;
     static const RegisterID regT1 = ARM64Registers::x5;
+    static const RegisterID regUnicodeInputAndTrail = ARM64Registers::x6;
+    static const RegisterID regUnicodeTemp = ARM64Registers::x7;
+    static const RegisterID initialStart = ARM64Registers::x8;
+    static const RegisterID supplementaryPlanesBase = ARM64Registers::x9;
+    static const RegisterID surrogateTagMask = ARM64Registers::x10;
+    static const RegisterID leadingSurrogateTag = ARM64Registers::x11;
+    static const RegisterID trailingSurrogateTag = ARM64Registers::x12;
 
-    static const RegisterID initialStart = ARM64Registers::x6;
-#define HAVE_INITIAL_START_REG
-
     static const RegisterID returnRegister = ARM64Registers::x0;
     static const RegisterID returnRegister2 = ARM64Registers::x1;
+
+#define HAVE_INITIAL_START_REG
+#define JIT_UNICODE_EXPRESSIONS
 #elif CPU(MIPS)
     static const RegisterID input = MIPSRegisters::a0;
     static const RegisterID index = MIPSRegisters::a1;
@@ -79,12 +86,12 @@
 
     static const RegisterID regT0 = MIPSRegisters::t4;
     static const RegisterID regT1 = MIPSRegisters::t5;
-
     static const RegisterID initialStart = MIPSRegisters::t6;
-#define HAVE_INITIAL_START_REG
 
     static const RegisterID returnRegister = MIPSRegisters::v0;
     static const RegisterID returnRegister2 = MIPSRegisters::v1;
+
+#define HAVE_INITIAL_START_REG
 #elif CPU(X86)
     static const RegisterID input = X86Registers::eax;
     static const RegisterID index = X86Registers::edx;
@@ -113,22 +120,35 @@
 #endif
 
     static const RegisterID regT0 = X86Registers::eax;
-    static const RegisterID regT1 = X86Registers::ebx;
+#if !OS(WINDOWS)
+    static const RegisterID regT1 = X86Registers::r8;
+#else
+    static const RegisterID regT1 = X86Registers::ecx;
+#endif
 
+    static const RegisterID initialStart = X86Registers::ebx;
 #if !OS(WINDOWS)
-    static const RegisterID initialStart = X86Registers::r8;
+    static const RegisterID regUnicodeInputAndTrail = X86Registers::r9;
+    static const RegisterID regUnicodeTemp = X86Registers::r10;
 #else
-    static const RegisterID initialStart = X86Registers::ecx;
+    static const RegisterID regUnicodeInputAndTrail = X86Registers::esi;
+    static const RegisterID regUnicodeTemp = X86Registers::edi;
 #endif
-#define HAVE_INITIAL_START_REG
+    static const RegisterID supplementaryPlanesBase = X86Registers::r12;
+    static const RegisterID surrogateTagMask = X86Registers::r13;
+    static const RegisterID leadingSurrogateTag = X86Registers::r14;
+    static const RegisterID trailingSurrogateTag = X86Registers::r15;
 
     static const RegisterID returnRegister = X86Registers::eax;
     static const RegisterID returnRegister2 = X86Registers::edx;
+
+#define HAVE_INITIAL_START_REG
+#define JIT_UNICODE_EXPRESSIONS
 #endif
 
     void optimizeAlternative(PatternAlternative* alternative)
     {
-        if (!alternative->m_terms.size())
+        if (!alternative->m_terms.size() || m_decodeSurrogatePairs)
             return;
 
         for (unsigned i = 0; i < alternative->m_terms.size() - 1; ++i) {
@@ -135,8 +155,10 @@
             PatternTerm& term = alternative->m_terms[i];
             PatternTerm& nextTerm = alternative->m_terms[i + 1];
 
+            // We can move BMP only character classes after fixed character terms.
             if ((term.type == PatternTerm::TypeCharacterClass)
                 && (term.quantityType == QuantifierFixedCount)
+                && (!term.characterClass->m_hasNonBMPCharacters)
                 && (nextTerm.type == PatternTerm::TypePatternCharacter)
                 && (nextTerm.quantityType == QuantifierFixedCount)) {
                 PatternTerm termCopy = term;
@@ -195,14 +217,16 @@
 
     void matchCharacterClass(RegisterID character, JumpList& matchDest, const CharacterClass* charClass)
     {
-        if (charClass->m_table) {
+        if (charClass->m_table && !m_decodeSurrogatePairs) {
             ExtendedAddress tableEntry(character, reinterpret_cast<intptr_t>(charClass->m_table));
             matchDest.append(branchTest8(charClass->m_tableInverted ? Zero : NonZero, tableEntry));
             return;
         }
-        Jump unicodeFail;
+        JumpList unicodeFail;
         if (charClass->m_matchesUnicode.size() || charClass->m_rangesUnicode.size()) {
-            Jump isAscii = branch32(LessThanOrEqual, character, TrustedImm32(0x7f));
+            JumpList isAscii;
+            if (charClass->m_matches.size() || charClass->m_ranges.size())
+                isAscii.append(branch32(LessThanOrEqual, character, TrustedImm32(0x7f)));
 
             if (charClass->m_matchesUnicode.size()) {
                 for (unsigned i = 0; i < charClass->m_matchesUnicode.size(); ++i) {
@@ -222,7 +246,8 @@
                 }
             }
 
-            unicodeFail = jump();
+            if (charClass->m_matches.size() || charClass->m_ranges.size())
+                unicodeFail = jump();
             isAscii.link(this);
         }
 
@@ -322,6 +347,42 @@
         return BaseIndex(input, indexReg, TimesTwo, (characterOffset * static_cast<int32_t>(sizeof(UChar))).unsafeGet());
     }
 
+#ifdef JIT_UNICODE_EXPRESSIONS
+    void tryReadUnicodeCharImpl(RegisterID resultReg)
+    {
+        ASSERT(m_charSize == Char16);
+
+        JumpList notUnicode;
+        load16Unaligned(regUnicodeInputAndTrail, resultReg);
+        and32(surrogateTagMask, resultReg, regUnicodeTemp);
+        notUnicode.append(branch32(NotEqual, regUnicodeTemp, leadingSurrogateTag));
+        addPtr(TrustedImm32(2), regUnicodeInputAndTrail);
+        getEffectiveAddress64(BaseIndex(input, length, TimesTwo), regUnicodeTemp);
+        notUnicode.append(branch32(AboveOrEqual, regUnicodeInputAndTrail, regUnicodeTemp));
+        load16Unaligned(Address(regUnicodeInputAndTrail), regUnicodeInputAndTrail);
+        and32(surrogateTagMask, regUnicodeInputAndTrail, regUnicodeTemp);
+        notUnicode.append(branch32(NotEqual, regUnicodeTemp, trailingSurrogateTag));
+        sub32(leadingSurrogateTag, resultReg);
+        sub32(trailingSurrogateTag, regUnicodeInputAndTrail);
+        lshift32(TrustedImm32(10), resultReg);
+        or32(regUnicodeInputAndTrail, resultReg);
+        add32(supplementaryPlanesBase, resultReg);
+        notUnicode.link(this);
+    }
+
+    void tryReadUnicodeChar(BaseIndex address, RegisterID resultReg)
+    {
+        ASSERT(m_charSize == Char16);
+
+        getEffectiveAddress64(address, regUnicodeInputAndTrail);
+
+        if (resultReg == regT0)
+            m_tryReadUnicodeCharacterCalls.append(nearCall());
+        else
+            tryReadUnicodeCharImpl(resultReg);
+    }
+#endif
+
     void readCharacter(Checked<unsigned> negativeCharacterOffset, RegisterID resultReg, RegisterID indexReg = index)
     {
         BaseIndex address = negativeOffsetIndexedAddress(negativeCharacterOffset, resultReg, indexReg);
@@ -328,6 +389,10 @@
 
         if (m_charSize == Char8)
             load8(address, resultReg);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        else if (m_decodeSurrogatePairs)
+            tryReadUnicodeChar(address, resultReg);
+#endif
         else
             load16Unaligned(address, resultReg);
     }
@@ -338,7 +403,7 @@
 
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch)) {
             or32(TrustedImm32(0x20), character);
             ch |= 0x20;
@@ -746,7 +811,15 @@
             nextIsNotWordChar.append(atEndOfInput());
 
         readCharacter(m_checkedOffset - term->inputPosition, character);
-        matchCharacterClass(character, nextIsWordChar, m_pattern.wordcharCharacterClass());
+
+        CharacterClass* wordcharCharacterClass;
+
+        if (m_unicodeIgnoreCase)
+            wordcharCharacterClass = m_pattern.wordUnicodeIgnoreCaseCharCharacterClass();
+        else
+            wordcharCharacterClass = m_pattern.wordcharCharacterClass();
+
+        matchCharacterClass(character, nextIsWordChar, wordcharCharacterClass);
     }
 
     void generateAssertionWordBoundary(size_t opIndex)
@@ -761,7 +834,15 @@
         if (!term->inputPosition)
             atBegin = branch32(Equal, index, Imm32(m_checkedOffset.unsafeGet()));
         readCharacter(m_checkedOffset - term->inputPosition + 1, character);
-        matchCharacterClass(character, matchDest, m_pattern.wordcharCharacterClass());
+
+        CharacterClass* wordcharCharacterClass;
+
+        if (m_unicodeIgnoreCase)
+            wordcharCharacterClass = m_pattern.wordUnicodeIgnoreCaseCharCharacterClass();
+        else
+            wordcharCharacterClass = m_pattern.wordcharCharacterClass();
+
+        matchCharacterClass(character, matchDest, wordcharCharacterClass);
         if (!term->inputPosition)
             atBegin.link(this);
 
@@ -833,7 +914,7 @@
 
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
 
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch))
 #if CPU(BIG_ENDIAN)
@@ -848,7 +929,8 @@
             if (nextTerm->type != PatternTerm::TypePatternCharacter
                 || nextTerm->quantityType != QuantifierFixedCount
                 || nextTerm->quantityMaxCount != 1
-                || nextTerm->inputPosition != (startTermPosition + numberCharacters))
+                || nextTerm->inputPosition != (startTermPosition + numberCharacters)
+                || (U16_LENGTH(nextTerm->patternCharacter) != 1 && m_decodeSurrogatePairs))
                 break;
 
             nextOp->m_isDeadCode = true;
@@ -869,7 +951,7 @@
 
             // For case-insesitive compares, non-ascii characters that have different
             // upper & lower case representations are converted to a character class.
-            ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter));
+            ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(currentCharacter) || isCanonicallyUnique(currentCharacter, m_canonicalMode));
 
             allCharacters |= (currentCharacter << shiftAmount);
 
@@ -930,13 +1012,15 @@
         const RegisterID countRegister = regT1;
 
         move(index, countRegister);
-        sub32(Imm32(term->quantityMaxCount.unsafeGet()), countRegister);
+        Checked<unsigned> scaledMaxCount = term->quantityMaxCount;
+        scaledMaxCount *= U_IS_BMP(ch) ? 1 : 2;
+        sub32(Imm32(scaledMaxCount.unsafeGet()), countRegister);
 
         Label loop(this);
-        readCharacter(m_checkedOffset - term->inputPosition - term->quantityMaxCount, character, countRegister);
+        readCharacter(m_checkedOffset - term->inputPosition - scaledMaxCount, character, countRegister);
         // For case-insesitive compares, non-ascii characters that have different
         // upper & lower case representations are converted to a character class.
-        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch));
+        ASSERT(!m_pattern.ignoreCase() || isASCIIAlpha(ch) || isCanonicallyUnique(ch, m_canonicalMode));
         if (m_pattern.ignoreCase() && isASCIIAlpha(ch)) {
             or32(TrustedImm32(0x20), character);
             ch |= 0x20;
@@ -943,7 +1027,12 @@
         }
 
         op.m_jumps.append(branch32(NotEqual, character, Imm32(ch)));
-        add32(TrustedImm32(1), countRegister);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs && !U_IS_BMP(ch))
+            add32(TrustedImm32(2), countRegister);
+        else
+#endif
+            add32(TrustedImm32(1), countRegister);
         branch32(NotEqual, countRegister, index).linkTo(loop, this);
     }
     void backtrackPatternCharacterFixed(size_t opIndex)
@@ -969,8 +1058,18 @@
             failures.append(atEndOfInput());
             failures.append(jumpIfCharNotEquals(ch, m_checkedOffset - term->inputPosition, character));
 
+            add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+            if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+                Jump surrogatePairOk = notAtEndOfInput();
+                sub32(TrustedImm32(1), index);
+                failures.append(jump());
+                surrogatePairOk.link(this);
+                add32(TrustedImm32(1), index);
+            }
+#endif
             add32(TrustedImm32(1), countRegister);
-            add32(TrustedImm32(1), index);
+
             if (term->quantityMaxCount == quantifyInfinite)
                 jump(loop);
             else
@@ -980,7 +1079,7 @@
         }
         op.m_reentry = label();
 
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex());
     }
     void backtrackPatternCharacterGreedy(size_t opIndex)
     {
@@ -991,10 +1090,13 @@
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex(), countRegister);
         m_backtrackingState.append(branchTest32(Zero, countRegister));
         sub32(TrustedImm32(1), countRegister);
-        sub32(TrustedImm32(1), index);
+        if (!m_decodeSurrogatePairs || U_IS_BMP(term->patternCharacter))
+            sub32(TrustedImm32(1), index);
+        else
+            sub32(TrustedImm32(2), index);
         jump(op.m_reentry);
     }
 
@@ -1007,7 +1109,7 @@
 
         move(TrustedImm32(0), countRegister);
         op.m_reentry = label();
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex());
     }
     void backtrackPatternCharacterNonGreedy(size_t opIndex)
     {
@@ -1020,7 +1122,7 @@
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoPatternCharacter::matchAmountIndex(), countRegister);
 
         // Unless have a 16 bit pattern character and an 8 bit string - short circuit
         if (!((ch > 0xff) && (m_charSize == Char8))) {
@@ -1030,13 +1132,27 @@
                 nonGreedyFailures.append(branch32(Equal, countRegister, Imm32(term->quantityMaxCount.unsafeGet())));
             nonGreedyFailures.append(jumpIfCharNotEquals(ch, m_checkedOffset - term->inputPosition, character));
 
+            add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+            if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+                Jump surrogatePairOk = notAtEndOfInput();
+                sub32(TrustedImm32(1), index);
+                nonGreedyFailures.append(jump());
+                surrogatePairOk.link(this);
+                add32(TrustedImm32(1), index);
+            }
+#endif
             add32(TrustedImm32(1), countRegister);
-            add32(TrustedImm32(1), index);
 
             jump(op.m_reentry);
             nonGreedyFailures.link(this);
         }
 
+        if (m_decodeSurrogatePairs && !U_IS_BMP(ch)) {
+            // subtract countRegister*2 for non-BMP characters
+            lshift32(TrustedImm32(1), countRegister);
+        }
+
         sub32(countRegister, index);
         m_backtrackingState.fallthrough();
     }
@@ -1048,6 +1164,9 @@
 
         const RegisterID character = regT0;
 
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
+
         JumpList matchDest;
         readCharacter(m_checkedOffset - term->inputPosition, character);
         matchCharacterClass(character, matchDest, term->characterClass);
@@ -1058,9 +1177,27 @@
             op.m_jumps.append(jump());
             matchDest.link(this);
         }
+
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
     }
     void backtrackCharacterClassOnce(size_t opIndex)
     {
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            YarrOp& op = m_ops[opIndex];
+            PatternTerm* term = op.m_term;
+
+            m_backtrackingState.link(this);
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+            m_backtrackingState.fallthrough();
+        }
+#endif
         backtrackTermDefault(opIndex);
     }
 
@@ -1088,6 +1225,15 @@
         }
 
         add32(TrustedImm32(1), countRegister);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            op.m_jumps.append(atEndOfInput());
+            add32(TrustedImm32(1), countRegister);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
         branch32(NotEqual, countRegister, index).linkTo(loop, this);
     }
     void backtrackCharacterClassFixed(size_t opIndex)
@@ -1103,6 +1249,8 @@
         const RegisterID character = regT0;
         const RegisterID countRegister = regT1;
 
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
         move(TrustedImm32(0), countRegister);
 
         JumpList failures;
@@ -1122,6 +1270,15 @@
 
         add32(TrustedImm32(1), countRegister);
         add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            failures.append(atEndOfInput());
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
+
         if (term->quantityMaxCount != quantifyInfinite) {
             branch32(NotEqual, countRegister, Imm32(term->quantityMaxCount.unsafeGet())).linkTo(loop, this);
             failures.append(jump());
@@ -1131,7 +1288,7 @@
         failures.link(this);
         op.m_reentry = label();
 
-        storeToFrame(countRegister, term->frameLocation);
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     }
     void backtrackCharacterClassGreedy(size_t opIndex)
     {
@@ -1142,10 +1299,34 @@
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
         m_backtrackingState.append(branchTest32(Zero, countRegister));
         sub32(TrustedImm32(1), countRegister);
-        sub32(TrustedImm32(1), index);
+        if (!m_decodeSurrogatePairs)
+            sub32(TrustedImm32(1), index);
+        else {
+            const RegisterID character = regT0;
+
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+            // Rematch one less
+            storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
+
+            Label rematchLoop(this);
+            readCharacter(m_checkedOffset - term->inputPosition, character);
+
+            sub32(TrustedImm32(1), countRegister);
+            add32(TrustedImm32(1), index);
+
+#ifdef JIT_UNICODE_EXPRESSIONS
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+#endif
+
+            branchTest32(Zero, countRegister).linkTo(rematchLoop, this);
+
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
+        }
         jump(op.m_reentry);
     }
 
@@ -1158,8 +1339,11 @@
 
         move(TrustedImm32(0), countRegister);
         op.m_reentry = label();
-        storeToFrame(countRegister, term->frameLocation);
+        if (m_decodeSurrogatePairs)
+            storeToFrame(index, term->frameLocation + BackTrackInfoCharacterClass::beginIndex());
+        storeToFrame(countRegister, term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex());
     }
+
     void backtrackCharacterClassNonGreedy(size_t opIndex)
     {
         YarrOp& op = m_ops[opIndex];
@@ -1172,7 +1356,9 @@
 
         m_backtrackingState.link(this);
 
-        loadFromFrame(term->frameLocation, countRegister);
+        if (m_decodeSurrogatePairs)
+            loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::beginIndex(), index);
+        loadFromFrame(term->frameLocation + BackTrackInfoCharacterClass::matchAmountIndex(), countRegister);
 
         nonGreedyFailures.append(atEndOfInput());
         nonGreedyFailures.append(branch32(Equal, countRegister, Imm32(term->quantityMaxCount.unsafeGet())));
@@ -1190,6 +1376,13 @@
 
         add32(TrustedImm32(1), countRegister);
         add32(TrustedImm32(1), index);
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            Jump isBMPChar = branch32(LessThan, character, supplementaryPlanesBase);
+            add32(TrustedImm32(1), index);
+            isBMPChar.link(this);
+        }
+#endif
 
         jump(op.m_reentry);
 
@@ -1650,12 +1843,12 @@
                 //
                 // FIXME: for capturing parens, could use the index in the capture array?
                 if (term->quantityType == QuantifierGreedy)
-                    storeToFrame(index, parenthesesFrameLocation);
+                    storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                 else if (term->quantityType == QuantifierNonGreedy) {
-                    storeToFrame(TrustedImm32(-1), parenthesesFrameLocation);
+                    storeToFrame(TrustedImm32(-1), parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                     op.m_jumps.append(jump());
                     op.m_reentry = label();
-                    storeToFrame(index, parenthesesFrameLocation);
+                    storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                 }
 
                 // If the parenthese are capturing, store the starting index value to the
@@ -1731,7 +1924,7 @@
 
                 // Store the start index of the current match; we need to reject zero
                 // length matches.
-                storeToFrame(index, term->frameLocation);
+                storeToFrame(index, term->frameLocation + BackTrackInfoParenthesesTerminal::beginIndex());
                 break;
             }
             case OpParenthesesSubpatternTerminalEnd: {
@@ -1764,7 +1957,7 @@
                 // Store the current index - assertions should not update index, so
                 // we will need to restore it upon a successful match.
                 unsigned parenthesesFrameLocation = term->frameLocation;
-                storeToFrame(index, parenthesesFrameLocation);
+                storeToFrame(index, parenthesesFrameLocation + BackTrackInfoParentheticalAssertion::beginIndex());
 
                 // Check 
                 op.m_checkAdjust = m_checkedOffset - term->inputPosition;
@@ -1779,7 +1972,7 @@
 
                 // Restore the input index value.
                 unsigned parenthesesFrameLocation = term->frameLocation;
-                loadFromFrame(parenthesesFrameLocation, index);
+                loadFromFrame(parenthesesFrameLocation + BackTrackInfoParentheticalAssertion::beginIndex(), index);
 
                 // If inverted, a successful match of the assertion must be treated
                 // as a failure, so jump to backtracking.
@@ -2221,7 +2414,7 @@
                     if (term->quantityType == QuantifierGreedy) {
                         // Clear the flag in the stackframe indicating we ran through the subpattern.
                         unsigned parenthesesFrameLocation = term->frameLocation;
-                        storeToFrame(TrustedImm32(-1), parenthesesFrameLocation);
+                        storeToFrame(TrustedImm32(-1), parenthesesFrameLocation + BackTrackInfoParenthesesOnce::beginIndex());
                         // Jump to after the parentheses, skipping the subpattern.
                         jump(m_ops[op.m_nextOp].m_reentry);
                         // A backtrack from after the parentheses, when skipping the subpattern,
@@ -2590,12 +2783,46 @@
         lastOp.m_nextOp = repeatLoop;
     }
 
+    void generateTryReadUnicodeCharacterHelper()
+    {
+#ifdef JIT_UNICODE_EXPRESSIONS
+        if (m_tryReadUnicodeCharacterCalls.isEmpty())
+            return;
+
+        ASSERT(m_decodeSurrogatePairs);
+
+        m_tryReadUnicodeCharacterEntry = label();
+
+        tryReadUnicodeCharImpl(regT0);
+
+        ret();
+#endif
+    }
+
     void generateEnter()
     {
 #if CPU(X86_64)
         push(X86Registers::ebp);
         move(stackPointerRegister, X86Registers::ebp);
-        push(X86Registers::ebx);
+
+        if (m_pattern.m_saveInitialStartValue)
+            push(X86Registers::ebx);
+
+        if (m_decodeSurrogatePairs) {
+#if OS(WINDOWS)
+            push(X86Registers::edi);
+            push(X86Registers::esi);
+#endif
+            push(X86Registers::r12);
+            push(X86Registers::r13);
+            push(X86Registers::r14);
+            push(X86Registers::r15);
+
+            move(TrustedImm32(0x10000), supplementaryPlanesBase);
+            move(TrustedImm32(0xfffffc00), surrogateTagMask);
+            move(TrustedImm32(0xd800), leadingSurrogateTag);
+            move(TrustedImm32(0xdc00), trailingSurrogateTag);
+        }
         // The ABI doesn't guarantee the upper bits are zero on unsigned arguments, so clear them ourselves.
         zeroExtend32ToPtr(index, index);
         zeroExtend32ToPtr(length, length);
@@ -2622,6 +2849,14 @@
             loadPtr(Address(X86Registers::ebp, 2 * sizeof(void*)), output);
     #endif
 #elif CPU(ARM64)
+        if (m_decodeSurrogatePairs) {
+            pushPair(framePointerRegister, linkRegister);
+            move(TrustedImm32(0x10000), supplementaryPlanesBase);
+            move(TrustedImm32(0xfffffc00), surrogateTagMask);
+            move(TrustedImm32(0xd800), leadingSurrogateTag);
+            move(TrustedImm32(0xdc00), trailingSurrogateTag);
+        }
+
         // The ABI doesn't guarantee the upper bits are zero on unsigned arguments, so clear them ourselves.
         zeroExtend32ToPtr(index, index);
         zeroExtend32ToPtr(length, length);
@@ -2647,7 +2882,19 @@
         store64(returnRegister2, Address(X86Registers::ecx, sizeof(void*)));
         move(X86Registers::ecx, returnRegister);
 #endif
-        pop(X86Registers::ebx);
+        if (m_decodeSurrogatePairs) {
+            pop(X86Registers::r15);
+            pop(X86Registers::r14);
+            pop(X86Registers::r13);
+            pop(X86Registers::r12);
+#if OS(WINDOWS)
+            pop(X86Registers::esi);
+            pop(X86Registers::edi);
+#endif
+        }
+
+        if (m_pattern.m_saveInitialStartValue)
+            pop(X86Registers::ebx);
         pop(X86Registers::ebp);
 #elif CPU(X86)
         pop(X86Registers::esi);
@@ -2654,6 +2901,9 @@
         pop(X86Registers::edi);
         pop(X86Registers::ebx);
         pop(X86Registers::ebp);
+#elif CPU(ARM64)
+        if (m_decodeSurrogatePairs)
+            popPair(framePointerRegister, linkRegister);
 #elif CPU(ARM)
         pop(ARMRegisters::r6);
         pop(ARMRegisters::r5);
@@ -2670,11 +2920,21 @@
         , m_pattern(pattern)
         , m_charSize(charSize)
         , m_shouldFallBack(false)
+        , m_decodeSurrogatePairs(m_charSize == Char16 && m_pattern.unicode())
+        , m_unicodeIgnoreCase(m_pattern.unicode() && m_pattern.ignoreCase())
+        , m_canonicalMode(m_pattern.unicode() ? CanonicalMode::Unicode : CanonicalMode::UCS2)
     {
     }
 
     void compile(YarrCodeBlock& jitObject)
     {
+#ifndef JIT_UNICODE_EXPRESSIONS
+        if (m_decodeSurrogatePairs) {
+            jitObject.setFallBack(true);
+            return;
+        }
+#endif
+
         generateEnter();
 
         Jump hasInput = checkInput();
@@ -2709,6 +2969,8 @@
         generate();
         backtrack();
 
+        generateTryReadUnicodeCharacterHelper();
+
         LinkBuffer linkBuffer(*this, REGEXP_CODE_ID, JITCompilationCanFail);
         if (linkBuffer.didFailToAllocate()) {
             jitObject.setFallBack(true);
@@ -2715,6 +2977,13 @@
             return;
         }
 
+        if (!m_tryReadUnicodeCharacterCalls.isEmpty()) {
+            CodeLocationLabel tryReadUnicodeCharacterHelper = linkBuffer.locationOf(m_tryReadUnicodeCharacterEntry);
+
+            for (auto call : m_tryReadUnicodeCharacterCalls)
+                linkBuffer.link(call, tryReadUnicodeCharacterHelper);
+        }
+
         m_backtrackingState.linkDataLabels(linkBuffer);
 
         if (compileMode == MatchOnly) {
@@ -2742,6 +3011,12 @@
     // supported in the JIT; fall back to the interpreter when this is detected.
     bool m_shouldFallBack;
 
+    bool m_decodeSurrogatePairs;
+    bool m_unicodeIgnoreCase;
+    CanonicalMode m_canonicalMode;
+    Vector<Call> m_tryReadUnicodeCharacterCalls;
+    Label m_tryReadUnicodeCharacterEntry;
+
     // The regular expression expressed as a linear sequence of operations.
     Vector<YarrOp, 128> m_ops;
 

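The surrogate-pair decoding that tryReadUnicodeCharImpl() emits as machine code can be read more easily as ordinary C++. The following is a minimal sketch, not code from this patch: the function name, its signature, and the use of an index instead of a raw address are assumptions made for illustration. The four constants are the values that generateEnter() loads into the dedicated registers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same values that generateEnter() moves into the dedicated registers.
constexpr uint32_t supplementaryPlanesBase = 0x10000;
constexpr uint32_t surrogateTagMask = 0xfffffc00;
constexpr uint32_t leadingSurrogateTag = 0xd800;
constexpr uint32_t trailingSurrogateTag = 0xdc00;

// Hypothetical scalar equivalent of the JIT'ed helper. Returns the code unit
// at `index` unchanged unless it is a lead surrogate followed, within bounds,
// by a trail surrogate; in that case returns the combined code point.
uint32_t tryReadUnicodeCharSketch(const char16_t* input, size_t length, size_t index)
{
    assert(index < length);

    uint32_t result = input[index];
    if ((result & surrogateTagMask) != leadingSurrogateTag)
        return result; // Not a lead surrogate.

    if (index + 1 >= length)
        return result; // Lead surrogate is the last code unit.

    uint32_t trail = input[index + 1];
    if ((trail & surrogateTagMask) != trailingSurrogateTag)
        return result; // Unpaired lead surrogate.

    // Strip both tags, combine the two 10-bit payloads, and rebase.
    return (((result - leadingSurrogateTag) << 10) | (trail - trailingSurrogateTag))
        + supplementaryPlanesBase;
}
```

As in the generated code, every early exit leaves the first code unit as the result, so unpaired surrogates are matched as individual code units.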
Modified: trunk/Source/JavaScriptCore/yarr/YarrJIT.h (221051 => 221052)


--- trunk/Source/JavaScriptCore/yarr/YarrJIT.h	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/yarr/YarrJIT.h	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009 Apple Inc. All rights reserved.
+ * Copyright (C) 2009-2017 Apple Inc. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions

Modified: trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp (221051 => 221052)


--- trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/yarr/YarrPattern.cpp	2017-08-22 22:43:08 UTC (rev 221052)
@@ -45,6 +45,7 @@
 public:
     CharacterClassConstructor(bool isCaseInsensitive, CanonicalMode canonicalMode)
         : m_isCaseInsensitive(isCaseInsensitive)
+        , m_hasNonBMPCharacters(false)
         , m_canonicalMode(canonicalMode)
     {
     }
@@ -55,6 +56,7 @@
         m_ranges.clear();
         m_matchesUnicode.clear();
         m_rangesUnicode.clear();
+        m_hasNonBMPCharacters = false;
     }
 
     void append(const CharacterClass* other)
@@ -185,6 +187,7 @@
         characterClass->m_ranges.swap(m_ranges);
         characterClass->m_matchesUnicode.swap(m_matchesUnicode);
         characterClass->m_rangesUnicode.swap(m_rangesUnicode);
+        characterClass->m_hasNonBMPCharacters = hasNonBMPCharacters();
 
         return characterClass;
     }
@@ -200,6 +203,9 @@
         unsigned pos = 0;
         unsigned range = matches.size();
 
+        if (!U_IS_BMP(ch))
+            m_hasNonBMPCharacters = true;
+
         // binary chop, find position to insert char.
         while (range) {
             unsigned index = range >> 1;
@@ -224,7 +230,10 @@
     void addSortedRange(Vector<CharacterRange>& ranges, UChar32 lo, UChar32 hi)
     {
         unsigned end = ranges.size();
-        
+
+        if (!U_IS_BMP(hi))
+            m_hasNonBMPCharacters = true;
+
         // Simple linear scan - I doubt there are that many ranges anyway...
         // feel free to fix this with something faster (eg binary chop).
         for (unsigned i = 0; i < end; ++i) {
@@ -266,7 +275,13 @@
         ranges.append(CharacterRange(lo, hi));
     }
 
+    bool hasNonBMPCharacters()
+    {
+        return m_hasNonBMPCharacters;
+    }
+
     bool m_isCaseInsensitive;
+    bool m_hasNonBMPCharacters;
     CanonicalMode m_canonicalMode;
 
     Vector<UChar32> m_matches;
@@ -617,7 +632,11 @@
                     currentCallFrameSize += YarrStackSpaceForBackTrackInfoPatternCharacter;
                     alternative->m_hasFixedSize = false;
                 } else if (m_pattern.unicode()) {
-                    currentInputPosition += U16_LENGTH(term.patternCharacter) * term.quantityMaxCount;
+                    Checked<unsigned, RecordOverflow> tempCount = term.quantityMaxCount;
+                    tempCount *= U16_LENGTH(term.patternCharacter);
+                    if (tempCount.hasOverflowed())
+                        return YarrPattern::OffsetTooLarge;
+                    currentInputPosition += tempCount;
                 } else
                     currentInputPosition += term.quantityMaxCount;
                 break;
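
Note on the hunk above: the unchecked multiplication U16_LENGTH(term.patternCharacter) * term.quantityMaxCount could wrap around for a non-BMP pattern character with a very large maximum count, so the patch switches to a checked multiply and reports OffsetTooLarge on overflow. The standalone sketch below illustrates the same idea without WTF's Checked<> type. The names utf16Length and advanceOverflows are hypothetical and not part of the patch, and the sketch assumes unsigned is at most 32 bits wide so that the 64-bit product cannot itself wrap.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for ICU's U16_LENGTH: the number of UTF-16 code
// units needed to encode a code point (1 for BMP, 2 for supplementary planes).
constexpr unsigned utf16Length(char32_t codePoint)
{
    return codePoint <= 0xFFFF ? 1u : 2u;
}

// Hypothetical helper: true when count * utf16Length(ch) does not fit in an
// unsigned, i.e. the case where the patch returns YarrPattern::OffsetTooLarge.
// Assumes unsigned is no wider than 32 bits, so the 64-bit product is exact.
constexpr bool advanceOverflows(char32_t patternCharacter, unsigned quantityMaxCount)
{
    return static_cast<uint64_t>(quantityMaxCount) * utf16Length(patternCharacter)
        > std::numeric_limits<unsigned>::max();
}

// A BMP character advances one code unit per repetition, so any count fits.
static_assert(!advanceOverflows(U'a', std::numeric_limits<unsigned>::max()), "BMP never overflows");
// A non-BMP character advances two code units per repetition and can overflow.
static_assert(advanceOverflows(0x1F600, std::numeric_limits<unsigned>::max()), "non-BMP can overflow");
```

Widening to 64 bits is only one way to detect the overflow; the patch itself relies on Checked<unsigned, RecordOverflow> and hasOverflowed().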

Modified: trunk/Source/JavaScriptCore/yarr/YarrPattern.h (221051 => 221052)


--- trunk/Source/JavaScriptCore/yarr/YarrPattern.h	2017-08-22 21:54:59 UTC (rev 221051)
+++ trunk/Source/JavaScriptCore/yarr/YarrPattern.h	2017-08-22 22:43:08 UTC (rev 221052)
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2009, 2013-2014, 2016 Apple Inc. All rights reserved.
+ * Copyright (C) 2009, 2013-2017 Apple Inc. All rights reserved.
  * Copyright (C) 2010 Peter Varga ([email protected]), University of Szeged
  *
  * Redistribution and use in source and binary forms, with or without
@@ -56,11 +56,13 @@
     // specified matches and ranges)
     CharacterClass()
         : m_table(0)
+        , m_hasNonBMPCharacters(false)
     {
     }
     CharacterClass(const char* table, bool inverted)
         : m_table(table)
         , m_tableInverted(inverted)
+        , m_hasNonBMPCharacters(false)
     {
     }
     Vector<UChar32> m_matches;
@@ -70,6 +72,7 @@
 
     const char* m_table;
     bool m_tableInverted;
+    bool m_hasNonBMPCharacters;
 };
 
 enum QuantifierType {
@@ -493,4 +496,52 @@
     CharacterClass* nonwordUnicodeIgnoreCasecharCached;
 };
 
+    struct BackTrackInfoPatternCharacter {
+        uintptr_t begin; // Only needed for unicode patterns
+        uintptr_t matchAmount;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoPatternCharacter, begin) / sizeof(uintptr_t); }
+        static unsigned matchAmountIndex() { return offsetof(BackTrackInfoPatternCharacter, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoCharacterClass {
+        uintptr_t begin; // Only needed for unicode patterns
+        uintptr_t matchAmount;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoCharacterClass, begin) / sizeof(uintptr_t); }
+        static unsigned matchAmountIndex() { return offsetof(BackTrackInfoCharacterClass, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoBackReference {
+        uintptr_t begin; // Not really needed for greedy quantifiers.
+        uintptr_t matchAmount; // Not really needed for fixed quantifiers.
+
+        unsigned beginIndex() { return offsetof(BackTrackInfoBackReference, begin) / sizeof(uintptr_t); }
+        unsigned matchAmountIndex() { return offsetof(BackTrackInfoBackReference, matchAmount) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoAlternative {
+        uintptr_t offset;
+
+        static unsigned offsetIndex() { return offsetof(BackTrackInfoAlternative, offset) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParentheticalAssertion {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParentheticalAssertion, begin) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParenthesesOnce {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParenthesesOnce, begin) / sizeof(uintptr_t); }
+    };
+
+    struct BackTrackInfoParenthesesTerminal {
+        uintptr_t begin;
+
+        static unsigned beginIndex() { return offsetof(BackTrackInfoParenthesesTerminal, begin) / sizeof(uintptr_t); }
+    };
+
 } } // namespace JSC::Yarr
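
Note on the BackTrackInfo* structs above: each xxxIndex() function converts a field's byte offset into a slot index counted in uintptr_t-sized units, which the log message says is meant to simplify how the JIT code reads and writes the structure fields. The sketch below shows that arithmetic on a stand-in struct. BackTrackInfoSketch, storeMatchAmount, and the frame/frameLocation parameters are hypothetical names for illustration, not the JIT's actual interface.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in with the same layout as BackTrackInfoPatternCharacter.
struct BackTrackInfoSketch {
    uintptr_t begin;
    uintptr_t matchAmount;

    // Byte offset divided by the slot size gives the slot index of the field.
    static constexpr unsigned beginIndex() { return offsetof(BackTrackInfoSketch, begin) / sizeof(uintptr_t); }
    static constexpr unsigned matchAmountIndex() { return offsetof(BackTrackInfoSketch, matchAmount) / sizeof(uintptr_t); }
};

// Hypothetical accessor: treats the backtracking frame as an array of
// uintptr_t slots and writes one field of the term located at frameLocation.
inline void storeMatchAmount(uintptr_t* frame, unsigned frameLocation, uintptr_t value)
{
    frame[frameLocation + BackTrackInfoSketch::matchAmountIndex()] = value;
}

static_assert(BackTrackInfoSketch::beginIndex() == 0, "begin occupies the first slot");
static_assert(BackTrackInfoSketch::matchAmountIndex() == 1, "matchAmount occupies the second slot");
```

Deriving the indices from offsetof keeps callers correct if a field is added or reordered, rather than hard-coding slot numbers at each use.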