https://bugzilla.wikimedia.org/show_bug.cgi?id=46773

--- Comment #11 from Antoine "hashar" Musso <[email protected]> ---
Created attachment 12734
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=12734&action=edit
PCRE unit tests without and with unicode mode

The root cause is that PCRE does not look up Unicode character properties by
default, so it does not recognize word boundaries in various non-ASCII scripts.

To make PCRE match those word boundaries, we need PCRE to act in Unicode
mode by using the 'u' regex modifier.  That makes PCRE look up the
character properties in a huge table, which might be a bit slow.
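As a rough illustration of the difference (this is not the attached test; the
helper name and the sample word are made up for the example), the same \b
pattern can be run with and without the 'u' modifier. The word is
deliberately made of plain Devanagari letters only, with no combining vowel
signs, to keep the example simple. On a default setup the first call should
report no match and the second should report a match:

```php
<?php
// Illustrative sketch only: hasWordBoundaryMatch() and the sample word
// are assumptions, not taken from the attached test.

/**
 * Return true when $word is found between \b word boundaries in $text.
 * $modifiers is appended to the pattern, e.g. '' or 'u'.
 */
function hasWordBoundaryMatch( $word, $text, $modifiers ) {
	$pattern = '/\b' . preg_quote( $word, '/' ) . '\b/' . $modifiers;
	return preg_match( $pattern, $text ) === 1;
}

$word = 'कमल';

// Byte mode: each UTF-8 byte of the word is >= 0x80, so with the default
// character tables none of them counts as a \w character and \b finds
// no boundary around the word.
var_dump( hasWordBoundaryMatch( $word, $word, '' ) );

// Unicode mode: the subject is decoded as UTF-8 and the letters are
// classified through the Unicode property table, so \b can match.
var_dump( hasWordBoundaryMatch( $word, $word, 'u' ) );
```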

So that is definitely doable, but we have to look at the performance impact.
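
One way to get a first idea of the cost is a crude micro-benchmark like the
sketch below. The function name, the subject string and the iteration count
are arbitrary choices for illustration, and the numbers it prints will vary
by machine and PCRE version, so it is only a starting point and not a
measurement of MediaWiki itself:

```php
<?php
// Hypothetical micro-benchmark: timeRegex(), the subject and the
// iteration count are illustrative assumptions.

/**
 * Return the wall-clock seconds spent running preg_match() $iterations times.
 */
function timeRegex( $pattern, $subject, $iterations ) {
	$start = microtime( true );
	for ( $i = 0; $i < $iterations; $i++ ) {
		preg_match( $pattern, $subject );
	}
	return microtime( true ) - $start;
}

// Some ASCII filler followed by a Devanagari word to search for.
$subject = str_repeat( 'lorem ipsum ', 200 ) . 'कमल';
$iterations = 10000;

printf( "byte mode:    %.4f s\n", timeRegex( '/\bकमल\b/', $subject, $iterations ) );
printf( "unicode mode: %.4f s\n", timeRegex( '/\bकमल\b/u', $subject, $iterations ) );
```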


The change https://gerrit.wikimedia.org/r/71718 adds a lame test in MediaWiki
core which shows the problem.


$ php phpunit.php --testdox includes/bug46773Test.php 
PHPUnit 3.7.21 by Sebastian Bergmann.

Configuration read from
/Users/amusso/projects/mediawiki/core/tests/phpunit/suite.xml

bug46773
 [ ] Regex boundaries devanagari
 [x] Regex boundaries devanagari in unicode mode
 [x] Media wiki test case parent setup called
$

(An 'x' denotes a passing test.)


Attached is the --tap output of the test.
