DO NOT REPLY [Bug 7806] New: - Non-BMP Unicode block names in regexes

bugzilla Sat, 06 Apr 2002 19:08:00 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7806>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=7806

           Summary: Non-BMP Unicode block names in regexes
           Product: Xerces2-J
           Version: 2.0.0
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: XML Schema datatypes
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


There's a bug in the handling of Unicode block names for blocks outside the BMP
(i.e. with code points > 0xFFFF).  Something like \p{IsGothic} doesn't work as it
should.

The bug is in org.apache.xerces.impl.xpath.regex.Token.  In the declaration of 
blockNames, there's a comment:

         //missing Specials add manually

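But the code never does this. The blockRanges string includes things like
\u10300\u1032F, which is completely bogus: a Java \u escape takes exactly 4 hex
digits, so \u10300 is read as the character U+1030 followed by the digit '0',
not as the code point U+10300.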
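
To make the escape problem concrete, here is a small standalone sketch (the class and field names are made up for illustration and are not part of Xerces). It shows what the compiler actually makes of such a literal, and one way to build the intended code points instead:

```java
class UnicodeEscapeDemo {
    // Intended: two code points, U+10300 and U+1032F.
    // Actual: each backslash-u escape consumes exactly four hex digits,
    // so this literal holds four chars: U+1030, '0', U+1032, 'F'.
    static final String BROKEN = "\u10300\u1032F";

    // The same two code points, built explicitly as surrogate pairs.
    static final String FIXED =
            new String(Character.toChars(0x10300))
            + new String(Character.toChars(0x1032F));

    public static void main(String[] args) {
        // Four separate BMP chars, none of them in the Old Italic block.
        System.out.println(BROKEN.length());
        System.out.println(Integer.toHexString(BROKEN.charAt(0)));

        // Two code points, each stored as two chars.
        System.out.println(FIXED.codePointCount(0, FIXED.length()));
        System.out.println(Integer.toHexString(FIXED.codePointAt(0)));
    }
}
```

Since a Java char is 16 bits, a range table stored as a String of chars cannot hold these values directly, which is why an int table is the simpler fix.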

The fix is to add a table of non-BMP block ranges, stored as ints rather than
chars:

static final int[] nonBmpBlockRanges = { 0x10300, 0x1032F, ... };

Then, in Token.getRange(), call addRange for each of the ranges in 
nonBmpBlockRanges.
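
A rough sketch of the idea follows. It is not a patch against Token: SimpleRange and rangeForPair are hypothetical stand-ins I made up so the example runs on its own, and only two blocks are listed (Old Italic, the range quoted above, and Gothic, U+10330 to U+1034F per the Unicode block table).

```java
import java.util.ArrayList;
import java.util.List;

class NonBmpRangesSketch {
    // (start, end) code point pairs, kept as ints so values above 0xFFFF fit.
    static final int[] nonBmpBlockRanges = {
        0x10300, 0x1032F, // Old Italic
        0x10330, 0x1034F, // Gothic
    };

    // Hypothetical stand-in for a range token; NOT the Xerces class.
    static final class SimpleRange {
        private final List<int[]> ranges = new ArrayList<>();

        void addRange(int start, int end) {
            ranges.add(new int[] { start, end });
        }

        boolean match(int codePoint) {
            for (int[] r : ranges) {
                if (codePoint >= r[0] && codePoint <= r[1]) {
                    return true;
                }
            }
            return false;
        }
    }

    // Builds a range from the pair at pairIndex (0 = Old Italic, 1 = Gothic).
    static SimpleRange rangeForPair(int pairIndex) {
        SimpleRange range = new SimpleRange();
        range.addRange(nonBmpBlockRanges[2 * pairIndex],
                       nonBmpBlockRanges[2 * pairIndex + 1]);
        return range;
    }

    public static void main(String[] args) {
        SimpleRange gothic = rangeForPair(1);
        int ahsa = 0x10330; // GOTHIC LETTER AHSA
        System.out.println(gothic.match(ahsa));    // inside the Gothic block
        System.out.println(gothic.match(0x10300)); // Old Italic, not Gothic
    }
}
```

How the real code maps a block name to its entry in the table is left open here; the sketch just uses the pair index.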
